Existing phishing detection tools often fail to identify current threats because they depend on legacy databases of previously identified fraudulent websites. Web scraping can identify fraudulent websites that capture sensitive user data far more quickly, effectively stopping phishers in their tracks. This is according to Aleksandras Šulženko, Product Owner at Oxylabs.
Millions of new users of varying ages and technical abilities are coming online each year. While many can easily recognise an online scam, fake email address, or fraudulent website, not everyone has the same skills or experience.
Phishing threats are on the rise, having grown sharply in recent years. A recent report from Statista showed that in the first quarter of 2021, 611,877 unique phishing sites were detected, a dramatic rise from 46,824 in 2015. Astonishingly, the Anti-Phishing Working Group (APWG) Phishing Activity Trends Report for the second quarter of 2022 found 1,097,811 observed phishing attacks, the highest number measured in the group’s history.
Aleksandras Šulženko stated: “Businesses may take action to fight phishing by scanning all incoming emails through proxy networks. They then visit all the URLs and scan attachments for any potentially malicious content to ensure that only legitimate emails reach the intended recipients.
“Proxies provide anonymity, allowing cybersecurity professionals to evade detection. However, phishers are aware of that and routinely block IPs suspected of belonging to security companies. To address this issue, datacenter and residential proxies are deployed from varying locations, acting as intermediaries that provide anonymity and bypass geolocation restrictions.”
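The rotation Šulženko describes can be sketched in a few lines. This is a minimal illustration, not Oxylabs' implementation: the proxy endpoints are placeholders, and a real deployment would pull datacenter and residential proxies from a provider's API and handle failures and retries.

```python
import itertools

# Hypothetical pool mixing datacenter and residential endpoints in
# different locations; these URLs are placeholders, not real proxies.
PROXY_POOL = [
    "http://dc-proxy-1.example.com:8080",
    "http://dc-proxy-2.example.com:8080",
    "http://res-proxy-us.example.com:8080",
    "http://res-proxy-de.example.com:8080",
]

def rotating_proxies(pool):
    """Yield proxy endpoints in round-robin order so that no single IP
    is reused often enough to be flagged and blocked by a phisher."""
    yield from itertools.cycle(pool)

# Each suspicious URL to be scanned gets the next proxy in the rotation.
rotation = rotating_proxies(PROXY_POOL)
urls_to_scan = ["https://suspicious-login.example", "https://free-prizes.example"]
assignments = {url: next(rotation) for url in urls_to_scan}
```

Because residential proxies present ordinary consumer IPs, requests routed through them are much harder for a phishing site to distinguish from genuine visitor traffic.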
Recent research has led to advanced web scrapers that evaluate heuristic features of both genuine and fraudulent websites with greater precision. Data collected using these applications are then analysed with a data mining tool to find patterns, report findings, and detect fraudulent websites more accurately.
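To make the idea of heuristic features concrete, the sketch below extracts a few classic phishing indicators from a URL. The specific feature set is illustrative only; it is not the feature set used by any particular scraper mentioned here.

```python
import re
from urllib.parse import urlparse

def url_heuristics(url):
    """Extract simple URL-based heuristics of the kind a phishing
    scraper might rate (illustrative feature set, not a product's)."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        # Legitimate sites almost always use TLS on login pages.
        "uses_https": parsed.scheme == "https",
        # A raw IP address instead of a domain is a common phishing tell.
        "host_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        # An '@' can hide the real destination from casual inspection.
        "has_at_symbol": "@" in url,
        # Deep subdomain nesting often mimics a trusted brand.
        "subdomain_depth": host.count("."),
        "url_length": len(url),
    }

features = url_heuristics("http://192.168.0.1/secure-login@paypal")
```

Feature vectors like this, gathered at scale across many sites, are what the data mining stage then analyses for patterns.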
Šulženko continued: “The phishing website detection framework is based on content-based heuristics generated from training data sets collected from active and previously detected phishing websites. Advanced web crawlers scrape relevant information from websites, and a machine learning model identifies heuristics typical of fraudulent websites. Following analysis, weights are calculated to produce a phishing factor that enables users to determine the probability that a website is illegitimate.

“Scraping produces new data daily in addition to the numerous sources already available. Models can be trained on the labelled data available in public databases. Scraping would then bring the necessary test data to see whether ML models can perform well in real-world scenarios,” concluded Šulženko.
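The weighted phishing factor described in the quote can be sketched as a simple weighted score squashed into a 0–1 range. The weights and bias below are hand-set for illustration; in the framework Šulženko describes, they would be learned from labelled training data.

```python
import math

# Illustrative weights over binary heuristics (assumed values, not
# learned ones); positive weights push the score toward "phishing".
WEIGHTS = {"host_is_ip": 2.0, "has_at_symbol": 1.5, "no_https": 1.0, "long_url": 0.5}
BIAS = -2.0

def phishing_factor(features):
    """Combine binary heuristics into a weighted score, then apply a
    logistic squash to yield a probability-like phishing factor."""
    score = BIAS + sum(WEIGHTS[name] for name, present in features.items() if present)
    return 1.0 / (1.0 + math.exp(-score))

suspicious = {"host_is_ip": True, "has_at_symbol": True, "no_https": True, "long_url": False}
legitimate = {"host_is_ip": False, "has_at_symbol": False, "no_https": False, "long_url": False}
```

A user or automated filter would then compare the resulting factor against a threshold to decide whether to block or flag the site.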