Lightweight Phishing URL Detection Using Hybrid Lexical–Metadata Features: A Machine Learning Approach
Ayan Chaudhuri
Vellore Institute of Technology, Vellore
Abstract— Phishing websites are among the most significant forms of cyber attack: they deceive users into disclosing sensitive information such as passwords, banking details, or personal identifiers. Many existing detection techniques identify phishing links through extensive content-based analysis, but such techniques, deep learning in particular, demand substantial computational resources, which hinders their deployment for phishing URL detection in real-world environments. This paper presents a lightweight and efficient machine learning model that detects phishing URLs using a small set of lexical and metadata features of the URL string, without accessing or rendering the web page itself.
The dataset was constructed by combining verified phishing links from OpenPhish, a phishing intelligence feed, with legitimate domains from the Tranco top-sites ranking. Three machine learning algorithms were compared: Decision Tree, Random Forest, and Logistic Regression. Logistic Regression performed best on the evaluation criteria, with an accuracy of 0.991, an area under the receiver operating characteristic (ROC) curve of 0.966, and an F1 score of 0.865. It also achieved a favorable precision/recall trade-off at low computational cost, and its results remain straightforward for a human analyst to interpret. Because it is simple, efficient, and robust while still identifying phishing URLs reliably, the model is well suited to real-time security applications such as web browser plug-ins, email filters, and endpoint protection tools.
Keywords— Phishing detection, URL classification, lexical features, metadata features, machine learning, cybersecurity.
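As a rough illustration of what URL-string-only feature extraction can look like, the following Python sketch computes a handful of lexical features without fetching the page. The feature names and the choice of features (`url_length`, `host_entropy`, and so on) are assumptions made for illustration; the abstract does not enumerate the paper's actual feature set, and the metadata features are not covered here.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(url: str) -> dict:
    """Derive illustrative lexical features from the URL string alone.

    The feature set is hypothetical, not the one used in the paper.
    """
    # Prepend a scheme when missing so urlsplit can locate the host.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parts.scheme == "https"),
        # Crude check for a dotted, digits-only host such as an IPv4 literal.
        "host_is_ip": int(host.replace(".", "").isdigit()),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "host_entropy": shannon_entropy(host),
    }


if __name__ == "__main__":
    for name, value in lexical_features("http://192.168.0.1/login/verify?id=1").items():
        print(f"{name}: {value}")
```

Feature dictionaries of this kind could be converted to numeric vectors and passed to a linear classifier such as logistic regression; because every value comes from string operations only, extraction stays cheap enough for real-time use.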