Lightweight Phishing URL Detection Using Hybrid Lexical–Metadata Features: A Machine Learning Approach





Create Date 12/12/2025
Last Updated 12/12/2025


Ayan Chaudhuri

Vellore Institute of Technology, Vellore

Abstract— Phishing websites are among the most significant forms of cyber attack: they exploit human users to acquire sensitive information (e.g., passwords, banking details, or personal identifiers). Several existing techniques detect phishing links through extensive content-based analysis, but such techniques, deep learning included, require substantial computational resources, which slows their deployment for detecting phishing URLs in real-world environments. This paper presents a lightweight and efficient machine learning model that detects phishing URLs using a small set of lexical and metadata features of the URL string, without requiring access to or rendering of the actual web page.
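To make the idea of URL-only features concrete, the following is a minimal Python sketch of how lexical features might be computed from the URL string alone, with no network access. The feature names, the `extract_features` helper, and the choice of features are illustrative assumptions; the abstract does not list the paper's actual feature set.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches a dotted-quad host such as 192.168.0.1 (no range check on octets).
_IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Hypothetical feature set computed from the URL string only.

    These features are assumptions chosen for illustration, not the
    feature list used in the paper.
    """
    # urlsplit needs a scheme to locate the host, so add one if missing.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "num_digits": sum(c.isdigit() for c in url),
        "has_at_symbol": int("@" in url),
        "host_is_ip": int(bool(_IPV4.match(host))),
        "uses_https": int(parts.scheme == "https"),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "entropy": shannon_entropy(url),
    }


if __name__ == "__main__":
    for name, value in extract_features("http://192.168.0.1/login/verify.php?id=1").items():
        print(f"{name}: {value}")
```

Because every feature is derived from string operations, extraction cost grows only with URL length, which is the property that makes this kind of approach attractive for real-time use.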

The dataset was constructed by combining verified phishing URLs from OpenPhish (a phishing intelligence feed) with legitimate domains from Tranco (a ranking of popular websites). Three machine learning algorithms were compared: Decision Tree, Random Forest, and Logistic Regression. Logistic Regression performed best on the evaluation criteria, with an overall accuracy of 0.991, an area under the receiver operating characteristic (ROC) curve of 0.966, and an F1 score of 0.865. It also achieved a favorable precision/recall tradeoff and good computational performance, and its results remain open to human interpretation. Because logistic regression is simple, efficient, and robust, and identified fraudulent URLs reliably, it is well suited to real-time security applications such as a web browser plug-in, email filter, or endpoint solution.
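For readers less familiar with the evaluation criteria named above, the sketch below shows how accuracy, precision, recall, F1, and ROC AUC can be computed from labels and classifier outputs using only the Python standard library. The function names and the toy data are assumptions for illustration; they are not the paper's evaluation code or results. The toy example also shows that accuracy and F1 can diverge when one class is rare, although the abstract does not state the class balance of the dataset.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = phishing)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This O(n^2) form is meant for
    illustration, not for large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy, made-up data: 2 phishing URLs among 10, one of them missed.
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    print(classification_metrics(y_true, y_pred))
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In the toy data, accuracy is 0.9 while F1 is about 0.67, because the single missed phishing URL halves recall even though nine of ten predictions are correct.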

Keywords— Phishing detection, URL classification, lexical features, metadata features, machine learning, cybersecurity.
