Evaluation of Machine Learning Techniques for Identifying Phishing Emails: A Case Study with the Spam Assassin Dataset

Create Date 19/04/2025
Last Updated 19/04/2025


Samuel Twum¹, Richard Sarpong², Abraham Kwame Adomako³, Alpha Agusah⁴, Burma Poornima⁵, and Kadiatou Diallo⁶

¹Department of Computer Application, Lovely Professional University, Punjab, India
samuel.12419294@lpu.in

²Department of Computer Application, Lovely Professional University, Punjab, India
sarpongrichard32@gmail.com

³Department of Computer Application, Lovely Professional University, Punjab, India
abrahamkadomako84@gmail.com

⁴Department of Computer Application, Lovely Professional University, Punjab, India
iamalphaagusah@gmail.com

⁵Department of Computer Application, Lovely Professional University, Punjab, India
poornimaburma@gmail.com

⁶Department of Computer Application, Lovely Professional University, Punjab, India
kadia.44ibmah@gmail.com

Abstract

Phishing attacks are a leading cybersecurity threat, most commonly carried out through deceptive emails designed to steal sensitive user information. Conventional spam filters based on heuristics and blacklists struggle to keep up with the evolving tactics of cybercriminals. This study compares four supervised machine learning classifiers (Random Forest, Logistic Regression, Naive Bayes, and XGBoost) on their ability to identify phishing emails in the SpamAssassin dataset. The dataset is preprocessed with text normalization and TF-IDF vectorization. Each model is then evaluated on accuracy, precision, recall, and F1-score, with visualizations such as word clouds and ROC curves supporting the analysis. A voting classifier is also used to explore ensemble learning. The findings show that ensemble techniques and advanced models such as XGBoost provide robust performance suitable for real-world phishing detection systems.

Keywords: TF-IDF Vectorization, ROC Curve, Natural Language Processing (NLP), Ensemble Learning, Spam Assassin Dataset.
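
As a rough illustration of the preprocessing and evaluation steps named in the abstract, the following Python sketch normalizes text, computes TF-IDF weights, and derives accuracy, precision, recall, and F1-score from predicted labels. It is not the authors' code: the function names, the normalization rule (lowercasing and dropping non-letters), the smoothed IDF variant, and the toy emails and labels are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace non-letters with spaces, and split into tokens.
    This normalization rule is an assumption, not the paper's exact procedure."""
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()


def tfidf(docs):
    """Return one {term: weight} dict per document.
    Weight = relative term frequency * smoothed IDF, where
    IDF = ln((1 + n_docs) / (1 + doc_freq)) + 1 (an assumed variant)."""
    tokenized = [normalize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task,
    treating `positive` as the phishing class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy emails and labels, invented purely for illustration.
    emails = ["Verify your account now!", "Meeting notes for your team"]
    for vector in tfidf(emails):
        print(vector)
    # Hypothetical ground truth and predictions (1 = phishing, 0 = legitimate).
    print(binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1]))
```

In this toy run, a term that appears in every email (such as "your") receives a lower weight than one unique to a single email, which is the property that makes TF-IDF features useful to the classifiers being compared.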
