Evaluation of Machine Learning Techniques for Identifying Phishing Emails: A Case Study with the SpamAssassin Dataset
Samuel Twum1, Richard Sarpong2, Abraham Kwame Adomako3, Alpha Agusah4, Burma Poornima5 and Kadiatou Diallo6
1Department of Computer Application, Lovely Professional University, Punjab-India
samuel.12419294@lpu.in
2Department of Computer Application, Lovely Professional University, Punjab-India
sarpongrichard32@gmail.com
3Department of Computer Application, Lovely Professional University, Punjab-India
abrahamkadomako84@gmail.com
4Department of Computer Application, Lovely Professional University, Punjab-India
iamalphaagusah@gmail.com
5Department of Computer Application, Lovely Professional University, Punjab-India
poornimaburma@gmail.com
6Department of Computer Application, Lovely Professional University, Punjab-India
kadia.44ibmah@gmail.com
Abstract
Phishing attacks are a leading cybersecurity threat, most commonly carried out through deceptive emails that trick users into revealing sensitive information. Conventional spam filters based on heuristics and blacklists struggle to keep up with the evolving tactics of cybercriminals. This study compares four supervised machine learning classifiers (Random Forest, Logistic Regression, Naive Bayes, and XGBoost) on their ability to identify phishing emails in the SpamAssassin dataset. The dataset is preprocessed with text normalization and TF-IDF vectorization. Each model is then evaluated using metrics including accuracy, precision, recall, and F1-score, and word clouds and ROC curves are used for visualization. A voting classifier is also used to explore ensemble learning. The findings show that ensemble techniques and advanced models such as XGBoost deliver robust performance suitable for real-world phishing detection systems.
Keywords: TF-IDF Vectorization, ROC Curve, Natural Language Processing (NLP), Ensemble Learning, SpamAssassin Dataset.
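To make the steps named in the abstract concrete, the following is a minimal, dependency-free Python sketch of text normalization, TF-IDF weighting, hard voting, and the four evaluation metrics. It is an illustration under stated assumptions, not the implementation used in this study: the function names (normalize, tfidf, majority_vote, scores) are hypothetical, the smoothed IDF formula is one common variant among several, the toy emails and model outputs are invented, and classifier training is not shown.

```python
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase the text and keep only alphabetic tokens (assumed normalization)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumes tf = count / document length and the smoothed
    idf = ln((1 + N) / (1 + df)) + 1; other variants exist.
    """
    tokenized = [normalize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (c / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, c in counts.items()
        })
    return vectors


def majority_vote(predictions):
    """Hard voting over a list of per-model label lists.

    An odd number of binary models avoids ties.
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]


def scores(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data, invented for illustration only (1 = phishing, 0 = legitimate).
emails = ["Verify your account NOW!", "Meeting notes for your team"]
vectors = tfidf(emails)

model_outputs = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]  # three hypothetical models
ensemble = majority_vote(model_outputs)
result = scores([1, 0, 1, 0], ensemble)
```

In this toy example, "your" occurs in both emails and therefore receives a lower weight than "verify", which occurs in only one. In practice, a library such as scikit-learn would typically supply the vectorizer, the classifiers, and the voting ensemble.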