Employing Machine Learning Methods to Enhance Medicare Fraud Detection: Resolving Class Imbalance with the Synthetic Minority Over-sampling Technique
Mr. Siddesh K T², Ganesh Maruti Damodar¹
²Assistant Professor, Department of MCA, BIET, Davanagere
¹Student, 4th Semester MCA, Department of MCA, BIET, Davanagere
ABSTRACT
Detecting healthcare fraud is a complex and continually evolving challenge, particularly due to the difficulties posed by imbalanced datasets. Traditional machine learning (ML) approaches have been widely explored in past research but often struggle with data imbalance. Techniques such as Random Oversampling (ROS) can lead to overfitting, SMOTE (Synthetic Minority Oversampling Technique) may introduce noise, and Random Undersampling (RUS) can result in the loss of critical information. To address these limitations, it is essential to enhance model accuracy through advanced resampling methods and improved evaluation metrics. This study introduces an innovative strategy for addressing data imbalance in healthcare fraud detection, focusing on the Medicare Part B dataset. Initially, the categorical feature "Provider Type" is extracted and used to increase minority-class variety by replicating existing entries. Following this, a hybrid technique known as SMOTE-ENN—combining SMOTE with Edited Nearest Neighbors (ENN)—is applied. This approach not only generates synthetic samples but also filters out noisy data, producing a more balanced and cleaner dataset. We evaluate six different ML models using standard metrics such as accuracy, precision, recall, F1-score, and AUC-ROC, with additional emphasis on the Area Under the Precision-Recall Curve (AUPRC) due to its effectiveness in imbalanced settings. Experimental results demonstrate that the Decision Tree classifier outperforms all others, achieving a 0.99 score across all evaluation metrics.
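The two halves of the SMOTE-ENN hybrid described above can be sketched in plain NumPy. This is an illustrative toy implementation, not the paper's pipeline: the function names `smote` and `enn_filter`, the neighbour counts, and the toy two-cluster data are all assumptions for demonstration (in practice one would typically use `imblearn.combine.SMOTEENN`).

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """SMOTE: synthesize minority samples by interpolating between
    each seed point and one of its k nearest minority neighbours."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(X_min)
    dist = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]      # k nearest neighbours per point
    base = rng.integers(0, n, n_synthetic)    # random seed points
    neigh = nn[base, rng.integers(0, k, n_synthetic)]
    gap = rng.random((n_synthetic, 1))        # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

def enn_filter(X, y, k=3):
    """ENN: drop any sample whose k nearest neighbours mostly
    disagree with its own label (removes noisy/boundary points)."""
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]
    agree = (y[nn] == y[:, None]).sum(axis=1)
    keep = agree * 2 > k                      # keep only if a strict majority agrees
    return X[keep], y[keep]

# Toy imbalanced data: 200 majority points vs 20 minority points.
rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, (200, 2))
X_min = rng.normal(3.0, 1.0, (20, 2))
X_syn = smote(X_min, 180, rng=rng)            # oversample minority up to parity
X = np.vstack([X_maj, X_min, X_syn])
y = np.array([0] * 200 + [1] * 200)
X_clean, y_clean = enn_filter(X, y)           # then clean away noisy samples
```

Because each synthetic point lies on a segment between two minority samples, SMOTE stays inside the minority region, while the ENN pass afterwards removes samples (original or synthetic) that land in the wrong class's neighbourhood.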
Keywords: Healthcare Fraud Detection, Imbalanced Data, Medicare Part B, SMOTE-ENN, Machine Learning, Data Resampling, Decision Tree, AUPRC, Classification Models, Synthetic Oversampling, Noise Reduction, Evaluation Metrics
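The AUPRC metric emphasized above can be computed with the standard step-wise average-precision formula. A minimal sketch, assuming binary labels and real-valued scores (the function name `average_precision` is illustrative):

```python
import numpy as np

def average_precision(y_true, scores):
    """AUPRC via average precision: the mean of the precision values
    observed at each rank where a true positive appears, with
    candidates ranked by decreasing score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                          # true positives seen at each rank
    precision = tp / np.arange(1, len(y) + 1)  # precision at each rank
    return precision[y == 1].sum() / y.sum()   # average over positive ranks
```

For distinct scores this agrees with scikit-learn's `average_precision_score`. Unlike accuracy or AUC-ROC, the value is unaffected by the large pool of true negatives, which is why AUPRC is preferred for heavily imbalanced fraud data.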