Detection of Data Manipulation in Datasets Using Machine Learning
RIMSHA ARFEEN1, RAJKUMAR KENCHA2, SRI LATHA ARIGE3,
MOHAMMAD ABDUL RASHEED4, P. BALAKISHAN5
1,4 UG STUDENT, CSE Department & Jyothishmathi Institute of Technology and Science
5 ASSOCIATE PROFESSOR, CSE Department & Jyothishmathi Institute of Technology and Science
Abstract - Data integrity is pivotal for model performance and accuracy, and for making credible decisions in today's data science and analytics environment. This project implemented and evaluated a machine learning-driven framework that detects data tampering through a comparative analysis of a structured dataset in its original and modified states. By transforming both datasets to a matching structure and computing a feature-wise difference vector, the system identified possible tampering in the modified dataset through statistical analysis of selected features, such as the Interquartile Range (IQR), entropy analysis, and the Local Outlier Factor (LOF). These derived features were then fed into a Random Forest classifier that accurately labelled each record as either tampered or not tampered. The resulting system showed significant promise in capturing anomalies such as outliers, null inserts, mismatched types, and subtle shifts in value. The results indicated high precision and recall across a range of manipulated datasets. Through successive experimentation, the system has shown promise for data validation and inspection and for extension into forensic auditing systems. The solution is modular and scalable, which gives the added benefit of sound data integrity in critical domains such as finance, healthcare, and defense.
Key Words: Data manipulation detection, data quality, anomaly detection, Interquartile Range (IQR), Local Outlier Factor (LOF), entropy analysis, skewness imputation, Shannon entropy, outlier detection, Random Forest, supervised classification, feature engineering, descriptive feature extraction, difference vectors, machine learning pipeline, data validation, data forensics, structured data comparison, ETL validation, automated dataset checking, classification accuracy.
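To make the detection pipeline concrete, the following Python sketch (using pandas and scikit-learn) illustrates the main steps described in the abstract: aligning the original and modified datasets, computing a feature-wise difference vector, deriving IQR-based outlier flags, null-insert counts, and Local Outlier Factor scores, and training a Random Forest classifier. The dataset, column names, and tampering labels below are synthetic assumptions introduced purely for illustration, and the entropy descriptor mentioned in the abstract is omitted for brevity; this is a minimal sketch under those assumptions, not the authors' exact implementation.

# Minimal sketch of the tampering-detection pipeline described in the abstract.
# Column names, synthetic data, and labels are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import classification_report

rng = np.random.default_rng(42)

# Synthetic stand-ins for the "original" and "modified" datasets.
n, cols = 1000, ["amount", "age", "score"]
original = pd.DataFrame(rng.normal(loc=[100, 40, 0.5], scale=[20, 10, 0.1],
                                   size=(n, 3)), columns=cols)
modified = original.copy()

# Tamper with a random 10% of the records (value shifts and null inserts).
tampered_idx = rng.choice(n, size=n // 10, replace=False)
modified.loc[tampered_idx, "amount"] *= rng.uniform(1.5, 3.0, size=len(tampered_idx))
modified.loc[tampered_idx[: len(tampered_idx) // 4], "score"] = np.nan
labels = np.zeros(n, dtype=int)
labels[tampered_idx] = 1

# Step 1: align structure and compute a feature-wise difference vector.
modified = modified[original.columns]
diff = (modified.fillna(0) - original.fillna(0)).abs()
diff.columns = [f"diff_{c}" for c in cols]

# Step 2: statistical descriptors per record.
# IQR rule: count how many of a record's differences fall outside 1.5 * IQR.
q1, q3 = diff.quantile(0.25), diff.quantile(0.75)
iqr = q3 - q1
iqr_flags = ((diff < (q1 - 1.5 * iqr)) | (diff > (q3 + 1.5 * iqr))).sum(axis=1)

# Null-insert indicator: values missing in the modified copy but not the original.
null_inserts = (modified.isna() & ~original.isna()).sum(axis=1)

# LOF: density-based outlier score computed on the difference vectors.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(diff.values)
lof_score = -lof.negative_outlier_factor_  # larger means more anomalous

features = diff.assign(iqr_flags=iqr_flags,
                       null_inserts=null_inserts,
                       lof_score=lof_score)

# Step 3: supervised classification with a Random Forest.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["clean", "tampered"]))

In this sketch the per-record descriptors are concatenated with the raw difference vector before classification; any comparable record-level feature set derived from the original-versus-modified comparison could be substituted without changing the overall structure of the pipeline.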