DUOINSPECT - Uncovering Duplicates through Powerful Model of Random Forest and XGBoost Classifier
Chaithanya A1, Dr. T. Vijaya Kumar2
1 Student, Department of MCA, Bangalore Institute of Technology, Karnataka, India
2 Professor, Department of MCA, Bangalore Institute of Technology, Karnataka, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Duplicate question pair detection plays a vital role in improving information retrieval systems and enhancing user experience. In this paper, we present a comprehensive study of duplicate question pair detection using the Quora dataset. We employed two machine learning techniques, the Random Forest and XGBoost classifiers, to develop accurate models for identifying duplicate question pairs.
To improve model performance, we augmented the original dataset with 22 additional features derived from the raw data, aiming to capture more nuanced patterns and increase the models' discriminatory power. With these features, the accuracy of the Random Forest model rose from an initial 73% to 89.4%, and the accuracy of the XGBoost classifier rose from 73.4% to 79.2%.
This paper serves as a reference for researchers and practitioners interested in duplicate question pair detection. The findings highlight the effectiveness of combining Random Forest and XGBoost classifiers with additional features to improve accuracy in this task. The accompanying web application provides a practical and user-friendly tool for real-time duplicate question pair detection, with potential applications in information retrieval systems, chatbots, and question-answering platforms.
Key Words: Question Pair Similarity, XGBoost, Random Forest, Machine Learning.
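
The feature augmentation described in the abstract can be illustrated with a minimal sketch of pairwise feature extraction. The abstract does not list the 22 features, so the features below (lengths, word counts, shared words, first and last word agreement) and the names `tokenize` and `pair_features` are assumptions chosen for illustration only, not the feature set used in this work.

```python
# Illustrative pairwise features for a question pair.
# ASSUMPTION: these are example features, not the paper's 22 features.

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,?!;:\"'()") for word in text.lower().split())
    return [t for t in tokens if t]


def pair_features(q1, q2):
    """Return a dict of simple similarity features for two questions."""
    t1, t2 = tokenize(q1), tokenize(q2)
    s1, s2 = set(t1), set(t2)
    common = s1 & s2
    union = s1 | s2
    both_nonempty = bool(t1) and bool(t2)
    return {
        "q1_len": len(q1),                      # character lengths
        "q2_len": len(q2),
        "q1_words": len(t1),                    # token counts
        "q2_words": len(t2),
        "len_diff": abs(len(t1) - len(t2)),     # difference in token counts
        "common_words": len(common),            # distinct shared tokens
        # Jaccard-style overlap of the two vocabularies
        "word_share": len(common) / len(union) if union else 0.0,
        "first_word_eq": int(both_nonempty and t1[0] == t2[0]),
        "last_word_eq": int(both_nonempty and t1[-1] == t2[-1]),
    }


if __name__ == "__main__":
    feats = pair_features(
        "How do I learn Python quickly?",
        "What is the fastest way to learn Python?",
    )
    for name, value in feats.items():
        print(name, value)
```

Under these assumptions, the feature dictionaries for all question pairs could be stacked into a numeric matrix and passed, together with the duplicate labels, to a classifier such as scikit-learn's `RandomForestClassifier` or XGBoost's `XGBClassifier`.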