- Create Date 23/04/2026
- Last Updated 23/04/2026
Health Insurance Premium Prediction System Using Machine Learning
Soham Khadse
Yash Wahane
Sahil Gangane
Dept. of Computer Science and Engineering
Jhulelal Institute of Technology
Nagpur, India
sohankhadse532@gmail.com
yashwahane101@gmail.com
sahilgangane@gmail.com
Krish Durugkar
Dept. of Computer Science and Engineering
Jhulelal Institute of Technology
Nagpur, India
durugkarkrish@gmail.com
Samir Sheikh
Dept. of Computer Science and Engineering
Jhulelal Institute of Technology
Nagpur, India
samirsheikh@gmail.com
Prof. Rahul Bambodkar
Dept. of Computer Science and Engineering
Jhulelal Institute of Technology
Nagpur, India
r.bambodkar@jitnagpur.edu.in
Abstract—Health insurance premium pricing remains one of the most complex and consequential challenges in the global healthcare and financial services sectors. Premiums directly determine the affordability and accessibility of health coverage for individuals, families, and enterprises, while simultaneously dictating the financial viability and risk exposure of insurance providers. Despite its critical importance, the conventional process of premium determination relies heavily on rule-based actuarial tables and manual underwriting protocols that are rigid, opaque, and often inadequate in capturing the multidimensional nature of individual health risk. This paper presents a comprehensive machine learning-based Health Insurance Premium Prediction System that integrates demographic attributes, lifestyle indicators, geographic factors, and medical history variables to estimate insurance premiums in an accurate, transparent, and personalized manner. The proposed system trains and rigorously compares four supervised regression algorithms—Linear Regression, Decision Tree Regression, Random Forest Regression, and XGBoost Regression—on a real-world structured healthcare dataset of 1,338 records sourced from the Kaggle Medical Cost Personal Dataset.
Comprehensive preprocessing including missing value treatment, feature encoding, normalization, and feature [...] Mean Absolute Error (MAE) of 1,978 USD, and Root Mean Square Error (RMSE) of 3,312 USD on the held-out test set. SHAP (SHapley Additive exPlanations) value analysis is employed to interpret model predictions and quantify individual feature contributions, confirming that smoking status, age, BMI, and number of dependents are the dominant risk factors. Beyond prediction, the system incorporates a three-tier risk classification engine (Low, Moderate, High Risk) and is deployed as an interactive web application accessible to policyholders, insurance agents, and healthcare organizations. Future directions include integration of real-time wearable health data, federated learning for privacy-preserving distributed training, and deep learning architectures for longitudinal risk modelling.
Keywords—health insurance premium prediction, machine learning, supervised regression, Random Forest, XGBoost, SHAP explainability, risk categorization, actuarial pricing, healthcare analytics, feature engineering
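For readers unfamiliar with the two error metrics quoted in the abstract, the following is a minimal sketch of how MAE and RMSE are computed from actual and predicted premiums. The function names and the sample values are illustrative assumptions, not taken from the paper's code or dataset.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual, predicted):
    """Square root of the average squared difference."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_squared = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_squared)


# Made-up premiums in USD, for illustration only (not the paper's data).
actual_premiums = [1000.0, 2000.0, 3000.0, 4000.0]
predicted_premiums = [1100.0, 1900.0, 3300.0, 4000.0]

mae = mean_absolute_error(actual_premiums, predicted_premiums)
rmse = root_mean_square_error(actual_premiums, predicted_premiums)
print(f"MAE:  {mae:.2f} USD")
print(f"RMSE: {rmse:.2f} USD")
```

Because RMSE squares each error before averaging, it weights large misses more heavily and is never smaller than MAE on the same predictions, which is consistent with the reported RMSE exceeding the reported MAE.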






