Spark Job Execution Time Prediction and Optimization using Machine Learning
Harish Kumar O1, Jayesh2, Goldaselia3, and Ananthan T. V.4
3Assistant Professor, 4Professor
1,2,3,4Department of Computer Science, Faculty of Engineering & Technology,
Dr. M.G.R. Educational and Research Institute, Chennai, India.
1harishkumar72022@gmail.com, 2jayeshchokkalingam@gmail.com,
3tvananthan@drmgrdu.ac.in, 4goldselia@drmgrdu.ac.in
ABSTRACT
The performance of big data frameworks like Apache Spark is heavily influenced by runtime configuration parameters such as executor memory, driver memory, number of cores, and shuffle partitions. While Spark offers flexibility in tuning these parameters, identifying the optimal combination is a complex task, often requiring domain expertise and considerable experimentation. Inefficient configurations can lead to excessive execution time, underutilization of resources, and increased operational costs.
To address this, this project proposes a machine learning-based framework that predicts the execution time of Apache Spark jobs from user-defined configuration settings. Historical job execution data is collected by running automated Spark jobs while systematically varying configuration parameters. Features are engineered from these runs and used to train a Random Forest Regressor that estimates job execution time with high accuracy.
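As an illustration only, a minimal training sketch using scikit-learn's RandomForestRegressor might look like the following; the file name, column names, and hyperparameters are assumptions for this sketch, not the exact schema or settings used in the study:

```python
# Minimal sketch: train a Random Forest regressor on historical Spark run data.
# File path and column names are hypothetical, chosen to mirror the configuration
# parameters mentioned in the abstract (executor memory, driver memory, cores, shuffle partitions).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Historical job runs collected by sweeping Spark configuration parameters.
df = pd.read_csv("spark_runs.csv")  # hypothetical dataset of past executions
feature_cols = ["executor_memory_gb", "driver_memory_gb", "num_cores", "shuffle_partitions"]
X, y = df[feature_cols], df["execution_time_s"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A Random Forest captures non-linear interactions between configuration
# parameters and runtime without requiring feature scaling.
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("MAE (seconds):", mean_absolute_error(y_test, model.predict(X_test)))
```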
A user-friendly web interface is developed to allow users to input their desired Spark configurations. The trained model then provides near-instantaneous execution time predictions, enabling users to make informed decisions before executing resource-intensive jobs. The system not only saves time and computing resources but also democratizes access to performance tuning insights for both novice and experienced Spark users. Additionally, the modular design of the framework makes it adaptable to cloud environments and other big data platforms.
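A minimal sketch of such an interface, assuming the trained model has been serialized with joblib and that the request fields mirror the hypothetical feature names used above (the paper does not prescribe this API), could be:

```python
# Minimal sketch of a prediction endpoint; model file and field names are assumptions.
import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("spark_runtime_model.joblib")  # hypothetical serialized Random Forest

@app.route("/predict", methods=["POST"])
def predict():
    # Expected JSON body, e.g.:
    # {"executor_memory_gb": 4, "driver_memory_gb": 2, "num_cores": 4, "shuffle_partitions": 200}
    cfg = request.get_json()
    features = pd.DataFrame([cfg])
    predicted_seconds = float(model.predict(features)[0])
    return jsonify({"predicted_execution_time_s": predicted_seconds})

if __name__ == "__main__":
    app.run(debug=True)
```

Because the model is loaded once at startup, each request returns a prediction in milliseconds, which is what enables the near-instantaneous feedback described above.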
This project showcases the integration of machine learning with distributed data processing systems, leading to intelligent, automated performance optimization in data-intensive applications.
KEYWORDS
Apache Spark, Execution Time Prediction, Machine Learning, Random Forest, Big Data Optimization, Performance Tuning, Resource Allocation, Spark Configuration, PySpark, Web Interface