Accelerating Small Language Models via Quantization: A GPT-4-Guided Approach for Low-Resource Story Completion
Rakshit Dabral*,
rakshitdabral1@gmail.com
 
Dr. Archana Kumar**
Professor and HOD (AI&DS)
archna.kumar@adgitmdelhi.ac.in
* Scholar, B.Tech (AI&DS), 4th Year
** Department of Artificial Intelligence and Data Science
Dr. Akhilesh Das Gupta Institute of Professional Studies, New Delhi
 
 
Abstract- This paper introduces Story Completer, an efficient engine for real-time children's story completion that extends the foundational work of the TinyStories project. While TinyStories demonstrated that small models (<10M parameters) can generate coherent narratives when trained on simplified data, our work addresses the challenge of deploying high-quality, context-aware generative models on resource-constrained hardware. The core innovation is a hybrid architecture that combines the rich semantic knowledge of a large language model with the computational efficiency of a small one: pre-computed GPT-4 text-embedding-ada-002 vectors are integrated into a compact, 12-million-parameter decoder-only transformer, effectively distilling the contextual understanding of a massive model into a lightweight system. The custom model was trained from scratch with a training strategy designed to adapt it specifically to storytelling.
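To make the conditioning mechanism concrete, the sketch below shows one standard way to inject a pre-computed prompt embedding into a small decoder-only transformer: the 1536-dimensional ada-002 vector is linearly projected into the decoder's hidden size and prepended as a context token. This is a minimal PyTorch illustration under assumed hyperparameters (vocab_size, d_model, layer and head counts are placeholders), not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class StoryDecoder(nn.Module):
        # Compact decoder-only LM conditioned on a frozen, pre-computed
        # text-embedding-ada-002 vector (hyperparameters are illustrative).
        def __init__(self, vocab_size=8192, d_model=384, n_layers=6,
                     n_heads=6, ada_dim=1536, max_len=512):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(max_len, d_model)
            # Distillation hook: project the large-model embedding into
            # the decoder's hidden space; it enters as one context token.
            self.proj = nn.Linear(ada_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=4 * d_model,
                                               batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, tokens, ada_vec):
            # tokens: (B, T) int64 ids; ada_vec: (B, 1536) float embedding
            B, T = tokens.shape
            pos = torch.arange(T, device=tokens.device)
            x = self.tok_emb(tokens) + self.pos_emb(pos)
            ctx = self.proj(ada_vec).unsqueeze(1)        # (B, 1, d_model)
            x = torch.cat([ctx, x], dim=1)               # prepend context token
            mask = nn.Transformer.generate_square_subsequent_mask(T + 1)
            x = self.blocks(x, mask=mask.to(x.device))   # causal self-attention
            return self.lm_head(x[:, 1:])                # next-token logits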
A key contribution of this work is the optimization of the trained model for practical deployment on consumer-grade hardware, including low-end PCs and CPU-only machines. After initial training, we applied post-training quantization, converting the model's weights from 16-bit floating-point precision (FP16) to 8-bit unsigned integers (uint8). This optimization yielded substantial performance gains without a noticeable degradation in narrative quality.
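The abstract does not specify the exact quantization scheme; the following NumPy sketch shows one standard choice, per-tensor asymmetric affine quantization, which maps an FP16 weight tensor onto the full uint8 range and recovers an approximation at inference time.

    import numpy as np

    def quantize_uint8(w_fp16):
        # Map the FP16 weight range [min, max] onto uint8 [0, 255].
        w = w_fp16.astype(np.float32)
        w_min, w_max = w.min(), w.max()
        scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
        zero_point = np.round(-w_min / scale).astype(np.uint8)
        q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        # Recover an FP32 approximation of the original weights.
        return (q.astype(np.float32) - zero_point) * scale

    # Example: quantize one weight matrix and check reconstruction error.
    w = np.random.randn(256, 256).astype(np.float16)
    q, s, z = quantize_uint8(w)
    w_hat = dequantize(q, s, z)
    print("max abs error:", np.abs(w.astype(np.float32) - w_hat).max())

Storing one uint8 per weight plus a scale and zero point per tensor halves the memory footprint relative to FP16, which is consistent with the roughly 50% size reduction reported below.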
Comparative analysis of the baseline and quantized models demonstrates the effectiveness of this approach. Quantization reduced the model size by 49.5%, from 1546.04 MB to 780.03 MB, and cut average inference time by 55.4%, from 21.701 seconds to 9.671 seconds (a 2.24x speedup). These results support an efficient paradigm for model design in which the distilled intelligence of larger models, combined with optimization techniques such as quantization, yields smaller, faster, and highly capable specialized systems.
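For the reader's convenience, the reported percentages follow directly from the measured values:

    # Sanity check of the reported figures (values from the paper).
    size_fp16_mb, size_uint8_mb = 1546.04, 780.03
    t_fp16_s, t_uint8_s = 21.701, 9.671
    print(f"size reduction: {(1 - size_uint8_mb / size_fp16_mb) * 100:.1f}%")  # 49.5%
    print(f"time reduction: {(1 - t_uint8_s / t_fp16_s) * 100:.1f}%")          # 55.4%
    print(f"speedup factor: {t_fp16_s / t_uint8_s:.2f}x")                      # 2.24x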
 
Index Terms- GPT-4 Embeddings, Model Compression, Quantization, Real-Time Text Generation, Small Language Model