Accelerating Small Language Models via Quantization: A GPT-4-Guided Approach for Low-Resource Story Completion
Rakshit Dabral*,
rakshitdabral1@gmail.com
 
Dr. Archana Kumar**
Professor and HOD (AI&DS)
archna.kumar@adgitmdelhi.ac.in
* Scholar, B.Tech (AI&DS), 4th Year
** Department of Artificial Intelligence and Data Science
Dr. Akhilesh Das Gupta Institute of Professional Studies, New Delhi
 
 
Abstract- This paper introduces Story Completer, an efficient engine for real-time children's story completion that extends the foundational work of the TinyStories project. While TinyStories demonstrated that small models (<10M parameters) can generate coherent narratives when trained on simplified data, our work addresses the challenge of deploying high-quality, context-aware generative models on resource-constrained hardware. The core innovation is a hybrid architecture that combines the rich semantic knowledge of a large language model with the computational efficiency of a small one: pre-computed GPT-4 text-embedding-ada-002 vectors are integrated into a compact, 12-million-parameter decoder-only transformer, effectively distilling the contextual understanding of a massive model into a lightweight system. The custom model was trained from scratch with a training strategy designed to adapt it specifically to storytelling.
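To make the conditioning mechanism concrete, the sketch below shows one standard way to inject a pre-computed prompt embedding into a small decoder-only transformer: the 1536-dimensional ada-002 vector is linearly projected into the decoder's hidden size and prepended as a context token. This is a minimal PyTorch illustration under assumed hyperparameters (vocab_size, d_model, layer and head counts are placeholders), not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class StoryDecoder(nn.Module):
        # Compact decoder-only LM conditioned on a frozen, pre-computed
        # text-embedding-ada-002 vector (hyperparameters are illustrative).
        def __init__(self, vocab_size=8192, d_model=384, n_layers=6,
                     n_heads=6, ada_dim=1536, max_len=512):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(max_len, d_model)
            # Distillation hook: project the large-model embedding into
            # the decoder's hidden space; it enters as one context token.
            self.proj = nn.Linear(ada_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=4 * d_model,
                                               batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, tokens, ada_vec):
            # tokens: (B, T) int64 ids; ada_vec: (B, 1536) float embedding
            B, T = tokens.shape
            pos = torch.arange(T, device=tokens.device)
            x = self.tok_emb(tokens) + self.pos_emb(pos)
            ctx = self.proj(ada_vec).unsqueeze(1)        # (B, 1, d_model)
            x = torch.cat([ctx, x], dim=1)               # prepend context token
            mask = nn.Transformer.generate_square_subsequent_mask(T + 1)
            x = self.blocks(x, mask=mask.to(x.device))   # causal self-attention
            return self.lm_head(x[:, 1:])                # next-token logits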
A key contribution of this work is the optimization of the trained model for practical deployment on consumer-grade hardware, including low-end PCs and CPU-only machines. After initial training, we applied post-training quantization, converting the model's weights from 16-bit floating-point precision (FP16) to 8-bit unsigned integers (uint8). This optimization yielded substantial performance gains without a noticeable degradation in narrative quality.
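The abstract does not specify the exact quantization scheme; the following NumPy sketch shows one standard choice, per-tensor asymmetric affine quantization, which maps an FP16 weight tensor onto the full uint8 range and recovers an approximation at inference time.

    import numpy as np

    def quantize_uint8(w_fp16):
        # Map the FP16 weight range [min, max] onto uint8 [0, 255].
        w = w_fp16.astype(np.float32)
        w_min, w_max = w.min(), w.max()
        scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
        zero_point = np.round(-w_min / scale).astype(np.uint8)
        q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        # Recover an FP32 approximation of the original weights.
        return (q.astype(np.float32) - zero_point) * scale

    # Example: quantize one weight matrix and check reconstruction error.
    w = np.random.randn(256, 256).astype(np.float16)
    q, s, z = quantize_uint8(w)
    w_hat = dequantize(q, s, z)
    print("max abs error:", np.abs(w.astype(np.float32) - w_hat).max())

Storing one uint8 per weight plus a scale and zero point per tensor halves the memory footprint relative to FP16, which is consistent with the roughly 50% size reduction reported below.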
Comparative analysis of the baseline and quantized models demonstrates the effectiveness of this approach. Quantization reduced the model size by 49.5%, from 1546.04 MB to 780.03 MB, and cut average inference time by 55.4%, from 21.701 seconds to 9.671 seconds (a 2.24x speedup). These results support an efficient paradigm for model design in which the distilled intelligence of larger models, combined with optimization techniques such as quantization, yields smaller, faster, and highly capable specialized systems.
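For the reader's convenience, the reported percentages follow directly from the measured values:

    # Sanity check of the reported figures (values from the paper).
    size_fp16_mb, size_uint8_mb = 1546.04, 780.03
    t_fp16_s, t_uint8_s = 21.701, 9.671
    print(f"size reduction: {(1 - size_uint8_mb / size_fp16_mb) * 100:.1f}%")  # 49.5%
    print(f"time reduction: {(1 - t_uint8_s / t_fp16_s) * 100:.1f}%")          # 55.4%
    print(f"speedup factor: {t_fp16_s / t_uint8_s:.2f}x")                      # 2.24x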
 
Index Terms- GPT-4 Embeddings, Model Compression, Quantization, Real-Time Text Generation, Small Language Model