Intelligent Semantic Retrieval Generation Web Crawling Chunking (RAG) System for AI
Mrs. Darshika Kelin AR Assistant Professor
Dept. of DSCS, Karunya University,
Coimbatore
darshika@karunya.edu
Dhanush Kumar.LS
U.G Scholar,
Dept. of DSCS, Karunya University,
Coimbatore
dhanushkumar22@karunya.edu.in
Abstract
This paper presents the development of a complete, scalable Retrieval-Augmented Generation (RAG) system, with particular emphasis on a reliable pipeline for data ingestion and processing. The proposed method systematically addresses the need of large language models (LLMs) for current, external information in order to reduce factual errors, or “hallucinations.” The system begins with an automated web crawling procedure that collects unstructured data from selected online sources. This raw data is then carefully cleaned to remove noise and yield high-quality, domain-specific text. The core of the pipeline is a multi-phase data preparation technique. First, text chunking methods, such as token-based and sentence-based segmentation, break large documents into coherent, self-contained segments. Next, a state-of-the-art embedding technique converts these chunks into numerical vectors that capture their semantic meaning. Storing the resulting embeddings in a dedicated vector store such as Pinecone, FAISS, or ChromaDB enables efficient similarity-based retrieval. In the final stage, the most relevant chunks are dynamically retrieved from this vector store in response to the user’s query and supplied to a generative AI model, e.g., GPT. This Retrieval-Augmented Generation technique ensures that the model’s output is both contextually relevant and directly verifiable against the source documents. The study provides a complete and replicable RAG framework that offers an affordable and safe way to produce accurate, source-grounded text, suitable for applications ranging from information management to conversational AI.
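As an illustration of the chunk, embed, and retrieve stages summarized above, the following minimal Python sketch may be helpful. It is not the system's implementation: the function names (sentence_chunks, embed, cosine, retrieve) are hypothetical, whitespace-separated words are assumed as a rough proxy for tokens, a toy bag-of-words vector stands in for a learned embedding model, and an in-memory list stands in for a vector store such as Pinecone, FAISS, or ChromaDB.

```python
import math
import re
from collections import Counter


def sentence_chunks(text, max_tokens=40):
    """Group whole sentences into chunks of at most max_tokens words.

    A single sentence longer than max_tokens is kept whole (assumption).
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, index, k=2):
    """Return the k chunks whose vectors are most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


if __name__ == "__main__":
    document = (
        "Web crawlers collect pages from selected sites. "
        "Cleaning removes navigation menus and ads. "
        "Chunks are embedded as vectors. "
        "A vector store returns the nearest chunks for a query."
    )
    chunks = sentence_chunks(document, max_tokens=8)
    index = [(chunk, embed(chunk)) for chunk in chunks]  # in-memory "vector store"
    context = retrieve("How does cleaning remove ads?", index, k=1)
    # The retrieved context would be prepended to the user query in the LLM prompt.
    print(context)
```

In a full pipeline, the retrieved chunks would be inserted into the prompt sent to the generative model, so that the answer can be checked against the source text.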
Keywords
Retrieval-Augmented Generation, Web Crawling, Chunking of Texts, Vector Embeddings, Semantic Search, Large Language