Intelligent Semantic Retrieval Generation Web Crawling Chunking (RAG) System for AI






Create Date 14/11/2025
Last Updated 14/11/2025


Mrs. Darshika Kelin AR

Assistant Professor,

Dept. of DSCS, Karunya University,

Coimbatore

darshika@karunya.edu

Dhanush Kumar.LS

U.G Scholar,

Dept. of DSCS, Karunya University,

Coimbatore

dhanushkumar22@karunya.edu.in

Abstract

This paper presents the development of a complete, scalable Retrieval-Augmented Generation (RAG) system, with particular emphasis on a reliable pipeline for data ingestion and processing. The proposed method offers a systematic way to supply large language models (LLMs) with current, external information and so reduce factual errors, or “hallucinations.” The system begins with an automated web crawling procedure that collects unstructured data from selected online sources. This raw data is then carefully cleaned to remove noise and yield high-quality, domain-specific text. The core of the pipeline is a multi-phase data preparation technique. First, text chunking methods, such as token-based and sentence-based segmentation, break large documents into coherent, readable segments. Next, a state-of-the-art embedding technique converts these chunks into numerical vector embeddings that capture their semantic meaning. Storing the resulting embeddings in a dedicated vector database such as Pinecone, FAISS, or ChromaDB enables efficient similarity-based retrieval. In the final stage, the most relevant chunks are dynamically retrieved from this vector store in response to the user’s query and passed to a generative AI model, e.g., GPT. This Retrieval-Augmented Generation technique is designed to ensure that the model’s output is both contextually relevant and directly verifiable against the source documents. The study provides a complete and replicable RAG framework that offers an affordable and safe way to produce accurate, authoritative text, suitable for a wide range of uses from information management to conversational AI.

Keywords

Retrieval-Augmented Generation, Web Crawling, Chunking of Texts, Vector Embeddings, Semantic Search, Large Language
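
The chunk, embed, retrieve, and generate stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: all function names are hypothetical, a bag-of-words count vector stands in for a real embedding model, an in-memory list stands in for a vector database such as Pinecone, FAISS, or ChromaDB, and the final step only builds a prompt instead of calling a generative model.

```python
import math
import re
from collections import Counter


def sentence_chunks(text, max_tokens=40):
    """Group whole sentences into chunks of at most max_tokens whitespace tokens.

    A single sentence longer than max_tokens is kept whole as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (brute-force search)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Assemble the retrieved chunks and the question into one prompt string."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    document = (
        "Web crawlers collect pages from selected sources. "
        "Cleaning removes markup and noise. "
        "Chunking splits documents into segments. "
        "Embeddings map each segment to a vector."
    )
    chunks = sentence_chunks(document, max_tokens=12)
    question = "What does cleaning remove?"
    print(build_prompt(question, retrieve(question, chunks, k=1)))
```

In a fuller system, `embed` would call a trained embedding model and `retrieve` would query an approximate nearest-neighbour index rather than sorting every chunk. The sentence-based splitter is used here because it keeps each chunk readable on its own, which is the property the abstract emphasizes.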


