RESEARCH ON WEB CRAWLERS AND THEIR PERFORMANCE VIA EXAMINATION OF SINGLE- AND MULTI-THREADING USING ALGORITHMS
CHIRITHOTI MANEESHA1, G VENU GOPAL2
1PG Scholar, Dept. of Computer Science and Engineering, PBR Visvodaya Institute of Technology & Science (Autonomous), Affiliated to JNTUA, Kavali, SPSR Nellore, A.P, India-524201.
2Associate Professor, Dept. of Computer Science and Engineering, PBR Visvodaya Institute of Technology & Science (Autonomous), Affiliated to JNTUA, Kavali, SPSR Nellore, A.P, India-524201.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Web crawling is a central concept in how the World Wide Web is indexed and searched today. There is an ongoing need for a reliable crawler system that returns relevant and efficient search results for common queries; users routinely encounter search results containing unsuitable or inaccurate responses, so better methods are needed to deliver accurate results within a reasonable time. Because not all sites can be visited in a limited time, web users and analysts may receive less useful or even irrelevant results. A web crawler, also called a spider, is a robot that follows hyperlinks to explore web pages; the documents traversed from a web page are gathered in a web repository. The process begins with a seed URL, from which the crawler downloads the web content. Any new links found in the downloaded documents are added to the URL queue, and when a URL is removed from the queue, the crawler verifies whether that document has already been downloaded. In this project, we demonstrate an efficient method for developing a crawler that takes these aspects into account. Many crawlers visit the seed URL, read the pages, and download them to add to search engine indexes; a problem arises when the crawler keeps revisiting out-of-date websites or pages that were already downloaded on a previous visit, which wastes time, storage space, bandwidth, and network resources. Pages can be retrieved using a variety of algorithms. This study proposes a new method for web crawlers that uses clustering techniques: a single- and multi-threaded web crawling and indexing algorithm. The aim is therefore to build a sound system with a revised re-crawl policy that reduces the occurrence of these issues. After the first scan sorts websites into "frequently updated," "less frequently updated," and "static" categories, the crawler determines when it needs to crawl each page again. Experimental results demonstrate that, compared with current approaches, the proposed algorithm achieves better execution time. The main focus of this project is building an intelligent crawler that learns to improve the effective ranking of URLs using a focused crawler. Initially, links are crawled from specific Uniform Resource Locators (URLs) using a crawling algorithm, which allows users to perform hierarchical scanning of their respective web links.
Key Words: Web crawler, URLs, Multi-Threaded Crawling, Web Crawling Tree, Algorithms.
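The crawling workflow summarized in the abstract (start from a seed URL, download pages, add newly discovered links to a shared URL queue, and skip URLs that have already been downloaded, with several worker threads sharing that queue) can be illustrated with a minimal sketch. The listing below is not the authors' implementation: the seed URL, page limit, thread count, and regex-based link extraction are illustrative assumptions, and the frequency-based re-crawl policy proposed in the paper is not shown.

import queue
import re
import threading
import urllib.request

SEED_URLS = ["https://example.com/"]   # hypothetical seed URL
MAX_PAGES = 50                         # stop after this many downloads
NUM_THREADS = 4                        # size of the crawling thread pool

frontier = queue.Queue()               # shared URL queue (the crawl frontier)
visited = set()                        # URLs that have already been downloaded
visited_lock = threading.Lock()
link_pattern = re.compile(r'href="(https?://[^"]+)"')

def worker():
    while True:
        try:
            url = frontier.get(timeout=3)          # idle workers exit once the frontier stays empty
        except queue.Empty:
            return
        with visited_lock:
            skip = url in visited or len(visited) >= MAX_PAGES
            if not skip:
                visited.add(url)                   # mark as downloaded before fetching
        if not skip:
            try:
                page = urllib.request.urlopen(url, timeout=5)
                html = page.read().decode("utf-8", "ignore")
                for link in link_pattern.findall(html):   # extract new links
                    frontier.put(link)                    # add them to the frontier
            except OSError:
                pass                                      # skip unreachable pages
        frontier.task_done()

for seed in SEED_URLS:
    frontier.put(seed)
threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Downloaded", len(visited), "pages")

In this sketch the lock-protected visited set plays the role of the "has this URL already been downloaded?" check described above; running with NUM_THREADS = 1 corresponds to single-threaded crawling, while larger values exercise the multi-threaded case.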