Random Forest Classifier based Scheduler Optimization for Search Engine Web Crawlers

Lakmal Meegahapola, Vijini Mallawaarachchi, Roshan Alwis, Eranga Nimalarathna, Dulani Meedeniya, Sampath Jayarathna

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

2 Citations (Scopus)

Abstract

The backbone of every search engine is its set of web crawlers, which revisit all indexed web pages and update the search indexes with fresh copies whenever a page has changed. Crawling supports optimum search results by keeping the indexes refreshed and up to date. An “ideal scheduler” would crawl each web page immediately after a change occurs. Building such a scheduler is feasible when the web crawler knows how often each page changes. This paper presents a novel methodology for determining the change frequency of a web page using machine learning and server scheduling techniques. The methodology was evaluated on more than 3,000 web pages with various change patterns. The results indicate how Information Access (IA) and Performance Gain (PG) are balanced out to zero in order to create an optimum crawling schedule for search engine indexing.
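The scheduling idea in the abstract can be illustrated with a minimal sketch. It assumes that some upstream estimator (for example, a trained classifier) has already produced a predicted change interval for each page. The function name `build_schedule`, the example URLs, and the hour-based intervals are hypothetical and are not taken from the paper; this is not the authors' scheduler, only a simple priority-queue illustration of revisiting each page at its predicted change interval.

```python
import heapq


def build_schedule(change_intervals, horizon):
    """Return (time, url) crawl events up to `horizon`, in time order.

    change_intervals: dict mapping url -> predicted hours between changes.
    These values are assumed inputs from an upstream estimator.
    """
    for url, interval in change_intervals.items():
        if interval <= 0:
            raise ValueError(f"interval for {url!r} must be positive")

    # Each heap entry is (next_crawl_time, url); the first crawl of a page
    # is planned one predicted interval after time zero.
    heap = [(interval, url) for url, interval in change_intervals.items()]
    heapq.heapify(heap)

    events = []
    while heap and heap[0][0] <= horizon:
        when, url = heapq.heappop(heap)
        events.append((when, url))
        # Re-queue the page for its next predicted change.
        heapq.heappush(heap, (when + change_intervals[url], url))
    return events


# Hypothetical pages: one predicted to change every 2 hours, one every 5.
predicted = {"news.example/home": 2, "blog.example/post": 5}
schedule = build_schedule(predicted, horizon=10)
for when, url in schedule:
    print(f"t={when:>2}h  crawl {url}")
```

Pages with shorter predicted intervals are visited more often, so crawl effort is spent roughly in proportion to how frequently each page is expected to change.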

Original language: English
Title of host publication: Proceedings of 2018 7th International Conference on Software and Computer Applications, ICSCA 2018
Publisher: Association for Computing Machinery
Pages: 285-289
Number of pages: 5
ISBN (Electronic): 9781450354141
Publication status: Published - 8 Feb 2018
Externally published: Yes
Event: 7th International Conference on Software and Computer Applications, ICSCA 2018 - Kuantan, Malaysia
Duration: 8 Feb 2018 to 10 Feb 2018

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 7th International Conference on Software and Computer Applications, ICSCA 2018
Country/Territory: Malaysia
City: Kuantan
Period: 8/02/18 to 10/02/18

Keywords

  • Change frequency
  • Information access
  • Optimum scheduler
  • Performance gain
  • Search engine indexes
  • Web crawler
