TY - GEN
T1 - Random Forest Classifier based Scheduler Optimization for Search Engine Web Crawlers
AU - Meegahapola, Lakmal
AU - Mallawaarachchi, Vijini
AU - Alwis, Roshan
AU - Nimalarathna, Eranga
AU - Meedeniya, Dulani
AU - Jayarathna, Sampath
PY - 2018/2/8
Y1 - 2018/2/8
AB - The backbone of every search engine is the set of web crawlers, which go through all indexed web pages and update the search indexes with fresh copies, if there are changes. The crawling process provides optimum search results by keeping the indexes refreshed and up to date. This requires an “ideal scheduler” to crawl each web page immediately after a change occurs. Creating an optimum scheduler is possible when the web crawler has information about how often a particular change occurs. This paper discusses a novel methodology to determine the change frequency of a web page using machine learning and server scheduling techniques. The methodology has been evaluated with 3000+ web pages with various changing patterns. The results indicate how Information Access (IA) and Performance Gain (PG) are balanced out to zero in order to create an optimum crawling schedule for search engine indexing.
KW - Change frequency
KW - Information access
KW - Optimum scheduler
KW - Performance gain
KW - Search engine indexes
KW - Web crawler
UR - http://www.scopus.com/inward/record.url?scp=85048465877&partnerID=8YFLogxK
U2 - 10.1145/3185089.3185103
DO - 10.1145/3185089.3185103
M3 - Conference contribution
AN - SCOPUS:85048465877
T3 - ACM International Conference Proceeding Series
SP - 285
EP - 289
BT - Proceedings of 2018 7th International Conference on Software and Computer Applications, ICSCA 2018
PB - Association for Computing Machinery
T2 - 7th International Conference on Software and Computer Applications, ICSCA 2018
Y2 - 8 February 2018 through 10 February 2018
ER -