Distributed and Collaborative Web Change Detection System
Computer Science and Information Systems, Tome 12 (2015) no. 1
Search engines use crawlers to traverse the Web, download web pages, and build their indexes. Keeping these indexes up to date is essential to the quality of search results. However, changes in web pages are unpredictable. Detecting a change as soon as possible after it occurs, and at minimal computational cost, is a major challenge. In this article we present the Web Change Detection system which, in the best case, is capable of detecting a change to a web page almost in real time. In the worst case, it requires on average 12 minutes to detect a change on a web site with low PageRank and about one minute on a web site with high PageRank. By comparison, current search engines require more than a day, on average, to detect a modification to a web page, for both low and high PageRank sites.
Keywords:
Content refresh, Incremental crawling, Crawling systems, Search engines
@article{CSIS_2015_12_1_a5,
author = {V{\'\i}ctor M. Prieto and Manuel {\'A}lvarez and V{\'\i}ctor Carneiro and Fidel Cacheda},
title = {Distributed and {Collaborative} {Web} {Change} {Detection} {System}},
journal = {Computer Science and Information Systems},
year = {2015},
volume = {12},
number = {1},
url = {http://geodesic.mathdoc.fr/item/CSIS_2015_12_1_a5/}
}
TY  - JOUR
AU  - Víctor M. Prieto
AU  - Manuel Álvarez
AU  - Víctor Carneiro
AU  - Fidel Cacheda
TI  - Distributed and Collaborative Web Change Detection System
JO  - Computer Science and Information Systems
PY  - 2015
VL  - 12
IS  - 1
UR  - http://geodesic.mathdoc.fr/item/CSIS_2015_12_1_a5/
ID  - CSIS_2015_12_1_a5
ER  -
Víctor M. Prieto; Manuel Álvarez; Víctor Carneiro; Fidel Cacheda. Distributed and Collaborative Web Change Detection System. Computer Science and Information Systems, Tome 12 (2015) no. 1. http://geodesic.mathdoc.fr/item/CSIS_2015_12_1_a5/