Using XPaths of Inbound Links to Cluster Template-Generated Web Pages
Computer Science and Information Systems, Volume 11 (2014) no. 1.

See the article record from the source: Computer Science and Information Systems website

Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers, each of which must be matched only to pages of a particular template. Selecting pages of a single template type from all crawled Web pages is a time-consuming task. Although methods exist to cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach to structurally cluster Web pages by employing the XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world Websites in a few minutes and achieving more than 90% accuracy.
Keywords: Web data extraction, structural clustering, template-generated pages, wrapper induction
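
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a target page can be assigned to a cluster by the most frequent XPath of the inner-site links pointing to it, and that the XPath is taken without positional indices (for example /html/body/ul/li/a). The names LinkXPathParser and cluster_by_inbound_xpath are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Void elements never get a closing tag, so they are kept off the tag stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LinkXPathParser(HTMLParser):
    """Collects (target URL, index-free XPath) for every same-site link."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stack = []   # currently open tags, outermost first
        self.links = []   # (absolute target URL, XPath of the <a> element)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                target = urljoin(self.base_url, href)
                # Assumption: "inner-site" means same network location.
                if urlparse(target).netloc == urlparse(self.base_url).netloc:
                    self.links.append((target, "/" + "/".join(self.stack)))

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def cluster_by_inbound_xpath(pages):
    """pages maps URL -> HTML text; returns XPath -> set of target URLs."""
    votes = defaultdict(Counter)
    for url, html in pages.items():
        parser = LinkXPathParser(url)
        parser.feed(html)
        for target, xpath in parser.links:
            votes[target][xpath] += 1
    clusters = defaultdict(set)
    for target, counter in votes.items():
        xpath, _ = counter.most_common(1)[0]  # assumed rule: majority XPath
        clusters[xpath].add(target)
    return dict(clusters)
```

In this sketch the target pages are never downloaded or parsed for clustering; only the pages that link to them are read, which is one plausible reason an approach of this kind can scale to large crawls.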
@article{CSIS_2014_11_1_a9,
     author = {Tomas Grigalis and Antanas \v{C}enys},
     title = {Using {XPaths} of {Inbound} {Links} to {Cluster} {Template-Generated} {Web} {Pages}},
     journal = {Computer Science and Information Systems},
     publisher = {mathdoc},
     volume = {11},
     number = {1},
     year = {2014},
     url = {http://geodesic.mathdoc.fr/item/CSIS_2014_11_1_a9/}
}
Tomas Grigalis; Antanas Čenys. Using XPaths of Inbound Links to Cluster Template-Generated Web Pages. Computer Science and Information Systems, Volume 11 (2014) no. 1. http://geodesic.mathdoc.fr/item/CSIS_2014_11_1_a9/