Identifying Webpage Regions and Their Roles by Combining Image Processing and Markup Analysis
Understanding what the regions of a webpage are, and what functions those regions serve, is important for many services that operate over webpages, including screen readers, web search, and webpage similarity assessment. In this thesis, we present an approach that identifies the regions of a webpage using image processing techniques and then identifies the portions of the DOM tree corresponding to those regions. We then present and compare a rule-based approach and an SVM-based approach that use the visual and markup information to classify regions by their roles. A corpus of 150 webpages exhibiting a wide variety of designs was collected. Each page was annotated with human-assigned regions and their roles for use in training and in evaluating results. The segmentation algorithm accurately identified 77.8% of the 1222 webpage regions in the corpus, but its performance was uneven across region types. Segmentation accuracy was above 80% for headers, footers, body regions, and top navigation bars. The algorithm had more difficulty with left, right, and bottom navigation bars and with dynamic content, locating these segments with accuracy below 70%. The correctly segmented webpage components were used as a test collection to compare the rule-based and SVM-based approaches to assigning a role to each segment. Both approaches achieved between 74% and 75% accuracy over 951 classifications. The SVM-based approach was better at classifying left and bottom navigation bars, while the rule-based approach did better at recognizing dynamic content. When both methods were used together, accuracy rose to 81.3%; in this case, a region was counted as correctly identified if either the rule-based or the SVM-based method classified it correctly. Overall, these results are promising for incorporating this segmentation and segment role classification into web services.
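The combined figure reported in the abstract counts a region as correct when either classifier matches the human-assigned role. A minimal sketch of that calculation follows, assuming each method's predictions are available as a list aligned with the human labels. The function names, role labels, and toy data are hypothetical illustrations, not the thesis's code or results.

```python
from typing import Sequence


def accuracy(truth: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of regions whose predicted role equals the human-assigned role."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def either_correct_accuracy(
    truth: Sequence[str],
    rule_pred: Sequence[str],
    svm_pred: Sequence[str],
) -> float:
    """Fraction of regions that at least one of the two methods labels correctly."""
    hits = sum(t == r or t == s for t, r, s in zip(truth, rule_pred, svm_pred))
    return hits / len(truth)


# Hypothetical toy data: five regions with made-up role labels.
truth = ["header", "footer", "left_nav", "dynamic", "body"]
rule_pred = ["header", "footer", "body", "dynamic", "body"]
svm_pred = ["header", "body", "left_nav", "body", "body"]

rule_acc = accuracy(truth, rule_pred)
svm_acc = accuracy(truth, svm_pred)
either_acc = either_correct_accuracy(truth, rule_pred, svm_pred)

print(f"rule-based: {rule_acc:.2f}, SVM-based: {svm_acc:.2f}, either: {either_acc:.2f}")
```

Because this measure consults the human-assigned label to decide which method to credit, it is best read as an indication of how complementary the two methods are, not as the accuracy of a single deployable classifier.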
Singh, Sanjeev Kumar (2014). Identifying Webpage Regions and Their Roles by Combining Image Processing and Markup Analysis. Master's thesis, Texas A & M University. Available electronically from