Information retrieval from the Internet can be an arduous task because of the immense volume of available data. A specific type of information retrieval is the retrieval of corresponding data from a plurality of target web pages. An example of such corresponding data is a data set including name, image and price of a product for comparison purposes. Manual extraction of such information becomes unfeasible when a large number of target websites is to be considered.
For this reason, automated information retrieval is desirable. Many automated solutions use the structure of the target web pages to retrieve such data. For instance, search algorithms using the document object model representation of a website are well-known. An example of such an algorithm is XPath, which may be configured using screen scraping techniques, in which on-screen user-selected data in a source web page is converted into a set of extraction rules for the XPath algorithm to apply to target web pages.
However, such algorithms have limited usefulness for instance when considering so-called dynamic web pages. A dynamic web page may be a web page having data that is periodically updated, e.g. a share price monitoring web page, or a web page that is generated from a template using user-defined parameters. An example of such a web page is a product page of an online store, where the product page is generated from a backend database using a template and user input.
The problem with such dynamic pages is that small variations in the content may cause algorithms such as XPath to fail to recognize a path to a data node of interest in the DOM tree representation of the target web page. Such small variations may for instance include the presence of an attribute in the DOM tree of a source web page from which the extraction rules are created that is absent in the target web pages, thus causing failure of the algorithm to recognize the match between the branches of source and target. Alternative solutions include extraction algorithms based on machine learning techniques, but this requires machine learning experts to design customized machine models, which is expensive. In addition, the accuracy of these solutions is often unsatisfactory, especially in business environments.