Web pages on the World Wide Web are becoming more complex to accommodate rapidly growing information needs. For example, many web pages contain a variety of information such as headline news, sports scores, market information, shopping information, and entertainment news. Much of the information displayed on these web pages cannot be modified by users, because most web pages use fixed templates to position and display the information at various locations on the page. The information displayed in these web pages is stored in relational databases before being presented as human-readable HTML documents. Mining this information to determine its underlying structure is helpful when searching other data records or web pages for similar information.
Currently, two techniques exist for searching the data records of web pages to reveal their underlying structure. The first technique consists of programming a tool to search a given web page or web site according to a pattern observed by a programmer. This technique requires considerable user effort and is very difficult to scale to a large number of web pages from different domains. The second technique involves the automatic extraction of data records by search engines or programs. This technique suffers from numerous problems, including unsatisfactory accuracy. Additionally, the automatically extracted data records must all share the same schema, whereas different applications based on these data records need different schemas.