The Internet hosts a plethora of web portals in diverse fields such as e-commerce, boarding and lodging, and entertainment. Information on the web pages of such portals is often presented in a uniform format to give the pages a consistent look and feel. This can be achieved by using scripts to generate the static content and logical structure of the pages (referred to as a template), and a database to provide the dynamic content, such as pricing of products. Precise detection of the template can therefore be important for applications that automatically extract information from such sites or sources.
The template detection task can become more challenging when multiple entities, such as products or search results, are presented in the form of records on a single page. If the structure of the records is strictly continuous, i.e. the information in every record is formatted in the same way, existing nested pattern detection algorithms can suffice to extract precise information. However, the records do not always follow a strict structure or pattern, which requires the template detection mechanism to detect approximate patterns. This is because, although the structure of different records can be largely similar, their information may be formatted slightly differently. For example, a product description in one record can be in plain text, while in another record the product description can have formatting tags like <B> and <I>. Further, optional information, such as the presence of a discount price in addition to the original price, or the absence of a rating image in a record for which rating information was not available, can contribute to structural differences between two records within the same page. These factors, if not accounted for, can lead to ineffective, inefficient, or low-recall extraction when attempting to extract multiple entities from a page. Accordingly, effectively detecting approximate patterns can be useful to enable generation of a more precise template.
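By way of illustration only, the difference between strict and approximate matching of record structures can be sketched as follows. This is a minimal sketch under stated assumptions and is not a description of any particular template detection algorithm: each record is reduced to its sequence of tag names, and two sequences are compared using the similarity ratio of Python's standard `difflib.SequenceMatcher`. The sample records, the function names `tag_sequence`, `structural_similarity`, and `is_approximate_match`, and the 0.7 threshold are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Matches an opening or closing tag and captures the optional slash and the tag name.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tag_sequence(html):
    """Reduce a record to its sequence of lower-cased tag names, e.g. ['div', '/div']."""
    return [slash + name.lower() for slash, name in TAG_RE.findall(html)]


def structural_similarity(record_a, record_b):
    """Return a similarity ratio in [0, 1] between the tag sequences of two records."""
    matcher = SequenceMatcher(
        None, tag_sequence(record_a), tag_sequence(record_b), autojunk=False
    )
    return matcher.ratio()


def is_approximate_match(record_a, record_b, threshold=0.7):
    """Hypothetical rule: treat two records as the same pattern above a threshold."""
    return structural_similarity(record_a, record_b) >= threshold


# Hypothetical records: the second adds <b>/<i> formatting and an optional
# discount-price element that the first record lacks.
record_plain = (
    "<div><span>Widget</span><p>A sturdy widget.</p><span>$10</span></div>"
)
record_rich = (
    "<div><span>Gadget</span><p>A <b>sturdy</b> <i>gadget</i>.</p>"
    "<span>$12</span><span>$9</span></div>"
)

# A strict comparison of the two structures fails ...
strict_match = tag_sequence(record_plain) == tag_sequence(record_rich)
# ... while the approximate comparison can still group the records together.
approx_match = is_approximate_match(record_plain, record_rich)
```

In this sketch, a strict comparison treats the two records as different patterns because of the formatting tags and the optional price element, whereas the approximate comparison tolerates those differences. The threshold would in practice need to be tuned to the pages at hand.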
Therefore, it is with respect to these considerations and others that the present invention has been made.