Web sites present information on various topics in various formats. Manually locating and extracting useful data from web sites often requires a great amount of effort from a user. Therefore, there is a great need for value-added services that integrate information from multiple sources. Examples of such services include customizable web information gathering robots/crawlers, comparison-shopping agents, meta-search engines, and news bots.
To facilitate the development of these information integration systems, good tools are needed for information gathering and extraction. In situations where data has been collected from different web sites, a conventional approach for extracting data from various web pages uses programs called “wrappers” or “extractors” to extract the contents of the web pages. These programs typically treat the information presented in a single document as a single record of extracted results.
However, the information to be extracted from a web page is often placed in a structure that has a particular alignment and that forms repetitive patterns. For example, queryable or searchable Internet sites such as web search engines often produce web pages with large itemized match results that are displayed in a particular template format as multiple records/elements of information with identical structure and alignment.
The template can be recognized when, for each element of the web document, a string that indicates the appearance and category of the element can be determined. These strings form repetitive patterns, where each pattern represents one record/element of information. The patterns may not always repeat exactly and may have slight inconsistencies.
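The idea of mapping each element to a string and looking for repetition can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions, not as the method of any particular system: the class `ElementEncoder`, the functions `encode` and `most_repeated_ngram`, the token spelling (`<li>`, `TEXT`), and the fixed pattern length `n` are all hypothetical choices made for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class ElementEncoder(HTMLParser):
    """Hypothetical encoder: maps each element of a page to a short string."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # The tag name stands in for the element's appearance/category.
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # All non-blank text is collapsed into one category token.
        if data.strip():
            self.tokens.append("TEXT")


def encode(html):
    """Return the list of element strings for an HTML fragment."""
    encoder = ElementEncoder()
    encoder.feed(html)
    encoder.close()
    return encoder.tokens


def most_repeated_ngram(tokens, n):
    """Return (pattern, count) for the most frequent run of n tokens."""
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return counts.most_common(1)[0]


page = (
    "<ul>"
    "<li><b>Engineer</b> Acme</li>"
    "<li><b>Analyst</b> Initech</li>"
    "<li><b>Clerk</b> Globex</li>"
    "</ul>"
)
pattern, count = most_repeated_ngram(encode(page), 6)
print(pattern, count)
```

In this sample, the six-token run that starts at `<li>` and ends at `</li>` occurs once per list item, so it stands out as the candidate template. The sketch counts only exact repeats; patterns with slight inconsistencies would need an approximate comparison instead.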
FIG. 1 is a diagram illustrating a sample Hypertext Markup Language (HTML) page that contains multiple informational records. Each record represents a separate job opportunity. The page contains repeated patterns. Many searchable web sites, like job posting sites, search engines, and shopping sites, also exhibit such repeated patterns since these sites usually extract data from relational databases and produce dynamic web pages with a predefined format style.
To extract data from pages such as the page shown in FIG. 1, record boundaries need to be identified so that each record can be treated as a separate piece of information.
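As a rough illustration of what identifying record boundaries involves, the following Python sketch splits a sequence of element strings into records and measures how alike two records are. It assumes, purely for the example, that the document has already been reduced to a list of string tokens and that the token that begins each record is already known; the function names `split_records` and `similarity` are hypothetical. Finding that boundary token without human input is the hard part, and this sketch does not address it.

```python
from difflib import SequenceMatcher


def split_records(tokens, boundary):
    """Start a new record at each occurrence of the boundary token.

    Tokens before the first boundary are dropped. Tokens after the
    last record, such as a closing container tag, stay attached to
    the final record in this simplified version.
    """
    records = []
    for token in tokens:
        if token == boundary:
            records.append([token])
        elif records:
            records[-1].append(token)
    return records


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the token lists are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Assumed input: two records, the second missing one text element.
tokens = [
    "<ul>",
    "<li>", "<b>", "TEXT", "</b>", "TEXT", "</li>",
    "<li>", "<b>", "TEXT", "</b>", "</li>",
    "</ul>",
]
records = split_records(tokens, "<li>")
for record in records:
    print(record)
print(similarity(records[0], records[1]))
```

The two records differ slightly, so their similarity ratio falls below 1.0 while remaining high. A threshold on such a ratio is one way a pattern that does not repeat exactly could still be accepted as the same record type.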
Current approaches for identifying record boundaries suffer from some serious limitations. Many approaches require some form of human intervention or training data and, as a result, are not easily applied to large-scale tasks. Several approaches rely on the record-containing document being formatted in a known way and, as a result, are inapplicable to documents that are not formatted in that known way or whose structure changes over time. For example, some approaches require the record-containing document to be an HTML document that conforms to a specified scheme. These approaches fail when applied to documents that are not in HTML, that depart from the scheme even to a minor extent, or that change in structure or alignment.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.