The present invention relates generally to analyzing and extracting information from web pages, and more particularly to automatically identifying and extracting desired information in web pages.
The World Wide Web (WWW) is now the premier outlet for publishing information of all types and forms. Documents published on the web, commonly called web pages, are written in a language called HTML (Hypertext Markup Language), which sets standards for the formatting of documents. These standards make it possible for people to read and understand documents no matter which program they use for that purpose. For the most part, documents are designed and written to be read by people. But there is a growing need to have automatic programs extract certain parts of documents with minimal human intervention. For example, suppose that a document D contains information about product P. D may contain a picture of P, its description, its price, its availability and several characteristics of P. A different document D′, published by a different company about the same product P, may have similar parts, but they will most likely be arranged and formatted in a completely different way. People reading D and D′ can easily parse the information and understand its different pieces, but it is difficult for a computer program to do so without knowing in advance which pieces are included and how they are arranged. The same company that published the web page for product P may also publish pages on numerous other products. These pages may be similarly formatted, but since they describe different products they contain entirely different information.
As an example, a typical HTML document includes formatting commands, or tags, and content, which can be text, images, programs, and so on. HTML tags are enclosed in angle brackets < >. For example, the text “Product P available in California ON SALE for $19.99” can be formatted as:
<table><tr>Product P<img src=p.gif><i>available</i>in California<font color=red>ON SALE</font>for $19.99</tr></table>
This HTML code places the line in a table row, adds an image, italicizes “available”, and highlights “ON SALE” in red. A typical commerce page may have hundreds of formatting tags.
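By way of illustration only, the separation of formatting tags from content in the product P line can be sketched in Python using the standard library's html.parser module. The class name TagAndTextCollector and the function name split_markup are hypothetical names chosen for this sketch and are not part of the described system.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Hypothetical helper: records opening tags and text content separately."""

    def __init__(self):
        super().__init__()
        self.tags = []   # names of opening tags, in document order
        self.text = []   # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        fragment = data.strip()
        if fragment:
            self.text.append(fragment)


def split_markup(html):
    """Return (tags, text fragments) found in an HTML string."""
    collector = TagAndTextCollector()
    collector.feed(html)
    collector.close()
    return collector.tags, collector.text


# The product P line from the example above, copied verbatim.
PRODUCT_P = ('<table><tr>Product P<img src=p.gif><i>available</i>in California'
             '<font color=red>ON SALE</font>for $19.99</tr></table>')

tags, text = split_markup(PRODUCT_P)
print(tags)
print(text)
```

Even for this single line, five formatting tags surround five short text fragments, which suggests how quickly the markup can outweigh the content on a full commerce page.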
A different product Q may appear as:
“Product Q available in Oregon and Washington for $15.99” and be formatted as:
<table><tr>Product Q<img src=q.gif><i>available</i>in Oregon and Washington for $15.99</tr></table>
If one is interested in extracting only the price of the product, a typical rule-based extraction mechanism, trained on the first document for product P, may infer that the price appears after the ON SALE text, or after the red-formatted text. However, this same extraction mechanism, when applied to the second document for product Q, will miss the price of product Q, because neither the ON SALE text nor the red formatting is present. In general, a page may be much more complex and variable than these examples.
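The failure described above can be reproduced with a minimal Python sketch. The regular expression below is one hypothetical rule of the kind a rule-based extractor might infer from the product P page alone; the names PRICE_AFTER_SALE and extract_price_rule_based are assumptions made for illustration and do not describe any particular existing system.

```python
import re

# The two example lines from the text, copied verbatim.
PRODUCT_P = ('<table><tr>Product P<img src=p.gif><i>available</i>in California'
             '<font color=red>ON SALE</font>for $19.99</tr></table>')
PRODUCT_Q = ('<table><tr>Product Q<img src=q.gif><i>available</i>'
             'in Oregon and Washington for $15.99</tr></table>')

# Hypothetical rule learned from product P only:
# "the price follows the red ON SALE label".
PRICE_AFTER_SALE = re.compile(
    r'<font color=red>ON SALE</font>\s*for \$(\d+\.\d{2})'
)


def extract_price_rule_based(html):
    """Return the price as a string, or None when the rule does not match."""
    match = PRICE_AFTER_SALE.search(html)
    return match.group(1) if match else None


price_p = extract_price_rule_based(PRODUCT_P)  # rule matches the page it came from
price_q = extract_price_rule_based(PRODUCT_Q)  # rule finds no ON SALE label
print(price_p, price_q)
```

The rule recovers the price from the page it was derived from but returns nothing for product Q, even though the price is plainly present there. Loosening the rule to match any dollar amount would handle both lines, but on a real page with several prices it would no longer identify which one belongs to the product.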
Accordingly, it is desirable to provide methods and systems for analyzing the structure of web pages and for automatically extracting pertinent information from the web pages.