Massive amounts of data are stored on the Internet as HTML websites. Typically, the data is organized and presented in a manner that is easily understood by human readers. For example, in viewing a webpage of a website containing information describing suitable parking areas, a human reader is typically able to locate and to understand the pertinent information. The human reader skims the webpage for context clues or labels that point to or suggest the location of the pertinent information. The human reader locates the pertinent information even if the information is contained in a table or mixed within blocks of text. After locating the pertinent information, the human reader records the information or makes a mental note thereof so that the pertinent information can be used at the appropriate time, such as when navigating to a desired parking area. As set forth above, the human reader is capable of skimming through a multi-paragraph webpage and locating the pertinent information in a matter of seconds; however, it is difficult for computers to efficiently identify the pertinent information from a webpage.
Information extraction (“IE”) refers to the process of using a computer to extract pertinent information from a website. The extracted information is then stored to a database of organized pertinent information that is easily accessible and searchable by other computers. Known methods of IE are either supervised or unsupervised. Supervised IE requires an engineer or technician to review the information extracted from a website and to manually determine if the information is desirable. That is, the engineer or technician manually discriminates between unuseful or uninteresting information and useful or interesting information. The engineer causes the computer to store the useful, interesting, and/or informative information (hereinafter collectively “informative content”) to a database and to discard the unuseful or uninteresting information by creating a set of rules or training examples for the computer to follow. Some of the rules and training examples may be specific to the extracted information of only a single website or webpage, whereas other rules may have a more global usage, such that over time the computer may become more efficient at identifying the informative content. Unsupervised IE does not require an engineer or technician to create rules for determining if the extracted information is useful or interesting. Instead, a computer engaged in unsupervised IE performs statistical analysis over the extracted information to identify the informative content and outputs the desired data in database table form. Since unsupervised IE requires little to no human intervention, it is typically faster and more efficient than supervised IE.
Unsupervised IE, however, is typically less accurate than supervised IE. Typically, known systems performing unsupervised IE generate “false positives,” which are data that a human would consider unuseful or uninteresting but that the computer determined to be useful or interesting. The accuracy of the system is reduced when the system stores false positives to a knowledge base of informative content. Moreover, if the informative content of a website is not presented in a manner that conforms to the statistical analysis approach applied by the computer, then the computer may not properly extract and organize the informative content.
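By way of illustration only, the effect of false positives and of missed informative content can be quantified with the standard precision and recall measures. The following Python sketch assumes a hypothetical set of items extracted by a computer and a hypothetical set of items that a human would consider informative; the function name `precision_recall` and the sample parking data are assumptions made for this example and do not describe any particular known system.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted items.

    extracted: items the computer stored as informative content
    relevant:  items a human reviewer would consider informative
    """
    extracted, relevant = set(extracted), set(relevant)
    true_positives = extracted & relevant  # correctly extracted items

    # Precision falls as false positives (extracted but not relevant) increase.
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    # Recall falls as informative items are missed by the extraction.
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sample data for a parking-information webpage.
relevant = {
    "Lot A: $5/hour",
    "Lot B: open 24 hours",
    "Garage C: 300 spaces",
}
extracted = {
    "Lot A: $5/hour",
    "Lot B: open 24 hours",
    "Subscribe to our newsletter",  # false positive
    "Copyright notice",             # false positive
}

precision, recall = precision_recall(extracted, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example, two of the four extracted items are informative, giving a precision of 0.50, and two of the three informative items were extracted, giving a recall of approximately 0.67.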
Unsupervised IE has the potential to more efficiently add informative content to knowledge bases. However, there is a continuing need to increase the precision and recall of unsupervised IE. Thus, further developments in the area of unsupervised IE are desirable.