The automated extraction, or mining, of data from content sources is a growing industry. Data can be extracted from either a structured content source, such as a database, or a semi-structured content source, such as a Web page on the World Wide Web (WWW), an electronic mail message, a PDF document, or an XML document. Extracting data from a semi-structured content source is more difficult than extracting it from a structured content source, because the data in a semi-structured content source does not conform to the formal structure of tables and data models associated with databases. Repeatedly extracting correct data from semi-structured content sources is difficult due to the structure, or lack thereof, of the data.
Due to the lack of structure, extracting data from a semi-structured content source requires creative techniques that cannot match the reliability of the straightforward techniques applied to well-structured data sources. Even when data is being successfully extracted from a content source, changes in the content source and/or the data can cause problems in subsequent extractions. Among the most common changes are those made to the Hypertext Markup Language (HTML) of a Web page. While extraction techniques differ in their fragility, such changes to the HTML can ultimately prove significant enough to cause data extraction from a Web page to fail. Human monitoring of all subsequent extractions would be a massive undertaking. Therefore, in systems that routinely obtain or extract data from content sources, there is a need for a method and system of automating the quality monitoring and control of the data extracted.
As noted, the initial extraction of data from a semi-structured content source can be very difficult. Even if the initial data mining from a semi-structured source is successfully accomplished, the ultimate goal is to repeatedly obtain correct and well-structured data from a potentially changing semi-structured content source. Because there are no well-structured “landmarks” in the semi-structured content, and the data is not necessarily embedded in the content in the desired form, the automated techniques used to obtain the data are heuristic in nature. That is, the data ultimately extracted may not actually be the data desired. Incorrect data can cause errors in later processing steps and/or be transmitted to an unknowing user. The data extraction tool used to extract the data from the content source might not be configured to, or be able to, notify a system administrator that the data being extracted is not actually the data desired. Therefore, there is a need for a system and method for notifying or signaling when the data extracted from a changing semi-structured content source is not the correct or desired data.
One example of a semi-structured content source is a Web page. Data extraction/mining from a Web page can be difficult, in part because Web pages on the WWW are in a semi-structured format. Web pages are typically defined in HTML. Data mining is hindered by the fact that there is no defined structure for organizing the data on a Web page, and it is difficult to determine the Web page scheme because it is buried in the underlying HTML code. Additionally, a similar visual effect, as defined by the Web page scheme, can be achieved with different HTML features, such as HTML tables, ordered lists, or HTML tagging, which adds to the difficulty of data mining from a Web page.
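By way of illustration only, the following minimal Python sketch shows how the same data might be marked up either as an HTML table or as an ordered list, and how an extractor tied to one form of markup can silently return nothing for the other. The class TableCellExtractor, the function extract_cells, and the two sample pages are hypothetical and are introduced here solely for the example; they do not describe any particular extraction tool.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Hypothetical extractor: collects the text of every <td> cell only."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # Start a new, empty cell whenever a <td> opens.
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text outside a <td> is ignored entirely.
        if self.in_cell:
            self.cells[-1] += data


def extract_cells(html):
    """Return the stripped text of each table cell found in the HTML."""
    parser = TableCellExtractor()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Two assumed sample pages carrying the same data in different markup.
TABLE_PAGE = "<table><tr><td>Widget</td><td>9.99</td></tr></table>"
LIST_PAGE = "<ol><li>Widget</li><li>9.99</li></ol>"

table_result = extract_cells(TABLE_PAGE)  # both values are found
list_result = extract_cells(LIST_PAGE)    # empty list: no <td> elements exist

print(table_result)
print(list_result)
```

In this sketch the extractor raises no error for the list-based page; it simply yields an empty result, which is the kind of silent failure that makes unmonitored extraction from a changing Web page unreliable.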
Therefore, even if the initial extraction of data from a Web page is successfully accomplished, issues can arise in subsequent extractions. One such issue arises when the appearance or layout of the Web page, or the format of the data, is changed. When the data on a Web page is altered, the data mining tool becomes susceptible to extracting undesired data or to other extraction errors. This can lead to a multitude of other issues, including but not limited to receiving mislabeled data, receiving only portions of the desired data, or receiving none of the desired data at all from the Web page.
Therefore, there is a need for a system and method for monitoring and controlling the quality of the data yielded from a content source, and specifically from a semi-structured content source. In systems that routinely obtain or extract data from semi-structured content sources, there is a further need for a method and system of automating the quality monitoring and control of the data extracted. Finally, there is a need for a system and method for notifying or signaling when the data extracted from a content source is not the correct or desired data.