This invention relates to information extraction from a plurality of disparate information sources. More specifically, this invention relates to automatic identification of field types and classification of record types in a source of electronically-readable textual information. A specific application is in the field of data mining of web pages accessible via the World Wide Web. Another specific application is data mining of electronic mail records, plain text documents or structured databases.
There is a need for a database of ordered information where the source of the information is not organized in a form that is directly translatable. Database conversion engines can perform field-to-field translation, but they do not address the need to extract and organize information found in text.
Heretofore, text information extraction engines have been able to extract information according to a standardized pattern from multiple sources (global extraction), or to extract information based on learned or manually developed regularities specific to a subdomain (local extraction).
Parameter estimation is used for pattern recognition. However, parameter estimation is often difficult due to lack of sufficient labeled training data, especially where numerous parameters must be learned, as is the case when learning statistical language models, document classifiers, information extractors, and other regularities used for data mining of text data.
Machine learning generally assumes that all training data and test data come from a common distribution. However, certain subsets of the data may share regularities with each other but not with the rest of the data. For example, consider product names on corporate web sites from all over the web. The product names on a particular web site may share similar formatting but have formatting differing significantly from product names on other companies' web sites. Other examples include annotations for patients from a particular hospital, voice sounds from a particular speaker, and vibration data associated with a particular airplane. These subsets are called localities.
Taking advantage of local regularities can help a learning method deal with limited labeled training data because the local regularities are often simpler patterns and can be described using fewer parameters.
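The notion of a locality can be illustrated with a minimal sketch. It assumes, purely for illustration, that each text fragment is tagged with the web site it came from and the HTML tag that wraps it, and that one labeled product name per site is available. The site names, the data, and the functions `learn_local_patterns` and `extract` are hypothetical and are not taken from any referenced work.

```python
from collections import Counter, defaultdict

def learn_local_patterns(labeled):
    """Learn one parameter per site: the tag most often wrapping known product names.

    labeled: iterable of (site, tag) pairs for fragments known to be product names.
    """
    counts = defaultdict(Counter)
    for site, tag in labeled:
        counts[site][tag] += 1
    return {site: c.most_common(1)[0][0] for site, c in counts.items()}

def extract(pages, patterns):
    """Return (site, text) for every fragment whose tag matches its site's pattern.

    pages: iterable of (site, tag, text) fragments.
    """
    return [(site, text) for site, tag, text in pages if patterns.get(site) == tag]

# Hypothetical data: the two sites format product names differently.
pages = [
    ("a.example", "b", "Widget Pro"),
    ("a.example", "p", "About us"),
    ("a.example", "b", "Widget Lite"),
    ("b.example", "h3", "Gizmo 2000"),
    ("b.example", "b", "Contact"),
    ("b.example", "h3", "Gizmo Mini"),
]
labeled = [("a.example", "b"), ("b.example", "h3")]

patterns = learn_local_patterns(labeled)
products = extract(pages, patterns)
```

In this hypothetical data, a single global rule such as "bold text is a product name" would wrongly select "Contact" on the second site, whereas each local pattern needs only one parameter and one labeled example.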
Expectation maximization is a known technique for providing data with confidence labels. An example is reported by K. Nigam, A. McCallum, S. Thrun, and T. Mitchell in “Learning to Classify Text from Labeled and Unlabeled Documents,” published in The Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI Press, 1998).
Work in this field has been reported by Sergey Brin in “Extracting Patterns and Relations from the World Wide Web,” published in The Proceedings of the 1998 International Workshop on the Web and Databases, March 1998. The application was extraction of authorship information for books as found in descriptions of the books on web pages. This work introduced the process of dual iterative pattern-relation extraction, wherein a relation and a pattern set are iteratively constructed. Among other limitations, the Brin approach employed a lexicon as a source of global regularities, and there is no disclosure or suggestion of formulating or “learning” site-specific (“local”) patterns or even of an iterative procedure for refining site-specific and page-specific (local) patterns.
Agichtein and Gravano, “Snowball: Extracting Relations from Large Plain-Text Collections,” dated Nov. 29, 1999; Riloff and Jones, “Learning Dictionaries for Information Extraction by Multi-level Bootstrapping,” Proceedings of the Sixteenth National Conference on Artificial Intelligence (1999); and Collins and Singer, “Unsupervised Models for Named Entity Classification,” represent other lexicon generators of the same general form as Brin.
Work by William W. Cohen of AT&T Research Labs entitled “Recognizing Structure in Web Pages using Similarity Queries,” Proceedings of the National Conference on Artificial Intelligence (AAAI) (July 1999) http://www.aaai.org, also uses lexical-based approximate matching.
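The alternation between relation and pattern in dual iterative pattern-relation extraction can be sketched roughly as follows. This is a heavily simplified illustration and not Brin's published algorithm: it assumes that each document is a single line, that the title always precedes the author, and that a pattern is just the literal text before, between, and after the two fields. The corpus, the seed pair, and the function names are hypothetical, and the pattern-specificity checks that a practical system would need are omitted.

```python
import re

def induce_patterns(corpus, pairs):
    """For each known (author, title) pair found in a line, record the literal
    (prefix, middle, suffix) text surrounding the two fields."""
    patterns = set()
    for author, title in pairs:
        for line in corpus:
            t = line.find(title)
            if t < 0:
                continue
            a = line.find(author, t + len(title))
            if a < 0:
                continue
            middle = line[t + len(title):a]
            if middle:  # an empty middle would match almost anything
                patterns.add((line[:t], middle, line[a + len(author):]))
    return patterns

def apply_patterns(corpus, patterns):
    """Return every (author, title) pair that some pattern matches in the corpus."""
    found = set()
    for prefix, middle, suffix in patterns:
        rx = re.compile(
            re.escape(prefix) + "(.+?)" + re.escape(middle) + "(.+?)" + re.escape(suffix)
        )
        for line in corpus:
            m = rx.fullmatch(line)
            if m:
                found.add((m.group(2), m.group(1)))
    return found

def dipre(corpus, seeds, rounds=3):
    """Alternate between inducing patterns and extracting pairs until nothing new appears."""
    pairs = set(seeds)
    for _ in range(rounds):
        new_pairs = apply_patterns(corpus, induce_patterns(corpus, pairs))
        if new_pairs <= pairs:
            break
        pairs |= new_pairs
    return pairs

# Hypothetical corpus with two different formatting conventions.
corpus = [
    "<li><b>Foundation</b> by Isaac Asimov</li>",
    "<li><b>Dune</b> by Frank Herbert</li>",
    "<p>Neuromancer, written by William Gibson.</p>",
    "<p>Hyperion, written by Dan Simmons.</p>",
    "<p>Dune, written by Frank Herbert.</p>",
]
books = dipre(corpus, {("Isaac Asimov", "Foundation")})
```

Starting from one seed, the first round learns the list-item pattern and finds a second book; the second round uses that book to learn the paragraph pattern and finds the remaining two. The learned patterns are pooled across the whole corpus rather than tied to any one site.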
Other classification methods are known in the art and are briefly characterized here. The first is transduction. Transduction allows the parameters of a global classifier to be modified by data coming from a dataset that is to be classified, but it does not allow an entirely different classifier to be created for this dataset. In other words, a single classifier must be used, applying to the entire set of data to be classified, and the algorithm is given no freedom to select subsets of the data over which to define local regularities.
Another classification-related method is co-training. In co-training, the key idea is to learn two classifiers which utilize two independent and sufficient views of an instance. Although co-training does learn two distinct classifiers, both of these apply to the global dataset. Here again, the algorithm does not learn local classifiers, and it has no freedom to select subsets of data over which stronger classifiers might be learned.
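A minimal sketch of co-training follows, assuming two views of each web page (the words on the page and the anchor text of links pointing to it) and a deliberately crude token-count classifier. The data, the labels, and the functions `train`, `predict`, and `co_train` are hypothetical. In each round this sketch moves only the single most confident example per view, which is a simplification of published co-training procedures.

```python
from collections import Counter

def train(examples):
    """examples: list of (tokens, label). Return per-label token counts."""
    model = {}
    for tokens, label in examples:
        model.setdefault(label, Counter()).update(tokens)
    return model

def predict(model, tokens):
    """Return (label, margin), where margin is the score gap to the runner-up label."""
    scores = {label: sum(counts[t] for t in tokens) for label, counts in model.items()}
    ranked = sorted(scores.values(), reverse=True)
    margin = ranked[0] - (ranked[1] if len(ranked) > 1 else 0)
    return max(scores, key=scores.get), margin

def co_train(labeled, unlabeled, rounds=5):
    """labeled: list of ((view0, view1), label); unlabeled: list of (view0, view1).

    Each view's classifier labels its most confident unlabeled example, and that
    example then becomes training data for both views.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        moved = False
        for view in (0, 1):
            if not pool:
                break
            model = train([(x[view], y) for x, y in labeled])
            scored = [(predict(model, x[view]), x) for x in pool]
            (label, margin), best = max(scored, key=lambda s: s[0][1])
            if margin > 0:  # only accept a prediction the view is confident about
                labeled.append((best, label))
                pool.remove(best)
                moved = True
        if not moved:
            break
    return labeled

# Hypothetical data: (page words, anchor-text words).
seed = [
    ((("course", "homework", "syllabus"), ("cs101", "lecture")), "course"),
    ((("research", "publications", "cv"), ("prof", "homepage")), "faculty"),
]
unlabeled = [
    (("homework", "exam"), ("seminar", "notes")),
    (("exam", "grades"), ("lecture", "notes")),
    (("publications", "students"), ("dr", "smith")),
    (("teaching", "awards"), ("dr", "homepage")),
]
result = dict(co_train(seed, unlabeled))
```

Both classifiers in this sketch are trained on, and applied to, the entire pool of examples; nothing in the procedure partitions the data into subsets with their own classifiers.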
Lexical-based approximate matching has been observed to lack the matching accuracy and context sensitivity needed to formulate local regularities at a desired level of accuracy. Moreover, there has been no recognition of the significance of the differences in the scope of regularities or of the different types of regularities that can be learned based on the scope of regularities. For example, regularities that hold within a website do not necessarily hold across the entire World Wide Web. There is a need for a reliable mechanism for formulating site-specific regularities using only a reasonable amount of training effort and resources.