The present invention relates to the field of automated information retrieval in the context of document characterization and classification. Particularly, the present invention relates to a system and associated method for classifying semi-structured data maintained in systems that are linked together over an associated network such as the Internet. More specifically, this invention pertains to a computer software product for dynamically categorizing and classifying documents by taking advantage of both textual information and latent information embedded in the structure or schema of the documents, in order to classify their contents with a high degree of precision. This invention incorporates a structured vector model, and relies on a document classifier that assumes a structured vector model.
The World Wide Web (WWW) is comprised of an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. The phenomenal growth of the WWW has led to the proliferation of data in semi-structured formats such as HTML and XML. There is a pressing need to support efficient and effective information retrieval, search and filtering. An accurate classifier is an essential component of building a semi-structured database system.
Currently, users navigate Web pages by means of computer software search tools that commonly fall into two broad categories: net directories and search engines. Net directories provide a hierarchical classification of documents based on a manual classification of Web page materials and data. Search engines use a keyword-based search methodology to return to the user a set of pages that contain a given keyword or words. Both search tools suffer from significant limitations. Net directories are precise but are very limited in scope and expensive to maintain, primarily because of the requirement for human effort to build and maintain them. Search engines are more capable of covering the expanse of the Web but suffer from low precision and, in their current embodiments, are reaching their logical limits. Search engines may provide to the user a null return or, conversely, a multitude of responses, the majority of which are irrelevant.
A number of techniques have been applied to the problem of automated document classification, among them statistical decision theory, machine learning, and data mining. Probabilistic classifiers use the joint probabilities of words and categories to estimate the probability of a document falling in a given category. These are the so-called term-based classifiers. Neural networks have been applied to text categorization. Decision tree algorithms have been adapted for data mining purposes.
The problems associated with automated document classification are manifold. The nuances and ambiguity inherent in language contribute greatly to the lack of precision in searches and the difficulty of achieving successful automated classification of documents. For example, it is quite easy for an English-speaking individual to differentiate between the meanings of the word “course” in the phrase “golf course” and the phrase “of course.” A pure, term-based classifier, incapable of interpreting contextual meaning, would wrongly lump the words into the same category and reach a flawed conclusion about a document that contained the two phrases. Another difficulty facing automatic classifiers is the fact that not all terms are equal from a class standpoint.
Certain terms are good discriminators because they occur significantly more often in one class than in another. Other terms must be considered noise because they occur in all classes almost indifferently. An effective classifier must be able to differentiate good discriminators from noise. Yet another difficulty for classifiers is the evaluation of document structure and the relative importance of sections within the document. As an example, for a classifier dealing with resumes, sections on education and job skills would need to be recognized as being more important than hobbies or personal background.
These and other language problems represent difficulties for automated classification of documents of any type, but the World Wide Web introduces its own set of problems as well. Among these problems are the following:
The challenges, then, are to deal with the problems inherent in all documents but to also deal with the special problems associated with Web documents, in particular those with a semi-structured format.
As noted, semi-structured data are data that have a schema, either implicit or explicit, but do not have to conform to a fixed schema. By extension, semi-structured documents are text files that contain semi-structured data. Examples include documents in HTML and XML, which represent a large fraction of the documents on the Web.
The idea that exploiting the features inherent in such documents is key to obtaining better information retrieval is not new. For example, one classifier has been designed to specifically take advantage of the hyperlinks available in HTML. Reference is made to Soumen Chakrabarti, et al., “Enhanced Hypertext Categorization Using Hyperlinks,” Proc. of ACM SIGMOD Conference, pages 307-318, Seattle, Wash., 1998.
In this manner, the classifier can evaluate both local and non-local information to better categorize a document. However, there are more features of semi-structured documents that can be used for classification, along with new techniques for evaluating the information gleaned from the documents.
Currently, there exists no other classifier that takes full advantage of the information available in semi-structured documents to produce accurate classification of such documents residing on the World Wide Web. The need for such a classifier has heretofore remained unsatisfied.
The text classifier for semi-structured documents and associated method of the present invention satisfy this need. In accordance with one embodiment, the system can dynamically and accurately classify documents with an implicit or explicit schema by taking advantage of the term-frequency and term-distribution information inherent in the document. The system further uses a structured vector model that allows like terms to be grouped together and dissimilar terms to be segregated based on their frequency and distribution within the sub-vectors of the structured vector, thus achieving context sensitivity. The final decision for assigning the class of a document is based on a mathematical comparison of the similarity of the terms in the structured vector to those of the various class models.
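As one illustration of such a comparison, the following Python sketch scores a document against each class model sub-vector by sub-vector. It assumes, purely for illustration, that a structured vector is a mapping from an element path to term weights, and that cosine similarity averaged over element paths serves as the comparison measure. The names `cosine`, `structured_similarity`, and `classify` are hypothetical, and the measure itself is an assumption rather than the statistical model of this disclosure.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def structured_similarity(doc_vector, model):
    """Average the per-path cosine over every path seen in either vector.

    Terms are compared only with terms from the same element path
    (an illustrative choice, not the disclosed algorithm).
    """
    paths = set(doc_vector) | set(model)
    if not paths:
        return 0.0
    total = sum(cosine(doc_vector.get(p, {}), model.get(p, {})) for p in paths)
    return total / len(paths)

def classify(doc_vector, class_models):
    """Return the label of the class model most similar to the document."""
    return max(class_models,
               key=lambda label: structured_similarity(doc_vector, class_models[label]))

# Hypothetical class models: the same term "course" lives in different elements.
models = {
    "golf":  {"/page/title": {"course": 0.5, "golf": 0.5}},
    "other": {"/page/body":  {"course": 0.5, "of": 0.5}},
}
document = {"/page/title": {"course": 1, "golf": 1}}
best = classify(document, models)
```

Because terms are compared only within the same element path, the term "course" in a title contributes nothing to the match against a model in which "course" appears only in the body.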
The classifier of the present invention is capable of both learning and testing. In the learning phase, the classifier builds a model for each class from the composite information gleaned from numerous training documents. Specifically, it develops a structured vector model for each training document. Then, within a given class of documents, it adds and then normalizes the occurrences of terms.
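The add-and-normalize step of the learning phase may be sketched as follows. The sketch assumes that each training document has already been reduced to a mapping from element path to term counts, and that normalization means dividing each count by the total count of its sub-vector. Both choices, and the name `class_model`, are illustrative assumptions.

```python
from collections import Counter, defaultdict

def class_model(doc_vectors):
    """Build one class model from the structured vectors of its training documents.

    Counts are summed path by path, then each sub-vector is normalized so
    that its weights sum to 1 (an assumed normalization).
    """
    totals = defaultdict(Counter)
    for vector in doc_vectors:
        for path, counts in vector.items():
            totals[path].update(counts)

    model = {}
    for path, counts in totals.items():
        total = sum(counts.values())
        if total == 0:
            continue  # skip sub-vectors that carry no term occurrences
        model[path] = {term: count / total for term, count in counts.items()}
    return model

# Two hypothetical training documents of a "resume" class.
training = [
    {"/resume/skills": {"python": 1, "java": 1}, "/resume/hobbies": {"golf": 1}},
    {"/resume/skills": {"python": 1, "sql": 1}},
]
resume_model = class_model(training)
```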
The classifier further employs a feature selection technique to differentiate between good discriminators and noise and to discard noise terms on the basis of the structure in which the terms appear. It additionally employs a feature selection technique that determines the relative importance of sections of textual information. Once models for classes have been developed, the classifier can be used on previously unseen documents to assign best matching classes by employing a robust statistical algorithm.
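A structure-aware selection of discriminators can be illustrated with the short Python sketch below. It treats each (element path, term) pair as a feature and keeps the pair only when its weight differs between some two classes by at least a threshold. The gap statistic, the threshold `min_gap`, and the name `select_features` are assumptions made for illustration; the disclosure does not name the statistic used here.

```python
def select_features(class_models, min_gap=0.25):
    """Keep (path, term) features whose weight varies enough across classes.

    class_models maps a class label to {element path: {term: weight}}.
    A feature with nearly the same weight in every class is treated as noise.
    """
    features = set()
    for model in class_models.values():
        for path, terms in model.items():
            features.update((path, term) for term in terms)

    kept = {}
    for path, term in features:
        weights = [model.get(path, {}).get(term, 0.0)
                   for model in class_models.values()]
        gap = max(weights) - min(weights)
        if gap >= min_gap:
            kept[(path, term)] = gap
    return kept

# Hypothetical models: "java" is noise in the body but a discriminator in skills.
models = {
    "resume": {"/doc/skills": {"java": 1.0},
               "/doc/body": {"java": 0.5, "team": 0.5}},
    "recipe": {"/doc/body": {"java": 0.5, "flour": 0.5}},
}
selected = select_features(models)
```

In this example the same term is discarded in one element and retained in another, which is the kind of structure-dependent decision a flat term list cannot express.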
To fully appreciate the characteristics and capabilities of the classifier, it is first important to understand the basic characteristics of an XML (or other semi-structured) document and, further, to understand the concept of the extended model required to exploit the information encoded in such documents. XML documents differ from typical text documents in the following respects:
Proper classification of XML documents, thus, requires a scheme that exploits the rich information encoded in their structure. It is necessary to extend the notion of a document to incorporate the hierarchical sectioning of text. In an extended model, a document is hierarchically structured and text is embedded in the structure.
The hierarchical structure can be understood in the context of the analogy to a book. A book consists of many chapters, which, in turn, consist of many sections formed of many sentences, which, in turn, consist of many words. A word belongs to a sentence that contains the word, thus to the section that contains the sentence, the chapter that contains the section and, ultimately, to the book at its highest level. Thus, in a structured document a term (or equivalently a leaf or text) belongs to its antecedents. In the parlance of graph theory, the leaf belongs to its parent, its grandparent and all higher ancestors, ultimately belonging to the document or root.
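The book analogy can be made concrete with a small Python sketch that lists, for every term in an XML document, the chain of elements to which the term belongs. It uses the standard `xml.etree.ElementTree` parser; the function name `term_ancestors` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def term_ancestors(xml_text):
    """Return (term, [root, ..., enclosing element]) for every term in the document."""
    found = []

    def visit(node, ancestors):
        chain = ancestors + [node.tag]
        for term in (node.text or "").split():
            found.append((term, chain))
        for child in node:
            visit(child, chain)
            # Text after a child's closing tag belongs to the current element.
            for term in (child.tail or "").split():
                found.append((term, chain))

    visit(ET.fromstring(xml_text), [])
    return found

book = ("<book><chapter><section><sentence>hello world</sentence>"
        "</section></chapter></book>")
memberships = term_ancestors(book)
```

Each term is reported with its full ancestry, from the document root down to the element that directly contains it.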
The structure of the model is based on the following observation: Terms from the same XML element have to be grouped together to be treated in the same way, and to be differentiated from terms in other XML elements. The primary reason is that the distribution of terms in one substructure may differ from the distribution in another substructure, or from the distribution of terms in the overall document. By taking into account the structural information, the classifier can achieve a context sensitivity that flat (unstructured) document models cannot achieve.
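The grouping of terms by element can be sketched in Python as follows. The sketch keys each sub-vector by the path of element tags from the root and contrasts it with a flat bag of terms. The representation and the names `structured_vector` and `flat_vector` are illustrative assumptions, not the data structures of the disclosed embodiment.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

def structured_vector(xml_text):
    """Return {element path: Counter of the terms found directly in that element}."""
    vector = defaultdict(Counter)

    def visit(node, parent_path):
        path = f"{parent_path}/{node.tag}"
        vector[path].update((node.text or "").lower().split())
        for child in node:
            visit(child, path)
            # Text that follows a child element still belongs to this element.
            vector[path].update((child.tail or "").lower().split())

    visit(ET.fromstring(xml_text), "")
    # Drop elements that hold no text of their own.
    return {path: counts for path, counts in vector.items() if counts}

def flat_vector(xml_text):
    """Flat baseline: a single bag of terms for the whole document."""
    flat = Counter()
    for counts in structured_vector(xml_text).values():
        flat.update(counts)
    return flat

page = "<page><title>golf course</title><body>of course we play</body></page>"
by_element = structured_vector(page)
whole_document = flat_vector(page)
```

In the flat vector the two occurrences of "course" are merged into one count, whereas the structured vector keeps the occurrence in the title separate from the occurrence in the body.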
The algorithm used by the classifier may be summarized by the following process:
The semi-structured document classifier of the present invention provides several features and advantages, among which are the following:
The foregoing and other features and advantages of the present invention are realized by a classifier that takes advantage of the hierarchical nature of documents exemplified by those in XML (Extensible Markup Language), or any other language whose structure is hierarchical in nature and includes tags with each element. The classifier presented herein uses the inherent structure of XML or other semi-structured documents to provide high quality semantic clues that may not otherwise be taken advantage of by term-based classification schemes. The classifier further relies on a robust statistical model and a structure-based context-sensitive feature for better classification.