The intensive growth of the World Wide Web (the "Web") as a widely accessible source of textual data demands a way of organizing the plethora of documents available on it into categories, so that they may be found and accessed more easily. Two main methods have been used to organize such Web-based text documents.
One method of organizing Web-based text documents is manual categorization, in which humans sift through the documents to be organized, ascertain their content, and categorize them accordingly. Manual categorization of Web-based text documents, however, may be problematic in certain instances.
Manual categorization may be laborious, tedious, and time consuming, and accordingly expensive. Further, given the high and growing rate at which such documents are produced, and other attributes of open-ended Web-based text document generation, it may be difficult for humans performing this task manually to keep pace with the proliferation of new text documents on the Web.
Further, the results of a number of studies agree that manual document categorization may also suffer from the subjectivity of human decision making. Thus, even where the humans performing manual categorization are all experts in their fields, their nuanced decisions may often differ considerably from one another. Even the same person may categorize differently at different times owing to physical and/or psychological factors such as fatigue, comfort, illness, mood, distraction, preoccupation, etc. Such subjective factors may result in categorization inconsistencies, and possibly in errors and omissions.
Owing to the possible problems and impracticality of manual categorization, techniques for the automated categorization of Web-based text documents have become important and popular during the past decade of growth of the Web and of access to it. Automated categorization may be applied either with predefined categories or with unknown categories. With predefined categories, automated classification is a matter of learning a classification model for each category (i.e., class) from a labeled set of examples (i.e., a training set) by use of a supervised machine learning algorithm. However, this technique is limited to the predefined categories, which may not always suffice for full classification.
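The supervised approach described above may be illustrated with a minimal sketch. The sketch assumes a multinomial naive Bayes model with Laplace smoothing, which is only one of many supervised algorithms that could be used; the function names, the whitespace tokenizer, and the four-document training set are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Learn per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of training docs
    vocabulary = set()
    for text, category in labeled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the predefined category with the highest log-posterior."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Start from the log prior of the category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[category][token] + 1)
                / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical labeled training set with two predefined categories.
TRAINING_SET = [
    ("stock market shares rally", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins championship game", "sports"),
    ("player scores winning goal", "sports"),
]

model = train(TRAINING_SET)
print(classify("bank shares rally", model))
print(classify("team scores goal", model))
```

Note that a document on a topic outside the training set, such as "quantum entanglement experiment", is still forced into one of the two predefined categories, which illustrates the limitation noted above.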
Besides requiring that the categories be known beforehand, automated categorization with predefined categories may be problematic because the training set must be created by labeling examples. The training set must also be sufficiently exhaustive to fully reflect the degree of variety within a category, which is often impracticable and self-limiting. This may be especially impractical with rich categories, such as documents constituting technical literature.
If the training set is not sufficiently exhaustive, omissions and errors may occur. For example, when an automated classifier using this technique encounters a document that does not fully match the category attributes learned from an insufficient training set, it may fail to classify the document as part of the category, or it may mis-categorize the document entirely.
Further, the level of training required for the mechanism to recognize patterns effectively may render this method, too, somewhat laborious, lengthy, and expensive. This may become especially difficult with documents constituting technical literature and/or other particularly rich categories.
When categories are not known beforehand, automated categorization of text documents may be more difficult than with predefined categories, because the categories themselves must be discovered from the document collection. Thus, automated techniques applied where categories are unknown must first discover the categories by use of an unsupervised machine learning algorithm (i.e., one that uses no training set). Only then may they proceed to further classification. This discovery step is additional work beyond that required when categories are predefined.
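Category discovery without a training set may be illustrated with a minimal sketch. The sketch assumes a single-pass ("leader") clustering over bag-of-words vectors with cosine similarity, chosen here only for brevity and determinism; it is not offered as the algorithm used by any particular conventional system. The function names, the similarity threshold of 0.2, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a document as a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def discover_categories(documents, threshold=0.2):
    """Group documents into clusters; no labels or training set are used."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for index, text in enumerate(documents):
        vector = vectorize(text)
        best, best_similarity = None, threshold
        for cluster in clusters:
            similarity = cosine(vector, cluster["centroid"])
            if similarity >= best_similarity:
                best, best_similarity = cluster, similarity
        if best is None:
            # No existing cluster is similar enough: start a new one.
            clusters.append({"centroid": Counter(vector), "members": [index]})
        else:
            # Fold the document's term counts into the cluster's summed vector.
            best["centroid"].update(vector)
            best["members"].append(index)
    return [cluster["members"] for cluster in clusters]


# Hypothetical unlabeled document collection.
DOCUMENTS = [
    "stock market shares rally",
    "team wins championship game",
    "stock shares fall as market dips",
    "team loses game in overtime",
]

groups = discover_categories(DOCUMENTS)
print(groups)
```

The output is a list of groups of document indices. The groups carry no names or descriptions, so a further step is still needed to interpret what each discovered category means.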
Conventional categorizing techniques attempt to group documents using the documents' entire vocabulary and applying term-goodness criteria. However, this produces somewhat flat groupings that are effectively blind to the semantics of the features, sometimes resulting in relatively meaningless groupings whose semantic coherence is at best non-obvious. The solution conventionally attempted for this issue is dimensionality reduction.
One example is latent semantic indexing. Unfortunately, however, this solution is often unsatisfactory, and the problem sometimes carries over: groupings thus formed may still be meaningless and lack obvious semantic coherence. Another example is using only certain phrases, and not others, as features. While some improvement in grouping meaningfulness is occasionally obtained in this way, more often the resulting groupings lack sufficient specificity.
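Reducing the vocabulary with a term-goodness criterion may be illustrated with a minimal sketch. The sketch assumes document-frequency thresholding, a simple criterion substituted here for illustration; it is not latent semantic indexing, and the text above does not specify which criterion conventional techniques apply. The function names, the thresholds, and the sample documents are hypothetical.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for text in documents:
        df.update(set(text.lower().split()))
    return df


def select_terms(documents, min_df=2, max_df_ratio=0.8):
    """Keep terms that are neither too rare nor too common."""
    df = document_frequency(documents)
    limit = max_df_ratio * len(documents)
    return sorted(term for term, n in df.items() if min_df <= n <= limit)


def project(text, kept_terms):
    """Map a document onto the reduced feature space as a count vector."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in kept_terms]


# Hypothetical document collection.
DOCUMENTS = [
    "the stock market rally",
    "the stock shares fall",
    "the team wins the game",
    "the team loses the game",
]

kept = select_terms(DOCUMENTS)
print(kept)
print(project(DOCUMENTS[0], kept))
```

The reduced vectors are shorter, but they remain plain term counts with no notion of what the terms mean, which is consistent with the observation above that such groupings are effectively blind to semantics.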
Thus, numerous problems are associated with prior art methods for automatically categorizing documents where the categories are unknown, including difficulty in generating coherent groupings and the obfuscation of contextual relationships or other semantic coherence within groupings.