1. Field of the Invention
The present invention generally relates to the field of automatic document categorization.
2. Prior Art
While the present invention has application to a wide range of different environments, the preferred embodiment is directed towards online newspapers. The principal product of online newspapers—current news reports and current news articles—has become a commodity. To many users, a given online news site's rendition of today's news events is essentially interchangeable with that of any other site. Thus the publishers' leverage is limited, regardless of the quality of reportage.
This situation suggests the existence of major opportunities for the development of unique and valuable news products. Based on extensive market research, the greatest opportunity appears to be in the development of content products that center on the news, rather than in products and services that are more peripheral to the main activity of news sites. This allows sites to focus on their core competencies.
News sites excel at publishing reports of current news, and with the rapid growth in the number of such sites, competition has become fierce. More important, news consumers have become overloaded—there's simply too much news out there to properly absorb. Equally important, consumers are inundated with constantly changing reports of news events, to the point that they are nearly desperate for some kind of context to help them better understand the significance of these events.
As a result, context is often at least as important as content. If a publisher can add a meaningful context to existing content, significant additional value can be generated for the content-plus-context combination. For example, if the publisher provides context in the form of a report on abortion whose basic form is a list of abortion-related articles, the initial appearance of the list (consisting mainly or exclusively of article headlines and summary paragraphs) contributes to the value of the entire collection.
Of all the potential news-centric products, those with the most potential for competitive differentiation all depend on the existence of a news classification or categorization capability.
However, actually achieving effective categorization in a news environment requires overcoming some major challenges. To begin with, (1) a large number of topics, generally several hundred or more, are required to categorize typical news stories at a reasonable level of detail. Also, (2) many of these topics are fairly close in definition; for example, topics concerning Iraq may relate to the occupation, to the oil reserves, to doing business in Iraq, to media issues, etc.
In addition, (3) the categorization must be accomplished in near real time. Typically, well over 200 articles must be categorized each day, often within a period of an hour or less. Plus, (4) new topics are constantly being added and existing topics are constantly being redefined: single topics are expanded into multiple topics, multiple topics are merged into a single topic, and, even more frequently, the definitions of topics are simply refined.
And, finally, it is essential that (5) error rates be kept very low. False positive errors (relating to what's called “precision”) are particularly egregious, creating a lack of confidence in the overall news report. False negative errors (relating to what's called “recall”) are similarly harmful, though somewhat less so; with too many of them, users lose confidence that the system is covering all the news events of importance and relevance. Some types of errors are more noticeable in (very dynamic) daily news reports, while others become more prominent in news reports (more static in nature) covering a longer period of time.
The state of the art in the technology (which is generally referred to as automatic text categorization/classification) is advancing strongly, and the application of this technology to processing news articles has become a high priority. Manual classification is out of the question, given the large number of categories, the large number of new articles that must be classified, the near real-time nature of the required response, and the demand for classification consistency and accuracy.
Rule-Based Text Categorization
Initially, text categorization technology focused on rules-based approaches, which are similar to the query-based technology found in search engines. Employing queries provided by users, the search engine analyzes processed articles (usually stored in the form of indices), performing Boolean logic matches between the query's terms and the content terms in the processed articles.
Using rules-based technology in searching, a query is developed and manually refined until it retrieves precisely the documents desired. To effectively use conventional search technology for text categorization, a mechanism for storing and retrieving the query expressions (often referred to as “canned searches”) is required.
Once the query has been tested and refined, the documents are typically correctly included or excluded. But over time, a growing number of documents inevitably become incorrectly classified; they are either incorrectly included (false positives) or excluded (false negatives). This means that the output of even the most refined query must be constantly monitored to detect when errors begin to become significant. This monitoring can require a very substantial and costly effort.
The improperly classified documents could be, as appropriate, included or excluded by adding an include/exclude reference modifier to the query. With this approach, however, the augmented query (with a growing set of exclude/include conditions appended to it) quickly becomes very unwieldy. Once that happens, the query must be reformulated and tested and refined all over again so that it once again accurately captures the proper set of documents. Unfortunately, every time a query is refined, the whole set of not only current, but also past documents it retrieves must be manually re-examined and verified.
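The rule-based approach described above, including the include/exclude modifiers, can be illustrated with a minimal sketch. The class name `CannedSearch`, its fields, the whitespace tokenization, and the sample articles are all hypothetical and chosen only for illustration; they are not taken from any particular search engine.

```python
from dataclasses import dataclass, field


@dataclass
class CannedSearch:
    """A stored Boolean query for one topic (hypothetical structure)."""
    topic: str
    all_of: set = field(default_factory=set)         # every term must appear (AND)
    any_of: set = field(default_factory=set)         # at least one must appear (OR)
    none_of: set = field(default_factory=set)        # no term may appear (NOT)
    force_include: set = field(default_factory=set)  # document ids patched in by hand
    force_exclude: set = field(default_factory=set)  # document ids patched out by hand

    def matches(self, doc_id: str, text: str) -> bool:
        # Per-document overrides take priority over the Boolean expression.
        if doc_id in self.force_exclude:
            return False
        if doc_id in self.force_include:
            return True
        terms = set(text.lower().split())  # crude tokenization (assumption)
        if not self.all_of <= terms:
            return False
        if self.any_of and not (self.any_of & terms):
            return False
        return not (self.none_of & terms)


query = CannedSearch(
    topic="Iraq oil reserves",
    all_of={"iraq"},
    any_of={"oil", "petroleum"},
    none_of={"recipe"},
)

articles = {
    "a1": "Iraq oil exports rise as new fields open",
    "a2": "Chef from Iraq shares olive oil recipe",
    "a3": "Iraq energy ministry reviews crude output",
}

results = {doc_id: query.matches(doc_id, text) for doc_id, text in articles.items()}

# Article a3 is on topic but uses neither "oil" nor "petroleum", so the query
# misses it (a false negative). It can be patched in by document id.
query.force_include.add("a3")
```

Each such patch corrects one document without improving the query itself, which is why the override sets grow until the query has to be reformulated.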
Instance-Based Text Categorization
To address these serious limitations of rule-based classifiers, a more advanced form of automated text categorization technology emerged, which is often called an “instance-based” or “machine learning” approach. Using the instance-based approach, the classifier analyzes a collection of documents that have been previously determined to belong to specified categories, and it develops an internal representation of the collection for each category. In essence, the instance-based classifier distills the common thread of meaning shared by the collection.
The instance-based classifier then compares new articles or documents against this internal collective representation (the “instance”) to determine whether the new article is related to this particular collection (characterized by a topic). A variety of different algorithms are employed, but they all have the common effect of computing how closely the article being analyzed relates to the specific pre-classified collection. The most common learning-based algorithms include nearest-neighbor, neural nets, decision trees, naive Bayesian and support vector machines.
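As one possible illustration of the instance-based approach, the following sketch builds a per-topic representation by summing term counts over pre-classified documents and assigns a new article to the topic with the highest cosine similarity. This nearest-centroid scheme is only one simple stand-in for the algorithms named above; the function names, the bag-of-words representation, and the tiny training corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (a deliberately crude representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(corpus):
    """Sum the term counts of each topic's pre-classified documents."""
    centroids = {}
    for topic, docs in corpus.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        centroids[topic] = total
    return centroids


def classify(text, centroids):
    """Return (best_topic, similarity) for a new article."""
    vec = vectorize(text)
    scores = {topic: cosine(vec, c) for topic, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical training corpus: documents already assigned to each topic.
training_corpus = {
    "oil": ["iraq oil exports rise", "crude oil output falls in iraq"],
    "media": ["iraq newspaper closed by officials", "journalists report from iraq"],
}

centroids = train(training_corpus)
topic, score = classify("oil output rises in northern iraq", centroids)
```

Note that `train` cannot run until `training_corpus` has been populated with correctly labeled documents for every topic.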
A major problem with instance-based classification, even though it is more advanced than rules-based technology, is that it presents a kind of Catch-22. Before categorization of new documents can proceed, a set of documents that have been previously (and accurately) categorized against each topic must exist and be available. The presumed existence of this “training corpus” raises the question of how it was arrived at in the first place.
Moreover, every time a new topic is introduced, a new training corpus must be obtained.
And, every time an existing topic is changed (a fairly frequent event in the news business) the previous training corpus must be discarded and replaced with a new collection of “training” documents that have been pre-classified to reflect that new topic definition.
The performance of instance-based automated text categorization strategies tends to drop significantly when the number of topics grows and, in particular, when the topic definitions become too similar (as is common with news-related topics). And finally, such technology tends to be computationally expensive and slow.
Text Categorization Performance
Regardless of the technical implementation used, classification system performance is most commonly defined by two parameters: precision and recall. Precision refers to the capability of the classification system to avoid mistakenly assigning a topic to a document unrelated to that topic. Recall refers to the capability of the classification system to identify all documents that are in fact related to the topic. Combined, precision and recall characterize the overall accuracy of the classification system.
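The two measures can be computed for a single topic as sketched below. The function name, the document identifiers, and the convention of returning 1.0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall(assigned, relevant):
    """Compute precision and recall for one topic.

    assigned: set of document ids the classifier assigned to the topic
    relevant: set of document ids that truly belong to the topic
    """
    true_positives = len(assigned & relevant)
    # Precision: fraction of assigned documents that are truly relevant.
    precision = true_positives / len(assigned) if assigned else 1.0
    # Recall: fraction of relevant documents that were actually assigned.
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall


assigned = {"a1", "a2", "a3", "a4"}        # a4 is a false positive
relevant = {"a1", "a2", "a3", "a5", "a6"}  # a5 and a6 are false negatives
precision, recall = precision_recall(assigned, relevant)
# Three of the four assigned documents are relevant: precision = 0.75.
# Three of the five relevant documents were found:   recall = 0.6.
```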
Though a given classification system configuration will produce results with a specific precision and recall measurement, the levels of precision and recall can be adjusted by adjusting the classification system's configuration. Precision and recall are generally inversely related. As the classification system's configuration is changed, the precision/recall performance will typically follow a curve similar to that shown in FIG. 1. At the midpoint 102, precision and recall are approximately equal. At the lower point 103, recall has increased at the expense of precision, while at the upper point 101 precision has increased while recall has diminished.
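The inverse relationship can be sketched by treating the adjustable configuration as a single score threshold, which is an assumption made here for illustration; real systems may expose other settings. The scores and relevance labels below are invented sample data, not measurements.

```python
def sweep(scored, thresholds):
    """Return (threshold, precision, recall) for each threshold.

    scored: list of (score, is_relevant) pairs for one topic
    """
    total_relevant = sum(1 for _, is_relevant in scored if is_relevant)
    curve = []
    for t in thresholds:
        # Documents scoring at or above the threshold are assigned the topic.
        assigned = [is_relevant for score, is_relevant in scored if score >= t]
        true_positives = sum(assigned)
        precision = true_positives / len(assigned) if assigned else 1.0
        recall = true_positives / total_relevant if total_relevant else 1.0
        curve.append((t, precision, recall))
    return curve


# Hypothetical classifier scores paired with the true relevance of each document.
scored = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, True), (0.10, False),
]

curve = sweep(scored, thresholds=[0.85, 0.50, 0.20])
for threshold, precision, recall in curve:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With this sample data, the strict threshold gives high precision and low recall, the middle threshold gives equal precision and recall, and the loose threshold gives full recall at reduced precision, which corresponds to the three points described for FIG. 1.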
Each classification system will produce its own precision-recall curve. Referring to FIG. 2, a simple rule-based system will have a performance curve 201 that is generally lower than the performance curve 202 of an instance-based system. If a more complex, structure-aware rule-based system, or an instance-based system with a larger training set, is employed, its performance curve 203 will generally be higher still.
The preferred embodiment of the present invention aims to generate enhanced news content products based on the classification of news articles. Research conducted in support of this embodiment strongly suggests that for the enhanced news products to be acceptable to their intended market audiences, the classification of the news articles must be extraordinarily accurate. Specifically, this research suggests that the precision level should be greater than 99%; that is, fewer than one out of every hundred news articles is incorrectly classified. That same research suggests that the target recall should be greater than 97%.
Evaluation of existing automated document classification systems concluded that none of them were capable of achieving those levels of precision and recall, particularly on a sustained basis. As shown in FIG. 3, this gave rise to the present invention whose performance curve 304 not only exceeds the performance of prior art classification systems 301-303, but is sufficient to operate in the performance region 305 defined by the 99% precision and 97% recall goals.
Beyond precision and recall, there are other important factors that affect the practical usability of a classification system in different environments. These additional factors, which are explained in more detail below, include:
1. provision for multiple classifications;
2. provision for classification confidence levels;
3. preparation or training speed;
4. (for learning-based classifiers) sensitivity to training corpus errors;
5. ease in changing or adding a classification topic;
6. ease of detecting and correcting classification errors; and,
7. classification throughput.
Multiple classifications (1) refer to the capability of assigning more than one topic to a given document. This is an essential capability for many applications. However, when multiple classifications are supported, the classification system is well served by a method of assessing the significance of each of the multiple classifications. Such a method is the provision of confidence levels (2).
Confidence levels are relative indicators of the confidence that can be placed in the decisions of the classification system. Confidence levels typically range from 0.0 to 1.0 inclusive, with 0.0 indicating the lowest confidence level and 1.0 representing the greatest confidence. Most classification systems do not provide confidence levels associated with the classification outcomes.
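A minimal sketch of how multiple classifications and confidence levels might work together is given below. The function name, the cutoff value of 0.5, and the sample scores are hypothetical; the sketch assumes only that some upstream classifier has already produced a confidence between 0.0 and 1.0 for each candidate topic.

```python
def assign_topics(scores, min_confidence=0.5):
    """Return every topic whose confidence meets the cutoff, highest first.

    scores: mapping of topic name to a confidence in [0.0, 1.0]
    """
    for topic, confidence in scores.items():
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence for {topic!r} is outside [0.0, 1.0]")
    kept = [(t, c) for t, c in scores.items() if c >= min_confidence]
    # Sorting by confidence ranks the significance of each classification.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical confidences for one article against three closely related topics.
article_scores = {
    "Iraq oil reserves": 0.61,
    "Iraq occupation": 0.92,
    "Iraq media issues": 0.18,
}

topics = assign_topics(article_scores)
# The article receives two classifications; the third falls below the cutoff.
```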
Preparation or training time (3) refers to the time necessary to generate the classifiers using either rules or training data.
When employing learning-based classification systems, a significant factor pertains to the number of errors in the training data (4) used to generate the internal classifiers. Deficiencies in training data produce “noise” in the resultant classification decisions that can be sufficient to seriously detract from the performance otherwise achievable by the classification system.
In applications where the topic set upon which the classifications are made changes frequently, the classification system's ability to adapt to such changes (5) can become critical. In the preferred embodiment, the flow of newly published news articles can overwhelm classification systems designed for more static environments.
Ease of identifying and correcting errors (6) refers to the capability for monitors, human or machine, to determine that a classification error has occurred, and the capability to make corrections that not only change the specific error, but that prevent it from occurring again in the future.
Classification throughput (7) refers to the capability of a classification system to process the volume of documents it is presented with.
Thus, for the reasons set forth above, the standard classification systems are not well suited for many applications, such as classifying news articles. What is needed is a classification system that is far more accurate (with a precision above 99% and a recall above 97%), that can make multiple classification assignments for each document, and that can easily adapt to changes in topic definitions as well as to the addition of new topics.