It is often a difficult task for computing systems to receive a document or other content and to determine a meaning or content of the document. For example, while it may be straightforward to determine individual words within the document, it is often difficult to determine (with a given degree of certainty) a context of a given word, or relationships between words which impart meaning to the document as a whole. As a more specific example, it may be straightforward for a computing system to determine that a document contains the word “bank.” However, it may be more problematic for the computing system to determine whether the word “bank” in the document refers to, e.g., a financial institution, a bank of a river, or a turning of an airplane, all of which may be referred to using the term “bank” (or variations thereof).
Nonetheless, it is known that such information about the meaning or content of a document may potentially be very useful with respect to use of the document. For example, advertisers may wish to know about the content of a document, so as to more accurately and more meaningfully place their advertisements within content-related documents. For example, a financial institution may wish to place an advertisement within a document using the word bank in the context of finance, but not within one of the other contexts just referenced above. Consequently, computing systems and applications have been developed for determining a content, context, or meaning of documents, e.g., for the purpose of providing advertisements within such documents, or otherwise benefitting from knowledge about the content or meaning thereof.
One such technique may be referred to as taxonomic classification. In taxonomic classification, a taxonomy related to a particular topic or context is developed which includes a plurality of hierarchical categories, e.g., in a tree structure. For example, a taxonomy related to automobiles may include a first level categorizing automobiles as used or new. A lower level in the hierarchy of categories may subdivide each of the above categories into foreign or domestic cars, and still lower levels may continue to branch into further defining characteristics of cars, including, e.g., a make, model, price or other feature of cars that may be associated with the taxonomy.
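By way of non-limiting illustration only, a hierarchy of categories such as the automobile taxonomy just described might be represented as a simple tree of nodes. In the following sketch, the class name Category, its methods, and the particular makes and models are hypothetical and are chosen solely for purposes of example; many other representations would serve equally well.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Tuple


@dataclass
class Category:
    """One node in a hierarchical taxonomy (hypothetical structure)."""
    name: str
    children: Dict[str, "Category"] = field(default_factory=dict)

    def add_path(self, *names: str) -> "Category":
        """Create (or reuse) the chain of sub-categories named by `names`."""
        node = self
        for name in names:
            node = node.children.setdefault(name, Category(name))
        return node

    def paths(self, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
        """Yield the root-to-leaf path of every leaf category."""
        here = prefix + (self.name,)
        if not self.children:
            yield here
        for child in self.children.values():
            yield from child.paths(here)


# Illustrative automobile taxonomy: used/new, then foreign/domestic,
# then make and model. The makes and models are assumptions.
automobiles = Category("automobiles")
automobiles.add_path("new", "foreign", "Honda", "Civic")
automobiles.add_path("new", "domestic", "Ford", "Focus")
automobiles.add_path("used", "foreign", "Honda", "Civic")

leaf_paths = list(automobiles.paths())
```

In such a representation, each root-to-leaf path (e.g., automobiles, new, foreign, Honda, Civic) identifies one category that may later be applied as a label.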
Then, in taxonomic classification, a generally large set of known documents may be considered, parsed, or otherwise analyzed to apply the hierarchical categories (and/or other features of the taxonomy) as labels to individual documents (or portions thereof) within the set. For example, a number of human readers may be employed to read each of the individual documents within the set of known documents, and to apply categories or other features of the taxonomy as labels to individual elements of each document. For example, a human user may read a document, identify the word “civic,” and, if appropriate, associate the word “civic” with the automobile model Honda Civic, where, as just referenced, such an automobile model may be a category within the hierarchy of categories of the associated automobile taxonomy. Consequently, the so-labeled document may be categorized or labeled with respect to the automobile taxonomy, and not with reference to, for example, a civic duty of a citizen, or other meaning.
When all of the documents of the known set of documents have been appropriately labeled as just described, then the resulting set of labeled documents may be referred to or known as a “golden set,” or a “training set.” Known techniques exist for analyzing such a training set to determine a classifier model. Such a classifier model, in general, represents rules or other criteria which are derived from the labeled documents. For example, such a classifier model may include a set of rules which, for each labeled word or term, considers other factors, such as a proximity of the labeled word to other words within the document, and assigns a probability that the word, in the particular context, has one or more given meanings. Then, a taxonomic classifier may be used to receive or otherwise determine a new document which is not a part of the set of labeled or categorized documents, and to implement the classifier model in conjunction with the original taxonomy in order to classify the newly received document with respect to the taxonomy. Once that classification has occurred, the taxonomic classifier may be further configured to attach, insert or otherwise provide supplemental content which is thought to be related to the newly received and now-classified document.
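By way of non-limiting illustration only, the derivation of a classifier model from a training set, and its application to a new document, might be sketched as follows. The sketch assumes a naive Bayes model with add-one smoothing over the words found within a fixed window of the labeled word; this is merely one of many known techniques and is not asserted to be the technique used by any particular system. The example documents, the label names, the window size, and the function names context, train, and classify are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

AUTO = "automobiles/Honda/Civic"   # hypothetical taxonomy category
OTHER = "not-automobile"           # hypothetical catch-all label

# A tiny hand-labeled "training set": (document text, labeled word, label).
# The documents are invented for illustration.
TRAINING_SET: List[Tuple[str, str, str]] = [
    ("the honda civic engine is fuel efficient", "civic", AUTO),
    ("test drive a used civic sedan today", "civic", AUTO),
    ("voting is a civic duty of every citizen", "civic", OTHER),
    ("the mayor praised civic engagement by every citizen", "civic", OTHER),
]


def context(text: str, target: str, window: int = 3) -> List[str]:
    """Return the words within `window` positions of each `target` occurrence."""
    words = text.lower().split()
    nearby: List[str] = []
    for i, word in enumerate(words):
        if word == target:
            nearby += words[max(0, i - window):i] + words[i + 1:i + 1 + window]
    return nearby


def train(examples: List[Tuple[str, str, str]]):
    """Count, per label, how often each word appears near the labeled word."""
    label_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    for text, target, label in examples:
        label_counts[label] += 1
        word_counts[label].update(context(text, target))
    return label_counts, word_counts


def classify(model, text: str, target: str) -> Dict[str, float]:
    """Return a probability per label using add-one-smoothed naive Bayes."""
    label_counts, word_counts = model
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    log_scores: Dict[str, float] = {}
    for label, count in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocabulary)
        score = math.log(count / total)
        for word in context(text, target):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        log_scores[label] = score
    # Convert log scores into probabilities that sum to 1.
    peak = max(log_scores.values())
    weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


model = train(TRAINING_SET)
probs = classify(model, "a used honda civic engine for sale", "civic")
best = max(probs, key=probs.get)
```

In this sketch, the new document is assigned to the automobile category because the words appearing near “civic” (e.g., “used,” “honda,” “engine”) were observed near the labeled word only in the automobile-labeled training documents. A system of the type described above might then select supplemental content associated with the resulting category.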
Such techniques have proven very useful in classifying newly received documents which would otherwise be difficult to classify with respect to the taxonomy. However, the need for human users to read the original set of documents and assign labels to portions thereof to create the training set, as just described, represents a significant bottleneck in the classification process, and adds a large amount of delay and expense to the process as a whole. For example, it may take users days or longer to read each of the documents within the original/known set of documents, and each of the users may be compensated for his or her efforts. Further, whenever some element of the taxonomy or the set of documents changes, then the process must be repeated in whole or in part, which, again, may add significant delay and expense to the classification process as a whole. In particular, such changes may need to occur rapidly in order to keep up with changing content of the documents (e.g., when a new and very popular product or concept appears within the documents). Consequently, it may be problematic to implement taxonomic classification in an effective manner, and in a manner which is fast, inexpensive, and easily updatable.