1. Field of the Invention
This invention relates to the field of classification systems, and in particular to the selection of the features and combinations of features that are used to determine a given sample's classification.
2. Description of Related Art
Consumers are being provided an ever-increasing supply of information and entertainment options. Hundreds of television channels are available to consumers, via broadcast, cable, and satellite communications systems, and the Internet provides a virtually unlimited supply of material spanning most fields of potential interest. Because of the increasing supply of information, entertainment, and other material, it is becoming increasingly difficult for a consumer to locate material of specific interest. A number of techniques have been proposed for easing the selection task, most of which are based on a classification of the available material's content, and a corresponding classification of a user's interest.
A number of methods are available for characterizing the content of a particular piece of material. In the entertainment field, television guides containing a synopsis of each program are available, and automated systems have been proposed for categorizing programs, and segments of programs, based on an analysis of the images contained in each image frame. In the information field, web crawlers are used to extract key words and phrases from each web page to facilitate the search for material based on such key words or phrases, or synopses of select web pages are manually created to form an index to facilitate these searches. In like manner, speech recognition techniques may be employed to create an index of key words used in a television or radio program, or in the lyrics of a song, and so on. Other characterization methods are based on additional factors. For example, the time of day, day of the week, and season of the year may be included in the characterization of broadcast entertainment material, distinguishing, for example, between “prime time” programs and “before dawn” programs, as a potential indicator of program quality or popularity. The producer, director, actors, broadcast network, type of provider, and so on, may also be used to characterize a program. In the information field, similar parameters may also be used, such as the number of “hits” a particular web page experiences per day, the number of other web pages that reference this web page, the author of the web page, and so on.
For ease of reference, the term “content material” is used hereinafter to refer to material that is related to the contents of information items, entertainment items, and other items that are potentially available for classification or characterization. The content material may include the contents of the information or entertainment item itself, an abstract or synopsis of the item, information related to the creation or presentation of the item, and so on. The term “feature” is used hereinafter to refer to a characteristic that is potentially available to facilitate the classification or characterization. For example, each word in a synopsis of a television program is a feature that can be used to facilitate the characterization of the content material of that television program; the director's name is also a feature, as is the time of day that the program is broadcast. In like manner, each key word of a web page is a feature, as is the provider of the web page, the family of pages to which this page belongs, and so on.
The effectiveness and efficiency of a classification system are highly dependent upon the choice of features used to classify the content material. This effectiveness and efficiency are particularly dependent upon the choice of features that comprise a combination of features. The choice of features that comprise a combination of features is often a subjective choice, and is often a manually intensive process. For example, it is straightforward to use the words of a synopsis as the set of features that will be used to classify a television program. Each synopsis is processed to identify each word and to remove noise words. The resultant list of words used in the synopsis, potentially ordered by their frequency of occurrence, is stored in a database for subsequent processing to determine the subject matter classification for that content material, or to determine whether these words are correlated with words that are related to a user's preference, and so on. Not every word, however, is equally effective in distinguishing among programs of different classifications. Some words, for example, may have a high frequency of occurrence in programs, regardless of the program's classification. Other words may have a low frequency of occurrence, but when they appear, are highly effective for distinguishing between program classifications. Evolutionary algorithms, discussed below, have been demonstrated to be particularly effective for determining the combination of features that provides a high degree of distinction among programs of differing classifications. In a traditional evolutionary algorithm, a chromosome is formed that contains combinations of features; in the above example, the chromosome would contain a subset of all the words used in the synopses of many programs. Different chromosomes would contain different subsets.
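The synopsis processing described in this paragraph (identifying each word, removing noise words, and ordering the remaining words by frequency of occurrence) might be sketched as follows. This is only an illustrative sketch: the function name `extract_features`, the tokenizing pattern, and the short `NOISE_WORDS` list are assumptions made for the example and are not taken from any particular system.

```python
import re
from collections import Counter

# Hypothetical noise-word list, kept short for illustration only.
NOISE_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "with"}


def extract_features(synopsis):
    """Return (word, count) pairs for the non-noise words of a synopsis,
    ordered by descending frequency, then alphabetically."""
    # Identify each word: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", synopsis.lower())
    # Remove noise words and count the remaining ones.
    counts = Counter(w for w in words if w not in NOISE_WORDS)
    # Order by frequency of occurrence; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


synopsis = "A detective hunts the thief; the thief outwits the detective in the city."
features = extract_features(synopsis)
```

Each resulting word is a candidate feature; the list says nothing yet about which words actually discriminate between classifications.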
If a particular set of words is effective in distinguishing programs, each chromosome that contains these words in its subset of words will generally exhibit a better classification performance than a similar chromosome with fewer of these particular words, whereas the presence or absence of words that are common to a variety of classifications will not significantly affect their chromosomes' classification performance. By continually evolving alternative chromosomes based on the performance of prior chromosomes, with a preference for the evolution of chromosomes having traits (subsets of words) similar to those of the better performing prior chromosomes, the performance of the evolved chromosomes can be expected to increase. At the end of the evolutionary process, a single chromosome, or subset of words, is selected as the best performing set of words for distinguishing among program classifications.
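The evolutionary process outlined above might be sketched as follows, under several stated assumptions. Each chromosome is a fixed-size subset of words; its performance is measured here by a leave-one-out nearest-neighbour accuracy on a tiny, invented set of labelled synopses; the better half of each generation survives and produces offspring by recombination and occasional mutation. The names `accuracy` and `evolve`, the parameter values, and the toy data are all hypothetical, and the fitness measure is a stand-in for whatever classification performance measure a real system would use.

```python
import random


def accuracy(subset, samples):
    """Leave-one-out 1-nearest-neighbour accuracy using only words in `subset`."""
    correct = 0
    for i, (words_i, label_i) in enumerate(samples):
        best_label, best_overlap = None, -1
        for j, (words_j, label_j) in enumerate(samples):
            if i == j:
                continue
            overlap = len(words_i & words_j & subset)
            if overlap > best_overlap:
                best_label, best_overlap = label_j, overlap
        correct += best_label == label_i
    return correct / len(samples)


def evolve(vocabulary, samples, pop_size=20, generations=30,
           subset_size=3, mutation_rate=0.2, seed=0):
    """Evolve word subsets; return the best subset and the best score per generation."""
    rng = random.Random(seed)
    vocab = sorted(vocabulary)  # sorted lists keep the run reproducible
    population = [frozenset(rng.sample(vocab, subset_size)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=lambda c: accuracy(c, samples), reverse=True)
        history.append(accuracy(ranked[0], samples))
        parents = ranked[: pop_size // 2]  # better performers survive unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Recombination: the child draws its words from both parents.
            child = set(rng.sample(sorted(mother | father), subset_size))
            if rng.random() < mutation_rate:
                # Mutation: swap one word for a word not already in the child.
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice([w for w in vocab if w not in child]))
            children.append(frozenset(child))
        population = parents + children
    best = max(population, key=lambda c: accuracy(c, samples))
    return best, history


# Invented toy data: (words of a synopsis, classification).
samples = [
    ({"detective", "murder", "city"}, "crime"),
    ({"detective", "thief", "night"}, "crime"),
    ({"murder", "thief", "family"}, "crime"),
    ({"wedding", "laugh", "family"}, "comedy"),
    ({"laugh", "prank", "city"}, "comedy"),
    ({"wedding", "prank", "night"}, "comedy"),
]
vocabulary = set().union(*(words for words, _ in samples))
best, history = evolve(vocabulary, samples)
```

Because the surviving parents are carried over unchanged, the best score recorded in `history` never decreases from one generation to the next; words such as "city" or "night", which appear in both classifications in the toy data, contribute little to a chromosome's score.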
The need for a selection of a set of features that provides an effective and efficient means of characterizing or classifying content material is particularly important as the resources available for such characterizing or classifying become limited. For example, as technologies become available, viewers will expect their newly acquired home entertainment systems to provide program selection assistance, based on a “preferences” profile. These systems, however, will typically contain limited processing and storage capabilities, and may not, for example, be able to store every word and phrase of every synopsis available for such selection assistance. The inclusion of a non-discriminating word in the limited storage will be wasteful, and, more significantly, may also decrease the classification accuracy by introducing false distinctions. Thus, a classification system must be effective in the dual task of selecting effective discriminating features and excluding counter-productive non-discriminating features, and, in general, the effects of including or excluding a feature are non-additive.
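The non-additive effect of including or excluding a feature can be illustrated with a small constructed example. In the sketch below, the data, the feature names `word_a` and `word_b`, and the helper `best_accuracy` are all hypothetical; the helper reports the best accuracy achievable on the samples by any rule that looks only at the chosen features.

```python
from collections import Counter, defaultdict


def best_accuracy(samples, features):
    """Best accuracy any rule can reach on `samples` using only `features`."""
    groups = defaultdict(Counter)
    for values, label in samples:
        # Samples that look identical on the chosen features share a group.
        key = tuple(values[f] for f in features)
        groups[key][label] += 1
    # Within each group the best rule predicts the majority label.
    return sum(max(c.values()) for c in groups.values()) / len(samples)


# Constructed data: the class depends on the two words jointly, not individually.
samples = [
    ({"word_a": 0, "word_b": 0}, "x"),
    ({"word_a": 0, "word_b": 1}, "y"),
    ({"word_a": 1, "word_b": 0}, "y"),
    ({"word_a": 1, "word_b": 1}, "x"),
]

alone_a = best_accuracy(samples, ["word_a"])            # 0.5
alone_b = best_accuracy(samples, ["word_b"])            # 0.5
together = best_accuracy(samples, ["word_a", "word_b"])  # 1.0
```

Neither word alone improves on guessing in this constructed case, yet the pair classifies every sample correctly, so the value of a feature cannot be judged in isolation from the other features it is combined with.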