Various forms of media, such as articles, videos, and images, typically have some form of text associated therewith, either in the body of the document or elsewhere, that provides information as to the content of the media. Often, this media is stored and retrieved according to the associated text. Media can be stored and retrieved more efficiently when it is classified by subject, which reduces the time necessary to file and retrieve it. Prior storage systems have typically classified media into mutually exclusive categories such as news, sports, and entertainment. This approach is commonly referred to as flat partitioning.
In order to classify media into these mutually exclusive categories or partitions, key vocabulary consisting of one or more terms is extracted from the text. These terms are typically stemmed, that is, the ending of each term is removed and only the root is retained. Furthermore, when key vocabulary is captured, terms are often taken individually rather than as a group or phrase. As a result, key vocabulary can lose its primary meaning. Based on these individual, stemmed terms, the media with which they are associated is classified into one of a plurality of flat partitions. However, improper or poor classification of media can occur, depending on how the vocabulary is extracted and whether the key vocabulary has lost its primary meaning due to stemming and other filtering.
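
By way of illustration only, the conventional approach described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular prior system: the suffix list, the `naive_stem` function, the `PARTITIONS` vocabularies, and the overlap-count scoring are all hypothetical and chosen for brevity. A practical system would likely use an established stemming algorithm and a trained classifier.

```python
import re

# Hypothetical suffix list, for illustration only; a real system would
# likely use an established stemming algorithm rather than this shortcut.
SUFFIXES = ("ing", "ers", "er", "es", "s")


def naive_stem(term):
    """Strip the first matching suffix, keeping at least three characters of root."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def extract_key_vocabulary(text):
    """Lowercase the text, split it into individual terms, and stem each term."""
    return [naive_stem(term) for term in re.findall(r"[a-z]+", text.lower())]


# Hypothetical flat partitions, each described by a set of stemmed terms.
PARTITIONS = {
    "news": {"election", "govern", "polic"},
    "sports": {"play", "team", "bowl", "score"},
    "entertainment": {"movie", "music", "star"},
}


def classify(text):
    """Assign the text to exactly one partition by counting overlapping stems."""
    stems = extract_key_vocabulary(text)
    scores = {
        name: sum(stem in vocabulary for stem in stems)
        for name, vocabulary in PARTITIONS.items()
    }
    # Ties are resolved in favor of the partition listed first.
    return max(scores, key=scores.get)


# The phrase loses its meaning once split into individual, stemmed terms.
print(extract_key_vocabulary("The Rolling Stones"))  # ['the', 'roll', 'ston']

# A historical text is filed under sports because of the isolated term "bowl".
print(classify("Dust Bowl farmers"))  # sports
```

In this sketch, the phrase "The Rolling Stones" is reduced to three unrelated roots, and a text concerning the Dust Bowl is placed in the sports partition solely because the individual term "bowl" appears in that partition's vocabulary. Both outcomes illustrate how individually taken, stemmed terms can lead to improper classification.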