1. Field of the Invention
The present invention relates to content classification and, more particularly, to a system and method for clustering content according to similarity.
2. Description of Related Art
Most content is now stored in digital form and accessible over networks. For example, Document Management Systems (DMS) provide repositories of documents that can be searched and accessed over a computer network. Most DMS implementations are within a domain, such as a company, and are used to store documents that can be categorized in a relatively narrow set of topics. For example, a law firm may have legal briefs and other legal documents stored in a DMS. Also, downloadable or streaming media content is available in various domains.
Various repositories of documents and other content items can also be accessed over the internet. The most common way of discovering content on the internet is through search engines, which index the content and then provide links to it in response to keyword or topical search queries. More recently, it has become popular to associate topical or other descriptive tags, drawn from a set of tags, with content to facilitate content discovery and retrieval. The set of tags can be arranged in an ontology or other arrangement and applied to content in a manner that helps describe the content. Tags facilitate content discovery because indexing of the document is not required and because the tags convey what the content is about in a semantic or topical sense. Ideally, the set of tags associated with a document represents a compressed or minimal description of the document, which serves both to associate the document with its most similar neighbors and to discriminate it from others unlike it.
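By way of illustration only, the notion that a tag set can associate a document with its most similar neighbors may be sketched as a set-overlap comparison. The sketch below assumes Jaccard similarity as the measure, and the names `jaccard`, `most_similar`, and the sample documents and tags are hypothetical; nothing here is drawn from any particular tagging system.

```python
def jaccard(tags_a, tags_b):
    """Return the Jaccard similarity of two tag collections (0.0 to 1.0)."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    if not union:
        # Two untagged items share nothing measurable; treat as dissimilar.
        return 0.0
    return len(a & b) / len(union)


def most_similar(name, docs):
    """Return the name of the document whose tag set best overlaps `name`'s."""
    scored = [
        (jaccard(docs[name], tags), other)
        for other, tags in docs.items()
        if other != name
    ]
    return max(scored)[1]


# Hypothetical documents and tags, assumed for this example only.
docs = {
    "brief_1": {"contract", "litigation", "appeal"},
    "brief_2": {"contract", "litigation", "settlement"},
    "memo_1": {"payroll", "benefits"},
}

nearest = most_similar("brief_1", docs)
```

In this sketch, the two briefs share two of four distinct tags, while the memo shares none, so the tag sets both group the briefs together and separate them from the memo.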
However, there are many limitations to developing a set of tags and associating tags with content. For example, different domains may use different sets of tags and tag arrangements, which may cause inconsistencies and even a lack of interoperability between domains. Even within a domain with a predetermined tag arrangement, the sheer amount of content makes it difficult to apply tags in a meaningful manner. Tools for automated tagging exist, but such tools are limited and are not effective across a broad spectrum of topics and content. Furthermore, when analyzing and forming groups of content, tags alone may not accurately reflect the similarity of one item of content to one or more other items of content.