Clustering is a statistical process that attempts to find common structure in a collection of items. In doing so, it separates the collection into discrete groups whose members share some common feature. Often, a threshold level of commonality determines which items are grouped together under a given topic name. An item that does not satisfy the threshold for one group may either be placed in another cluster or be forced to begin a new group. This process continues until all items have been considered.
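The threshold-based process described above can be illustrated with a minimal sketch. It is not a definitive implementation: the function names `jaccard` and `threshold_cluster`, the use of word overlap as the measure of commonality, the 0.5 threshold, and the choice of each cluster's first member as its representative are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Fraction of distinct words shared by two items (0.0 to 1.0).

    Assumption: word overlap stands in for the unspecified "commonality" measure.
    """
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def threshold_cluster(items, threshold=0.5, similarity=jaccard):
    """Greedy single-pass clustering (hypothetical helper).

    Each item is compared with the first member of every existing cluster.
    It joins the most similar cluster if that similarity meets the threshold;
    otherwise it begins a new cluster.
    """
    clusters = []
    for item in items:
        best_cluster, best_score = None, 0.0
        for cluster in clusters:
            score = similarity(item, cluster[0])
            if score > best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is not None and best_score >= threshold:
            best_cluster.append(item)
        else:
            clusters.append([item])  # item failed every threshold: new group
    return clusters


if __name__ == "__main__":
    phrases = [
        "civil war battles",
        "civil war generals",
        "garden plants",
        "civil war battles map",
    ]
    for group in threshold_cluster(phrases):
        print(group)
```

Raising the threshold in this sketch produces more, smaller groups, while lowering it merges loosely related items into fewer groups.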
Clustering is a common and especially helpful technique for organizing large collections of data. In the life sciences, clustering is used to catalogue life forms, such as plants and animals, into species and subspecies. Clustering is also widely used in the information sciences to organize text and numbers. For example, where the collection consists of text-based documents, clustering may create groups of documents based on the words or phrases they have in common. This type of clustering may, for example, group together “civil war”-related documents.
For some time, numeric and document clustering had to be performed manually by human editors, who reviewed and scored each item to determine where it would be catalogued. With the advent of the computer, however, automated grouping via clustering algorithms has made it easier to update clusters that require continual additions.
The recent advent of the Internet and electronic word processing has created an increased need for automated clustering of words and phrases. Internet search engines, electronic thesauruses, and electronic spell checkers, for example, operate on short phrases or individual words. In the context of Internet search engines, a user inputs a short phrase or single-word query. The search engine then searches the Internet or a categorization of web sites, looking for web pages containing words or phrases similar to the query. Most search engines do not require the web page to contain exactly matching content. However, prior art search engines are limited by the accuracy of the query that is input. For example, misspellings, missing quotation marks, and other related errors often cause the search engine to return no results or irrelevant results. Therefore, it would be beneficial to provide an automated clustering technique that finds commonality among single words or phrases and places them into discrete groups. In this way, the clusters may be used to provide alternative words or phrases to a user or directly to a search engine, for example.
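As a rough illustration of how existing clusters of words or phrases might supply alternatives for an inaccurate query, consider the following sketch. Everything in it is assumed for the example: the function name `suggest_alternatives`, the sample clusters, the 0.6 cutoff, and the use of the Python standard library's `difflib.SequenceMatcher` as the string-similarity measure. The source does not specify any of these.

```python
from difflib import SequenceMatcher


def suggest_alternatives(query, clusters, threshold=0.6):
    """Return the members of the cluster that best matches the query.

    Hypothetical helper: the query is compared with every member of every
    cluster using a character-level similarity ratio. If the best ratio meets
    the threshold, the other members of that cluster are returned as
    alternative words or phrases; otherwise the result is empty.
    """
    best_cluster, best_score = None, 0.0
    for cluster in clusters:
        for member in cluster:
            score = SequenceMatcher(None, query.lower(), member.lower()).ratio()
            if score > best_score:
                best_cluster, best_score = cluster, score
    if best_cluster is None or best_score < threshold:
        return []
    # Exclude the query itself so only genuine alternatives are offered.
    return [m for m in best_cluster if m.lower() != query.lower()]


if __name__ == "__main__":
    # Assumed sample clusters, as might be produced by a prior clustering step.
    clusters = [
        ["civil war", "civl war", "the civil war"],
        ["garden plants", "gardin plants"],
    ]
    print(suggest_alternatives("cival war", clusters))
```

In this sketch, the returned alternatives could be shown to the user or submitted to the search engine in place of the original query.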