The present invention relates to the field of text classification. More specifically, the present invention relates to the grouping of words and phrases into clusters of related words and phrases.
Clustering is a statistical process that attempts to find common structures in a collection of items. In so doing, clustering separates the entire collection of items into discrete groups whose members have some common feature. Often, a threshold level of commonality is used to determine which items will be grouped together with a certain topic name. An item that does not satisfy the threshold either may be grouped with another cluster or forced to begin a new group. This process continues until all items have been considered.
Clustering is a common and especially helpful technique for organizing large collections of data. In the life sciences, clustering is used to catalogue various life forms, such as plants and animals, into species and subspecies categories. Also, clustering is widely used in information sciences to organize text and numbers. For example, where the collection of items are text-based documents, clustering may create groups of documents based on the commonality of individual words or phrases within the documents. This type of clustering may allow the grouping of xe2x80x9ccivil warxe2x80x9d-related documents, for example.
For some time, numeric and document clustering had to be accomplished manually by human editors who reviewed and scored each item to determine where it would be catalogued. However, with the advent of the computer, automated grouping via clustering algorithms has made it easier to update clusters that require continual additions.
The recent advent of the Internet and electronic word processing has created an increased need for automated clustering of words and phrases. Specifically, Internet search engines, electronic thesauruses, and electronic spell checkers, for example, operate on short phrases or individual words. In the context of Internet search engines, a user inputs a short phrase or single-word query. The search engine then searches the Internet or a categorization of web sites, looking for web pages containing words or phrases similar to the query. Most search engines do not require the web page to contain exact matching content. However, prior art search engines are limited by the accuracy of the query that is inputted. For example, misspellings, missing quotations, and other related errors, often cause the search engine to return with no results or irrelevant results. Therefore, it would be beneficial to provide an automated clustering technique that finds commonality amongst single words or phrases, and places the words or phrases into discrete groups. In this way, the clusters may be used to provide alternative words or phrases to a user or directly to a search engine, for example.
Text classification has become an important aspect of information technology. Present text classification techniques range from simple text matching to more complex clustering methods. Clustering describes a process of discovering structure in a collection of characters. The invention automatically analyzes a text string and either updates an existing cluster or creates a new cluster. To that end, the invention may use a character n-gram matching process in addition to other heuristic-based clustering techniques. In the character n-gram matching process, each text string is first normalized using several heuristics. It is then divided into a set of overlapping character n-grams, where n is the number of adjacent characters. If the commonality between the text string and the existing cluster members satisfies a pre-defined threshold, the text string is added to the cluster. If, on the other hand, the commonality does not satisfy the pre-defined threshold, a new cluster may be created. Each cluster may have a selected topic name. The topic name allows whole clusters to be compared in a similar way to the individual clusters or strings, and merged when a predetermined level of commonality exists between the subject clusters. The topic name also may be used as a suggested alternative to the text string. In this instance, the topic name of the cluster to which the text string was added may be outputted as an alternative to the text string.
More specifically, the invention provides a method, system and computer-readable medium having computer-executable instructions for clustering character strings. Each character string comprises a word or a phrase. The method comprises the steps of receiving at least one character string, and clustering a first character string with another character string into one or more groups, when the first character string satisfies a predetermined degree of commonality with one or more character strings in each of these groups. When the first character string does not satisfy the predetermined level of commonality with another character string, another group is created. The method also selects at least one of the character strings in each of the groups to be the group""s topic name. Selection of the topic may be based on a pre-designation or a frequency of the received character strings with the groups. The selected topic may then be outputted.
The invention may be used to suggest alternative words for text-based activity, like Web page searches and spell-checking applications. In the case of Web page searches, when a user enters a misspelled search term or a spelling variant such as a common abbreviation that sufficiently matches an existing cluster, the cluster""s topic may be searched instead of the misspelled query. In the context of spell-checking applications, commonly found in word processors, when the invention receives a user""s misspelled word, it may return the cluster""s topic, representing a collection of correctly spelled words.