Data mining includes the extraction of potentially useful information from data, such as data in a database. Clustering, which is often used in data mining, is the classification of data or attributes into different groups (clusters) such that the data in each cluster share a common trait. For example, data clusters allow searching to be performed more efficiently, since a cluster can be searched instead of each individual attribute, thus reducing the number of search operations.
In some computing systems, certain data clusters can be called “synonyms,” where a synonym can include a number of different data items that are all considered the same for search purposes or similar functions. The synonym can have a “root form” which is a default value of the synonym assumed when any of the associated data items are found. Synonyms can be useful in searching for and finding data that may not be an exact match to an input term. For example, searching for a particular name of a person will find exact matches to that name, and a synonym for that name can include variations of the name which can also be searched to find data related to the same person.
One standard way of utilizing synonyms in computing systems is to provide a synonym table that is a look-up table listing each root form word mapped to a cluster of words or data attributes (synonym words) associated with the root and all treated as having the same meaning. Typically, the known synonym words having the same meaning are pre-determined or pre-computed and stored in the synonym table for later use. When an input word is received, a matching synonym word or attribute is found by looking up the input word in the synonym table, which provides the root form word or synonym identifier.
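The look-up described above can be illustrated with a minimal sketch. The table contents and the names SYNONYM_TABLE, build_reverse_index, and lookup_root are hypothetical and chosen only for illustration; the sketch assumes a case-insensitive match and is not taken from any particular prior system.

```python
# Hypothetical pre-computed synonym table: root form -> known synonym words.
SYNONYM_TABLE = {
    "Robert": ["Bob", "Bobby", "Rob", "Robbie"],
    "William": ["Bill", "Billy", "Will"],
}


def build_reverse_index(table):
    """Map every synonym word (and each root itself) to its root form.

    Keys are lower-cased so that the look-up is case-insensitive
    (an assumption made for this sketch).
    """
    index = {}
    for root, words in table.items():
        index[root.lower()] = root
        for word in words:
            index[word.lower()] = root
    return index


REVERSE_INDEX = build_reverse_index(SYNONYM_TABLE)


def lookup_root(word):
    """Return the root form for an input word, or None if it is not in the table."""
    return REVERSE_INDEX.get(word.lower())
```

With this table, lookup_root("Bob") and lookup_root("robbie") both return "Robert", while a word that was never stored in the table, such as "Rabbie", returns None.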
One disadvantage of prior synonym use is that some synonym words for data are non-obvious and/or difficult to pre-compute. For example, synonym words for a first name (root word) of Robert may be Bob, Bobbie, Bobby, Dobb, Rab, Rabbie, Robbie, Robby, Rob, Robard, Raibeart, Lopaka, and Lopeti, and not all of these variations may have been found or determined beforehand. Further, the formation and update of synonyms or other types of data clusters are typically performed at discrete times after all desired data is input, or at the time of a query, which can greatly slow queries made during that processing and can allow synonym data to become inaccurate or incomplete (drift) before updates are made.
In addition, the look-up table mapping a root to synonym words requires domain knowledge of the type of synonym so that accurate and complete lists of synonym words can be found for that type. For example, linguistics domain knowledge and techniques must be used to accurately find synonym words for a name or word, while other domain knowledge must be used to determine other types of synonyms such as numerical values. Furthermore, the storage of all the synonym words for a root can take an enormous amount of storage, since all known synonym words of each root are stored regardless of whether those synonym words are ever used, stored or searched by the system.
Accordingly, what is needed is an improved method and apparatus for forming and modifying data clusters (such as synonyms) that, for example, can update synonyms quickly and prevent drift in data accuracy, require only the storage needed for synonyms and attributes in use by the system, and/or require no specific domain knowledge of the data. The present invention can address such needs.