Machine learning algorithms, such as artificial neural networks, decision trees, and support vector machines, are able to recognize patterns in a new data set after first being trained on a learning data set. Such algorithms have been used to filter spam emails, perform optical character recognition, counter credit card fraud, and understand the intent of a speaker from a spoken sentence. Generally, as the learning data set increases in size and the algorithm gains experience, its ability to correctly recognize patterns within new data improves. However, supplying a comprehensive learning data set may not be possible due to size and time constraints. In these cases, dictionaries can be used to augment the ability of a machine learning algorithm to apply previously learned patterns to new data.
Dictionaries can improve the performance of machine learning algorithms in several ways. For example, a dictionary could simply contain data entries that were not present during the training phase, thus increasing the size of the effective learning data set. Dictionaries can also expand a machine learning algorithm's ability to recognize data as belonging to a class or category, aiding in pattern recognition. For example, a part-of-speech tagging algorithm can be trained to mark words in text as corresponding to parts of speech, such as nouns and verbs. Providing the algorithm with dictionaries of words classified as nouns or verbs allows it to extend patterns learned from a limited initial data set to a much broader one.
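As a minimal sketch of how dictionary membership might be exposed to a part-of-speech tagger, the following Python turns lookups into per-token features. The dictionary contents, the `DICTIONARIES` name, and the feature-naming scheme are assumptions made for illustration; they do not come from any particular tagger.

```python
from typing import Dict, List, Set

# Hypothetical dictionaries; the entries are illustrative only.
DICTIONARIES: Dict[str, Set[str]] = {
    "noun": {"dog", "park", "river"},
    "verb": {"run", "walk", "park"},
}


def dictionary_features(token: str) -> Dict[str, bool]:
    """Return one boolean feature per dictionary: is the token listed in it?"""
    word = token.lower()
    return {
        f"in_{name}_dict": word in entries
        for name, entries in DICTIONARIES.items()
    }


def featurize(tokens: List[str]) -> List[Dict[str, bool]]:
    """Build dictionary features for every token in a sentence."""
    return [dictionary_features(token) for token in tokens]
```

A word never seen in training, such as "river", would still carry the noun-dictionary feature, so a pattern learned for other nouns can apply to it. A word listed in both dictionaries, such as "park" here, receives both features and gives the tagger a less decisive signal.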
Given their utility, dictionaries with many entries are preferred. Because manual compilation is time-consuming, entries are typically collected automatically by scripts and programs. For example, in the fields of natural language processing and understanding, efforts have focused on extracting dictionaries directly from raw source data. One technique extends previous categorizations in manually annotated data to non-annotated data. Another technique applies manually constructed rules or templates to source data, such as pattern matching with regular expressions. The results are dictionaries containing the entries captured by the categorization or template. Researchers have also explored transforming preexisting databases into dictionaries using defined rules and heuristics.
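The template-based technique can be illustrated with a short Python sketch that applies one regular expression to raw text and collects the captured strings as dictionary entries. The template itself (`CITY_TEMPLATE`) and the function name are hypothetical, chosen only to show the idea; real templates would be considerably more elaborate.

```python
import re
from typing import Set

# Hypothetical template: a capitalized word that follows the phrase "city of".
CITY_TEMPLATE = re.compile(r"\bcity of ([A-Z][a-z]+)")


def extract_entries(text: str, template: re.Pattern) -> Set[str]:
    """Collect every string captured by the template's single group."""
    return set(template.findall(text))
```

For instance, calling `extract_entries("She left the city of Boston for the city of Denver.", CITY_TEMPLATE)` would yield a two-entry dictionary of candidate city names. Because the template matches on surface form alone, it captures anything that fits the pattern, whether or not the captured string truly belongs to the intended category.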
One problem with automatically created dictionaries is their high level of entry duplication. Entry duplication between dictionaries creates “noise”: when the same entry appears in more than one dictionary, the algorithm cannot distinguish the entry as unique to a single dictionary. On a larger scale, duplication between dictionaries and the associated noise may hinder the utility of dictionaries for machine learning purposes. Previous work has focused on improving the process of automatic dictionary creation to produce dictionaries with less noise. What is needed is an improved method for reducing noise in dictionaries.
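To make the duplication problem concrete, the following Python sketch finds entries that appear in more than one dictionary. It only illustrates how such noise could be detected; it is not a noise-reduction method, and the function name and sample data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set


def shared_entries(dictionaries: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each duplicated entry to the names of the dictionaries containing it."""
    membership: Dict[str, Set[str]] = defaultdict(set)
    for name, entries in dictionaries.items():
        for entry in entries:
            membership[entry].add(name)
    # Keep only entries that are listed in two or more dictionaries.
    return {entry: names for entry, names in membership.items() if len(names) > 1}
```

Given a noun dictionary containing "dog" and "park" and a verb dictionary containing "run" and "park", the function would report "park" as shared between the two, which is exactly the kind of entry that cannot be attributed to a single dictionary.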