No current lexicon could be expected to contain entries for every possible word of a language, given the dynamic nature of language and the creativity of human beings. Nowadays, this phenomenon has become even more challenging as new technologies develop faster than before. Updating lexicons (dictionaries) by hand whenever new words are found is impractical and, even where possible, requires a great deal of expert time and effort.
Thus, inevitably, documents always contain out-of-vocabulary words (words which are not found in a dictionary). In particular, many domain-specific technical words as well as newly derived words, such as new compound words and morphological variations of existing words (by means of affixation), can be missing from a given lexicon. Some examples of real words that do not exist in most dictionaries are autoinjector, electrocardiography, eyedrop, remanufacturability, and website.
Words unknown to the lexicon cause many problems, especially for natural language processing (NLP) systems such as machine translation systems and parsers, because the lexicon is the most important and basic knowledge source for these applications. When an NLP application sees a word unknown to its lexicon, it either fails to process the document or guesses the information necessary to process the document. However, the guess is often inaccurate, and thus the system produces a poor result.
There has been a great effort to address this problem, especially in the areas of POS (part-of-speech) taggers and speech recognition. However, different applications view the out-of-vocabulary (OOV) problem from different perspectives and have different goals.
For POS taggers and parsers, which rely on lexical (syntactic) information about words, the goal is to guess the most plausible parts of speech of OOV words in context, based on the probability that an unknown word co-occurs with its neighboring words. Dermatas and Kokkinakis estimated the probability that an unknown word has a particular POS tag from the probability distribution of words which occur only once in the previously seen texts. See “Automatic stochastic tagging of natural language texts” in Computational Linguistics, 21(2), pp 137-164, 1995.
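The idea of estimating unknown-word tag probabilities from words seen only once can be illustrated with a minimal Python sketch. This is not the model from the cited paper: the function name hapax_tag_distribution, the lowercasing of words, and the tag labels in the usage note are assumptions made for illustration, and the sketch ignores the neighboring-word context entirely.

```python
from collections import Counter


def hapax_tag_distribution(tagged_tokens):
    """Estimate P(tag | unknown word) from words that occur exactly once.

    tagged_tokens is a list of (word, tag) pairs from previously seen text.
    Lowercasing before counting is an assumption of this sketch.
    """
    word_freq = Counter(word.lower() for word, _ in tagged_tokens)
    # Collect the tags of tokens whose word occurs only once in the corpus.
    hapax_tags = Counter(
        tag for word, tag in tagged_tokens if word_freq[word.lower()] == 1
    )
    total = sum(hapax_tags.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in hapax_tags.items()}
```

For a toy corpus such as [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")], the once-only words are cat, sat, on, and mat, so the estimate assigns half of the probability mass to NN and none to DT.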
More advanced POS guessing methods use leading and trailing word segments to determine possible tags for unknown words. Weischedel et al. proposed a POS guessing method for unknown words that uses the probability that an unknown word has a particular POS tag, given its capitalization feature and its ending. See Ralph Weischedel, Marie Meeter, Richard Schwartz, Lance Ramshaw, and Jeff Palmucci. “Coping with ambiguity and unknown words through probabilistic models” in Computational Linguistics, 19(2), pp 359-382, 1993.
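Guessing a tag from capitalization and word ending can be sketched as a simple relative-frequency lookup. The following Python sketch is a simplified stand-in, not the probabilistic model of the cited paper: the function names, the fixed three-character ending, and the tag labels used in the usage note are all assumptions.

```python
from collections import Counter, defaultdict


def features(word, suffix_len=3):
    """Return (is_capitalized, lowercase ending) for a word.

    A fixed-length ending is an assumption of this sketch.
    """
    return (word[0].isupper(), word[-suffix_len:].lower())


def train(tagged_words, suffix_len=3):
    """Count how often each tag occurs with each feature pair."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[features(word, suffix_len)][tag] += 1
    return counts


def tag_probabilities(word, counts, suffix_len=3):
    """Relative-frequency estimate of P(tag | capitalization, ending)."""
    tag_counts = counts.get(features(word, suffix_len))
    if not tag_counts:
        return {}  # feature pair never seen in training
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()}
```

Trained on [("walking", "VBG"), ("running", "VBG"), ("building", "NN"), ("London", "NNP")], the sketch gives the unseen word blogging a probability of 2/3 for VBG and 1/3 for NN, because it shares the lowercase ending "ing" with three training words.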
Eric Brill describes a system of rules which uses both end-guessing and more morphologically motivated rules in “Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging” in Computational Linguistics, 21(4), pp 543-565, 1995.
For speech recognition systems, an OOV word is either a word unknown to the system vocabulary or a word that the recognizer fails to recognize. The goal is to find the closest word (in terms of sound and meaning) to the OOV word from the system's vocabulary.
Character n-gram-based statistical approaches have been used in word-level language processing such as spelling correction and word segmentation. Angell, Freund and Willett describe a method of comparing misspellings with dictionary terms based on the number of trigrams that the two strings have in common, using Dice's similarity coefficient as the measure of similarity. The misspelled word is replaced by the word in the dictionary which best matches the misspelling. See “Automatic Spelling Correction Using a Trigram Similarity Measure” in Information Processing and Management, 19(4), pp 255-261, 1983.
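The trigram comparison can be illustrated with a short Python sketch of Dice's coefficient, 2|A ∩ B| / (|A| + |B|), computed over the trigram sets A and B of two strings. The function names, the "#" boundary marker added to each end of a word, and the use of sets rather than counted trigrams are assumptions of this sketch and may differ from the cited method.

```python
def trigrams(word, pad="#"):
    """Return the set of character trigrams of a word.

    One boundary marker is added on each side; this padding scheme
    is an assumption of the sketch.
    """
    padded = f"{pad}{word.lower()}{pad}"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def dice_similarity(a, b):
    """Dice's coefficient over the trigram sets of two strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0  # both strings empty: no trigrams to compare
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def best_match(misspelling, dictionary):
    """Return the dictionary word most similar to the misspelling."""
    return max(dictionary, key=lambda w: dice_similarity(misspelling, w))
```

For example, websiet and website share four of their seven trigrams each (#we, web, ebs, bsi), giving a coefficient of 8/14, so best_match("websiet", ["website", "webster", "wayside"]) selects website.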
Problems with the Prior Art
Prior art approaches have at least two problems.
First, the prior art does not permit the recognition and/or identification of valid words in any given natural language. For example, not all forms of a word (morphologically changed and/or derived) may be in a particular dictionary. Further, new words and/or “coined” words will not be in the dictionary database. This problem is particularly evident in technical subjects where new words are used to describe new technologies or advances in old technologies.
Previous approaches start from the assumption that OOV words are merely unknown to the systems' lexicons but are possible real words of the language. That is, these systems treat a new word such as website and invalid word strings such as adkfiedjfd or v3.5a in the same way. None of the previous work has tried to recognize possible new words of a language and provide a way to augment an existing dictionary, so that these words can be identified properly (as non-OOV) in the future.
Second, previous approaches have been embedded in application systems, either to protect the system from failing when it encounters OOV words or to improve the performance of the system. There is no stand-alone automatic system to find possible real words of a language and to acquire lexical information about those words.
Even though previous approaches address the OOV problem, they were designed for specific applications. They guess the information about those words that the specific application needs, on the basis of the context in which the words appear. Thus, the information guessed for a word may differ from one context to another.