The invention generally relates to natural language processing and, more particularly, to the automatic transformation of the orthography of a text stream, such as the proper capitalization of words in a stream of text, especially with respect to automatic speech recognition.
Capitalized word forms in English can be divided into two main types: those that are determined by where the term occurs (or, positional capitalizations) and those that are determined by what the term denotes (or, denotational capitalizations). In English, positional capitalization occurs, for example, at the beginning of a sentence, or the beginning of quoted speech. Denotational capitalization is, to a first approximation, dependent upon whether the term or expression is a proper name.
Positional capitalization is straightforward; the rules governing positional capitalization are very clear. In the context of dictation and automatic speech recognition, sentence splitting is very accurate because the user must dictate the sentence-ending punctuation. By contrast, abbreviations and other phenomena make splitting written text into sentences a non-trivial task. In the context of dictation and automatic speech recognition, simple pattern matching allows one to do positional capitalization with near perfect accuracy.
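As an illustration of the pattern matching mentioned above, the following is a minimal sketch, assuming the dictated text already contains its sentence-ending punctuation. The function name and the regular expression are illustrative assumptions and are not taken from the described system; the pattern covers only the start of the text, the position after sentence-ending punctuation, and the start of quoted speech introduced by a comma.

```python
import re

# Hypothetical pattern: a lowercase letter is in a "capitalizing position" if it
# follows (1) the start of the text, (2) sentence-ending punctuation plus
# whitespace, or (3) a comma, whitespace and an opening double quote.
_CAP_POSITION = re.compile(r'(^\s*"?|[.!?]\s+"?|,\s+")([a-z])')


def capitalize_positions(text: str) -> str:
    """Uppercase the first letter found in each capitalizing position."""
    return _CAP_POSITION.sub(lambda m: m.group(1) + m.group(2).upper(), text)


if __name__ == "__main__":
    print(capitalize_positions("the cat sat. it purred! did it? yes."))
    print(capitalize_positions('he said, "go home."'))
```

A sketch like this would not be adequate for arbitrary written text, where abbreviations make sentence boundaries ambiguous, which is the contrast the paragraph above draws.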
Denotational capitalization is much harder to do automatically. Denotational capitalization can be viewed as the flip side of proper name recognition, an information extraction task for which the current state of the art reports about a 94% combined precision and recall over a restricted set of name types. In proper name recognition, the goal is to correctly determine which expressions refer to (the same) named entities in a text, using the words, their position and their capitalization. The goal is to use an expression and its context to determine if it is a proper name, and therefore, should be capitalized.
Existing speech recognition systems tend to make a large number of errors on capitalization: about 5-7% of dictated words, in English. Most of these errors are errors of denotational capitalization. The difficulty arises for terms which are both common nouns (or other uncapitalized words), and constituents of proper nouns, such as “Bill Gates” or “Black's Disease.”
Throughout the following description and claims, the term ‘tag’ is used to denote the properties that annotate a word or word phrase, including part of speech information. The term ‘feature’ is used in the maximum-entropy sense to mean the co-occurrence of certain items or properties.
A representative embodiment of the present invention includes a method of automatically rewriting the orthography of a stream of text words. If a word in the stream has an entry in an orthography rewrite lexicon, the word is automatically replaced with an orthographically rewritten form of the word from the orthography rewrite lexicon. In addition, selected words in the stream are compared to a plurality of features weighted by a maximum entropy-based algorithm, to automatically determine whether to rewrite orthography of any of the selected words. Orthographic rewriting may include properly capitalizing and/or abbreviating words in the stream of text words.
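The lexicon lookup step described above might be sketched as follows. This is only an illustration under stated assumptions: the lexicon is modeled as a dictionary keyed by lowercase word, and the entries and the name `rewrite_words` are hypothetical. Words without an entry pass through unchanged and would be left for the feature-based decision.

```python
from typing import Dict, List

# Hypothetical orthography rewrite lexicon: lowercase word -> rewritten form.
# The entries are invented for illustration only.
REWRITE_LEXICON: Dict[str, str] = {
    "ibm": "IBM",
    "monday": "Monday",
    "california": "California",
}


def rewrite_words(words: List[str], lexicon: Dict[str, str] = REWRITE_LEXICON) -> List[str]:
    """Replace each word that has a lexicon entry with its rewritten form."""
    return [lexicon.get(word.lower(), word) for word in words]


if __name__ == "__main__":
    print(rewrite_words(["meet", "ibm", "on", "monday"]))
```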
In a further embodiment, the method also includes, if a series of adjacent words in the stream has an entry in a phrase rewrite lexicon, replacing the series of adjacent words with a phrase form of the series of words from the phrase rewrite lexicon. Annotating linguistic tags may be associated with the orthographically rewritten form of the word. The method may also include providing linguistic tags to selected words in the stream, using context-sensitive rewrite rules to change the orthography of words in the stream based on their linguistic tags, and weighting the application of these rules in specific contexts according to maximum entropy weighting.
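One way to picture the phrase rewrite lexicon is a table keyed by sequences of adjacent words. The sketch below assumes a longest-match-first, left-to-right scan; that matching strategy, the example entries, and the name `rewrite_phrases` are assumptions made for illustration, since the text above does not specify how competing phrase entries are resolved.

```python
from typing import Dict, List, Tuple

# Hypothetical phrase rewrite lexicon: tuple of lowercase words -> phrase form.
PHRASE_LEXICON: Dict[Tuple[str, ...], str] = {
    ("new", "york"): "New York",
    ("new", "york", "city"): "New York City",
    ("united", "states", "of", "america"): "United States of America",
}


def rewrite_phrases(
    words: List[str],
    lexicon: Dict[Tuple[str, ...], str] = PHRASE_LEXICON,
) -> List[str]:
    """Replace runs of adjacent words that have a phrase entry (longest match first)."""
    max_len = max((len(key) for key in lexicon), default=0)
    out: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            key = tuple(w.lower() for w in words[i:i + n])
            if key in lexicon:
                out.append(lexicon[key])
                i += n
                break
        else:
            # No phrase entry starts here; keep the single word as-is.
            out.append(words[i])
            i += 1
    return out


if __name__ == "__main__":
    print(rewrite_phrases("i flew to new york city today".split()))
```

In this sketch a matched phrase is emitted as a single string; an implementation that attaches linguistic tags to the rewritten form would need a richer token type.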
At least one of the features may be a context-dependent probability distribution representing a likelihood of a given word in a given context being in a given orthographic form. In a further embodiment, the method includes, for each selected word, determining an orthographic rewrite probability representing a normalized product of the weighted features for that word and, if the orthographic rewrite probability is greater than a selected threshold probability, replacing that selected word with an orthographically rewritten form.
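The normalized product and threshold test can be illustrated with a small two-outcome sketch. It assumes a binary decision (rewrite or keep), multiplicative per-feature weights for each outcome, and normalization over those two outcomes; the feature names, weight values, and function names are hypothetical and are not taken from the described embodiment.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical weights: feature -> (weight for "rewrite", weight for "keep").
EXAMPLE_WEIGHTS: Dict[str, Tuple[float, float]] = {
    "prev_word=bill": (6.0, 1.0),
    "word_is_common_noun": (0.5, 1.0),
}


def rewrite_probability(
    active_features: Iterable[str],
    weights: Dict[str, Tuple[float, float]],
) -> float:
    """Normalized product of the weights of the active features."""
    score_rewrite = 1.0
    score_keep = 1.0
    for feature in active_features:
        if feature in weights:  # features without a weight are ignored
            w_rewrite, w_keep = weights[feature]
            score_rewrite *= w_rewrite
            score_keep *= w_keep
    # Normalize over the two possible outcomes.
    return score_rewrite / (score_rewrite + score_keep)


def maybe_rewrite(
    word: str,
    rewritten: str,
    active_features: Iterable[str],
    weights: Dict[str, Tuple[float, float]],
    threshold: float = 0.5,
) -> str:
    """Return the rewritten form only if its probability exceeds the threshold."""
    p = rewrite_probability(active_features, weights)
    return rewritten if p > threshold else word


if __name__ == "__main__":
    feats = ["prev_word=bill", "word_is_common_noun"]
    print(rewrite_probability(feats, EXAMPLE_WEIGHTS))
    print(maybe_rewrite("gates", "Gates", feats, EXAMPLE_WEIGHTS))
```

With the example weights, both features active give a product of 3.0 for rewriting against 1.0 for keeping, so the normalized probability is 0.75 and the word is rewritten at a threshold of 0.5; with only the common-noun feature active the probability falls to about 0.33 and the word is left unchanged.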