1. Technical Field
The present invention relates to speech processing systems and, more particularly, to a method and device for increasing the dialect precision and usability in speech recognition and text-to-speech systems.
2. Discussion of Related Prior Art
Generally, in a speech recognition system, each word of a vocabulary to be recognized is represented by a baseform wherein a word is divided for recognition purposes into a structure of phones, i.e. phonetic elements, as shown in FIG. 1. See also F. Jelinek, “Continuous Speech Recognition by Statistical Methods”, Proceedings IEEE, Vol. 64, 1976, pp. 532-576, incorporated by reference herein.
These phones correspond generally to the sounds of vowels and consonants as are commonly used in phonetic alphabets. In actual speech, a portion of a word may have different pronunciations, as indicated in FIG. 2. FIG. 2 illustrates a freely choosable pronunciation alternative, with the first phone of the word having two pronunciation alternatives.
A typical speech recognition system would store a separate and distinct linear baseform representation for each pronunciation alternative, where each representation consists of a unique linear combination of phones or phonemes. For the “economics” exemplar, the speech recognition system would store two separate linear strings, as illustrated in FIG. 2.
In addition to freely choosable pronunciation variations, typical speech recognition systems also store dialectal alternatives in a similar manner. FIG. 3 illustrates dialectal alternatives for the exemplar “economics”, showing both a New York City area pronunciation and a Canadian pronunciation, which differ at the fifth phone of the word. Although FIG. 3 shows two dialectal alternatives, any number of dialectal variations may be considered by the method. A typical speech recognition system would be required to store four separate linear baseform representations for the exemplar “economics” to account for a single freely choosable pronunciation alternative and a single dialectal alternative, since each of the two variation points offers two choices.
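The count of four follows from multiplying the number of choices at each variation point. The following is a minimal sketch of that enumeration, not a reproduction of FIG. 2 or FIG. 3: the phone symbols (ARPAbet-style) and the name `linear_baseforms` are assumptions made for illustration only.

```python
import itertools

# Hypothetical phone symbols; the actual phones shown in the figures
# are not reproduced here.
SLOTS = [
    ["IY", "EH"],          # first phone: freely choosable alternative
    ["K"], ["AH"], ["N"],
    ["AO", "AA"],          # fifth phone: dialectal alternative
    ["M"], ["IH"], ["K"], ["S"],
]

def linear_baseforms(slots):
    """Enumerate every linear phone string implied by the per-slot choices."""
    return [" ".join(combo) for combo in itertools.product(*slots)]

forms = linear_baseforms(SLOTS)
for form in forms:
    print(form)
print(len(forms), "linear baseforms to store")
```

Each further two-way variation point doubles the number of strings, so six independent two-way points would already require 64 separately stored baseforms.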
For certain applications, storing each of the baseform representations of a word is acceptable; in the general case, however, it can lead to problems. If, for example, it is discovered subsequent to an initial construction stage that additional variation must be considered, the process of editing the pronunciation lexicon can become tedious and subject to errors as a consequence of making each change manually. Another associated drawback of storing every conceivable baseform representation of a word or phrase occurs in real-time applications where a primary objective of the speech recognition system is to minimize the error rate. The common element in such real-time applications is that the speech recognition system is not afforded the luxury of enrolling the speaker (i.e. determining his or her speech characteristics in a sample session). Typical real-time applications may include, for example, a person walking up to a kiosk in a mall or subscribing over the telephone. When all of the possible baseform representations are pre-stored in the lexicon, the speech recognition system is more error-prone, because it must choose among a greater number of alternatives and has no capacity to develop a characterization model of an individual with which to weight one pronunciation and/or dialect over another.
Accordingly, it would be desirable to provide a method and device for reducing the size of the pronunciation lexicon by storing only the reasonable pronunciations for a particular dialect or set of dialects. It is also desirable to eliminate errors inherent in manually inputting one or more variant baseforms, where such variations can be on the order of fifty or more in certain applications. Further, it is also desirable to reduce the cost and drudgery associated with the manual input of changes to the pronunciation lexicon.
In accordance with the present invention, a method for increasing both dialect precision and usability in speech recognition and text-to-speech systems is described. The invention generates non-linear (i.e. encoded) baseform representations for words and phrases from a pronunciation lexicon. The baseform representations are encoded to incorporate both pronunciation variations and dialectal variations. The encoded baseform representations may be later expanded (i.e. decoded) into one or more linear dialect specific baseform representations, utilizing a set of dialect specific phonological rules. The method provides the additional capability for a user specified dialect independent mode, whereby all encoded baseform variations will be included as part of the decoded output lexicon.
According to an illustrative embodiment, words and phrases from a pronunciation lexicon are encoded for both pronunciation and dialectal variations. A single encoded (i.e. non-linear) baseform representation will be stored for each word or phrase that contains a pronunciation and/or dialectal variation. Note that not all words and phrases will contain such variations, and as such they will be stored unencoded as linear baseform representations. Special encoding symbols are used to encode the variations. The encoded baseform representations may be later decoded (i.e. expanded) any number of times as needed into linear output baseform representations that are either dialect specific or dialect independent, depending upon a user specified dialect preference.
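The relationship between a single stored encoded baseform and its expanded linear forms can be sketched as follows. The text does not specify the special encoding symbols at this point, so the `(A|B)` notation for a freely choosable alternative, the phone symbols, and the function names `is_encoded` and `expand` are all assumptions chosen for illustration.

```python
import itertools
import re

# Hypothetical notation: a token of the form "(A|B)" marks a slot with
# alternatives; every other token is an ordinary phone.
ALTERNATIVE = re.compile(r"^\((.+)\)$")

def is_encoded(baseform):
    """Return True if the baseform contains at least one encoded slot."""
    return any(ALTERNATIVE.match(token) for token in baseform.split())

def expand(baseform):
    """Decode one stored baseform into its list of linear baseforms."""
    slots = []
    for token in baseform.split():
        match = ALTERNATIVE.match(token)
        # An encoded slot contributes several choices, a plain phone one.
        slots.append(match.group(1).split("|") if match else [token])
    return [" ".join(combo) for combo in itertools.product(*slots)]

stored = {
    "economics": "(IY|EH) K AH N AA M IH K S",  # encoded (non-linear)
    "cat": "K AE T",                            # unencoded (already linear)
}
for word, baseform in stored.items():
    print(word, is_encoded(baseform), expand(baseform))
```

Under these assumptions an unencoded entry passes through unchanged, and the same stored entry can be expanded as often as needed without editing the lexicon itself.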
In accordance with an embodiment of the present invention, a computer based pronunciation lexicon generation system is formed with a first data file comprised of an encoded lexicon of non-linear baseforms and a second data file having one or more sets of dialect specific phonological rules. The system further includes a computer processor which is operatively coupled to the first and second data files and generates a third output data file therefrom. The output data file is a decoded pronunciation lexicon comprised of a plurality of linear (i.e. decoded) baseform representations. The output data file is generated by the processor which applies dialect specific phonological rules from the second data file to encoded baseform representations in the first data file. In the case where a user does not specify a preferred dialect, all of the phonological rules from the rule set database will be used to decode the first data file.
In one aspect of the invention, a method for generating a dialect specific pronunciation lexicon from an encoded pronunciation lexicon comprises the steps of: constructing an encoded pronunciation lexicon having a plurality of encoded and unencoded baseforms; inputting one or more user specified dialects; selecting dialect specific phonological rules from a rule set database; and decoding the encoded pronunciation lexicon using the dialect specific phonological rules to yield a dialect specific decoded pronunciation lexicon.
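The four steps recited above (constructing the encoded lexicon, receiving the dialect preference, selecting rules, and decoding) can be sketched as a small pipeline. This is an illustrative sketch under stated assumptions, not the claimed implementation: the `%O` dialect symbol, the `(A|B)` free-alternative notation, the dialect names, the phone symbols, and every function and variable name are hypothetical.

```python
import itertools

# Hypothetical rule set database: dialect -> {encoding symbol: phones}.
RULE_SET_DATABASE = {
    "new_york": {"%O": ["AO"]},
    "canadian": {"%O": ["AA"]},
}

# Step 1: an encoded lexicon holding encoded and unencoded baseforms.
ENCODED_LEXICON = {
    "economics": "(IY|EH) K AH N %O M IH K S",
    "cat": "K AE T",
}

def select_rules(dialects=None):
    """Step 3: merge the rules of the requested dialects.
    With no preference, every rule set is used (dialect independent mode)."""
    chosen = dialects if dialects else sorted(RULE_SET_DATABASE)
    merged = {}
    for dialect in chosen:
        for symbol, phones in RULE_SET_DATABASE[dialect].items():
            bucket = merged.setdefault(symbol, [])
            for phone in phones:
                if phone not in bucket:
                    bucket.append(phone)
    return merged

def decode_baseform(encoded, rules):
    """Expand one encoded baseform into linear baseforms."""
    options = []
    for token in encoded.split():
        if token.startswith("%"):
            # Dialect symbol; raises KeyError if no selected rule covers it.
            options.append(rules[token])
        elif token.startswith("("):
            options.append(token.strip("()").split("|"))
        else:
            options.append([token])
    return [" ".join(combo) for combo in itertools.product(*options)]

def decode_lexicon(lexicon, dialects=None):
    """Steps 2 to 4: take the dialect preference, select rules, decode."""
    rules = select_rules(dialects)
    return {word: decode_baseform(b, rules) for word, b in lexicon.items()}

print(decode_lexicon(ENCODED_LEXICON, ["canadian"])["economics"])
print(decode_lexicon(ENCODED_LEXICON)["economics"])
```

In this sketch a single dialect preference yields two linear baseforms for "economics", while omitting the preference yields all four, so the stored lexicon stays the same size regardless of how many dialects are later requested.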
The method of the present invention is advantageous because (a) it facilitates the straightforward generation of different baseform sets for different dialects, thereby increasing recognition accuracy; (b) it eliminates the errors inherent in inputting multiple, sometimes fifty or more, variant baseforms; (c) it allows significantly easier updates and corrections because the baseform representation is more perspicuous; and (d) it requires far less input from the system designer who is establishing the baseforms.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.