1.1 Field of the Invention
The present invention relates to a speech recognition system and a method executed by a speech recognition system. More particularly, the invention relates to the vocabulary of a speech recognition system and its usage during the speech recognition process.
1.2 Description and Disadvantages of Prior Art
The invention may preferably be implemented in accordance with the IBM ViaVoice 98 speech recognition system developed by the present assignee. IBM ViaVoice 98 is a real time speech recognition system for large vocabularies which can be speaker-trained with little cost to the user. However, the invention is not limited to use with this particular system and may be used in accordance with other speech recognition systems.
The starting point in these known systems is the breakdown of the speech recognition process into a part based on acoustic data (decoding) and a language statistics part that refers back to bodies of language or text for a specific area of application (language model). The decision on candidate words is thus derived from both the decoder and a language model probability. For the user, fitting the vocabulary processed by the recognition system to the specific field, or even to individual requirements, is of particular significance.
With this speech recognition system, the acoustic decoding first supplies hypothetical words. The further evaluation of competing hypothetical words is then based on the language model. The language model represents estimates of word string frequencies obtained from application-specific bodies of text, based on a collection of text samples from a desired field of application. From these text samples, the most frequent forms of words and statistics on word sequences are generated.
In the method used here for estimating the frequency of sequences of words, the frequency of occurrence of so-called word form trigrams in a given text is estimated. In known speech recognition systems, the so-called Hidden Markov Model is frequently used for estimating the probabilities. Here, several frequencies observed in the text are set down. For a trigram "uvw" these are a nullgram term f0, a unigram term f(w), a bigram term f(w|v) and a trigram term f(w|uv). These terms correspond to the relative frequencies observed in the text, where the nullgram term has only a corrective significance.
If these terms are interpreted as probabilities of the word w under various conditions, a so-called latent variable can be added, from which one of the four conditions which produce the word w is achieved by substitution. If the transfer probabilities for the corresponding terms are designated λ0, λ1, λ2 and λ3, then we obtain the following expression for the trigram probability sought:
Pr(w|uv) = λ0 f0 + λ1 f(w) + λ2 f(w|v) + λ3 f(w|uv)
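The interpolation expressed by the formula above can be illustrated with a minimal sketch. It is not taken from any particular recognition system: the function names, the toy sentence, the choice of a uniform nullgram term (1 divided by the vocabulary size) and the four weights are all assumptions made for illustration only.

```python
from collections import Counter


def train_counts(tokens):
    """Count unigrams, bigrams and trigrams in a list of word forms."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def interpolated_trigram(u, v, w, counts, weights, vocab_size):
    """Weighted sum of nullgram, unigram, bigram and trigram frequencies."""
    uni, bi, tri = counts
    w0, w1, w2, w3 = weights            # the four transfer probabilities
    total = sum(uni.values())
    f0 = 1.0 / vocab_size               # nullgram term: uniform (assumption)
    f1 = uni[w] / total if total else 0.0
    f2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    f3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return w0 * f0 + w1 * f1 + w2 * f2 + w3 * f3


tokens = "der schalter ist geschlossen der schalter ist offen".split()
counts = train_counts(tokens)
weights = (0.1, 0.2, 0.3, 0.4)          # hypothetical values summing to 1
p_seen = interpolated_trigram("der", "schalter", "ist", counts, weights, 5)
p_unseen = interpolated_trigram("x", "y", "z", counts, weights, 5)
```

In this sketch an unseen trigram still receives a small non-zero probability through the nullgram term, which reflects its purely corrective role.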
The known speech recognition systems have the disadvantage that each word appears as a word form in the vocabulary of the system. For this reason there are relatively large demands on the memory capacity of the system. The generally very extensive vocabularies also have a disadvantageous effect on the speed of the recognition process.
Typical speech recognition systems work in real time on today's PCs. They have an active vocabulary of up to and exceeding 60,000 words and can recognize continuously and/or naturally spoken input without the need to adapt the system to specific characteristics of a speaker. S. Kunzmann ("VoiceType: A Multi-Lingual, Large Vocabulary Speech Recognition System for a PC", Proceedings of the 2nd SQEL Workshop, Pilsen, Apr. 27-29, 1997, ISBN 80-7082-314-3) gives an outline of these aspects. Given the actual vocabulary used in human communication, the vocabulary recognized by computer-based speech recognition systems would have to reach an order of magnitude of hundreds of thousands to several million words. Even if such large vocabularies were available today, besides the algorithmic limitations on recognizing them, issues like recognition accuracy, decoding speed and system resources (CPU, memory, disc) play a major role in classifying real-time speech recognition systems.
In the past, several approaches have been suggested to increase the size of the active vocabulary for such recognition systems. In particular, such state of the art approaches relate to the handling of compound words.
The German patent DE 19510083 C2, for instance, assumes that compound words, e.g., German "Fahrbahnschalter" or "vorgehen", are decomposed into constituents like "Fahrbahn-schalter" or "vor-gehen". The assumption is that composita are split into constituents which are sequences of legal words in the German language as well as in the recognition vocabulary ("Fahrbahn", "Schalter" and "vor", "gehen"). For each of these words statistics are computed, describing the most likely frequencies of each word (Fahrbahnschalter, vorgehen) in their context of occurrence, e.g., "Der Fahrbahnschalter ist geschlossen". In addition, separate frequency statistics are computed which describe the sequence of these constituents within compound words. Both statistical models are used to decide whether the individual constituents are displayed to the user as single words or as a compound word. Cases like "Verfügbarkeit" (constituents: "verfügbar"+"keit") or "Birnen" (constituents: "Birne"+"n") are not covered since "keit" and "n" are neither legal (standalone) words nor syllables in the German language, and thus are not contained within the recognition vocabulary. According to this teaching, an additional, separate frequency model is required to resolve problems of illegal word sequences during recombination of these arbitrary constituents into words (e.g. "vor"-"Verfügbar").
The recent U.S. Pat. No. 5,754,972 teaches the introduction of a special dictation mode where the user either announces a "compound dictation mode" or the system is switched into a special recognition mode. This is exposed to the user by a specific user interface. In languages like German the occurrence of compound words is extremely frequent, so the need to switch to specific dictation modes is extremely cumbersome. In addition, the teaching of U.S. Pat. No. 5,754,972 is based on the same fundamental assumption as German patent DE 19510083 C2: compound words can be built only from constituents that are legal words of the vocabulary in their own right. To support the generation of new compound words, the spelling of the characters of the compound word is introduced within this special dictation mode.
A different approach is disclosed by G. Ruske, "Half words as processing units in automatic speech recognition", Journal "Sprache und Datenverarbeitung", Vol. 8, 1984, Part 1/2, pp. 5-16. A word of the recognition vocabulary is usually described via its orthography (spelling) and its associated (multiple) pronunciations via smallest recognition units. The recognition units are the smallest recognizable units for the decoder. G. Ruske defines these recognition units based on a set of syllables (around 5000 in German). For each spelling in the vocabulary, a sequence of syllables describes the pronunciation(s) of the individual word. Thus, according to the teaching of Ruske, words of the vocabulary are set up from the recognition units of the decoder, which are identical to the syllables according to the pronunciation of the word in that language. The recombination of constituents to build words of the language is therefore limited to the recognition units of the decoder.
1.3 Objective of the Invention
The invention is based on the objective of providing a technology to increase the size of the active vocabulary recognized by speech recognition systems. It is a further objective of the current invention to reduce at the same time the algorithmic limitations on recognizing such extremely large vocabularies, for instance in terms of recognition accuracy, decoding speed and system resources (CPU, memory, disc), which play a major role in classifying real-time speech recognition systems.
These and other objectives are achieved by a speech recognition system according to the present invention. The invention teaches a speech recognition system for recognition of spoken speech of a language comprising a segmented vocabulary. The vocabulary includes a multitude of entries. An entry can be either identical to a legal word of said language, or an entry can be a constituent of a legal word of said language. A constituent can be an arbitrary sub-component of said legal word according to the orthography. The constituent is not limited to a syllable of said legal word or to a recognition unit of said speech recognition system.
The technique proposed by the current invention allows for a significant compression of a vocabulary. The invention makes it possible to define and store N words but to generate and recognize up to M×N words (where M is language dependent) as combinations of the vocabulary entries.
Smaller vocabularies additionally allow a better estimation of the word (or piece) probabilities (uni-, bi- and trigrams) within their context environment, as more occurrences are seen in the respective corpora.
Efficient storing is achieved by mapping the N words into a set of groups having the same pattern of constituents. Such an approach ensures logical completeness and coverage of the chosen vocabulary. Usually the user who dictates a word defined in the vocabulary expects that all derived forms are also available. For example, one doesn't expect that the word 'use' is in the vocabulary while 'user' is not.
Complete flexibility (as with the current teaching) in defining the constituent sets for each language makes it possible to achieve the best compression. The constituents are not necessarily known linguistic or phonetic units of the language.
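The grouping idea can be illustrated with a small sketch. The storage layout, the function name and the English example words are hypothetical and chosen only to show how a few stored entries expand into a larger set of recognizable words; the specification does not prescribe this layout.

```python
def expand(groups):
    """Generate every word obtainable from groups of (cores, suffixes).

    Each group holds cores that share the same pattern of suffix
    constituents; the empty suffix stands for the core used on its own.
    """
    words = set()
    for cores, suffixes in groups:
        for core in cores:
            for suffix in suffixes:
                words.add(core + suffix)
    return words


# Hypothetical group: three cores sharing one pattern of four suffixes.
groups = [(("use", "bake", "move"), ("", "r", "s", "d"))]
words = expand(groups)
stored_entries = sum(len(c) + len(s) for c, s in groups)
```

Here seven stored entries (three cores and four suffix constituents, counting the empty one) cover twelve word forms, and a user who finds 'use' in the vocabulary also finds 'user'.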
Additional advantages are accomplished by said vocabulary defining legal words of said language recognizable by said speech recognition system either by an entry itself or by recombination of up to S entries in combination representing a legal word of said language. The invention preferably suggests S being the number 2 or 3.
As any number of constituents can be used for recombination of legal words, the compression rate of such a segmented vocabulary can be very large. On the other hand, the compression rate and the algorithmic complexity of recombination are antagonistic properties of the proposed speech recognition system. Limiting the number of constituents that are recombined into a legal word to S=2 or S=3 is an effective compromise.
According to a further embodiment of the proposed invention the speech recognition system, if based on a segmented vocabulary, comprises, if S is 2, constituents allowing for recombination of legal words from a prefix-constituent and a core-constituent, or from a core-constituent and a suffix-constituent, or from a prefix-constituent and a suffix-constituent. In addition said vocabulary comprises, if S is 3, constituents allowing for recombination of legal words from a prefix-constituent, a core-constituent and a suffix-constituent.
By distinguishing different types of constituents, properties of the individual languages can be reflected since typically not every constituent type can be recombined with any other constituent type. This approach simplifies the recognition process and eases the determination of recognition errors.
According to a further embodiment of the proposed invention, a constituent combination table is taught. It indicates which concatenations of said constituents are legal concatenations in said language.
Such constituent combination tables are performance and storage efficient means to define which constituent may be recombined with other constituents resulting in a legal constituent or legal word of said language.
According to a further embodiment of the proposed invention, said constituent combination table comprises in the case of S=2 or S=3, a core-prefix-matrix indicating whether a combination of a prefix-constituent and a core-constituent is a legal combination in said language or not; and/or a prefix-suffix-matrix indicating whether a combination of a prefix-constituent and a suffix-constituent is a legal combination in said language or not; and/or a prefix-prefix-matrix indicating whether a combination of a first-prefix-constituent and a second-prefix-constituent is a legal combination in said language building a third-prefix constituent or not; and/or a core-suffix-matrix indicating whether a combination of a core-constituent and a suffix-constituent is a legal combination in said language or not.
The approach to reduce the question of legal recombinations to a sequence of decisions involving only two constituents reduces computation effort. Moreover, introduction of a collection of constituent combination tables depending on the types of constituents to be recombined increases efficiency of the recombination process. Depending on the type of constituents for certain cases, no legal combination is possible and thus no table access has to be performed. Also, in terms of access and storage requirements, it is more efficient to exploit a larger number of smaller tables than only a few larger tables.
According to a further embodiment of the proposed invention, said core-prefix-matrix and/or said core-suffix-matrix and/or said prefix-suffix-matrix and/or said prefix-prefix-matrix have a structure wherein said core-constituents, said prefix-constituents and said suffix-constituents are represented by unique numbers which form the indexes of said matrixes.
By encoding the various constituents as unique numbers and by setting up the various constituent combination tables based on these numbers, the complete recombination and recognition process is accelerated as no translations between constituents and their encodings are required anymore.
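A minimal sketch of such number-indexed combination tables follows. The class and method names, the use of nested Python lists as matrices and the table sizes are assumptions for illustration; a real system might well use packed bit matrices instead.

```python
PREFIX, CORE, SUFFIX = "prefix", "core", "suffix"


class CombinationTables:
    """One boolean matrix per combinable pair of constituent types."""

    def __init__(self, n_prefix, n_core, n_suffix):
        def matrix(rows, cols):
            return [[False] * cols for _ in range(rows)]

        # Rows are indexed by the left constituent's number,
        # columns by the right constituent's number.
        self.tables = {
            (PREFIX, PREFIX): matrix(n_prefix, n_prefix),
            (PREFIX, CORE): matrix(n_prefix, n_core),
            (PREFIX, SUFFIX): matrix(n_prefix, n_suffix),
            (CORE, SUFFIX): matrix(n_core, n_suffix),
        }

    def allow(self, left_type, left_id, right_type, right_id):
        """Mark a concatenation as legal."""
        self.tables[(left_type, right_type)][left_id][right_id] = True

    def is_legal(self, left_type, left_id, right_type, right_id):
        """Look up a concatenation; type pairs without a table are illegal."""
        table = self.tables.get((left_type, right_type))
        if table is None:
            return False        # e.g. suffix followed by core: no access needed
        return table[left_id][right_id]


tables = CombinationTables(n_prefix=2, n_core=2, n_suffix=2)
tables.allow(PREFIX, 0, CORE, 1)
```

Because the constituents are already numbers, a lookup is a pair of index operations, and a type pair for which no table exists is rejected without touching any matrix.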
According to a further embodiment of the proposed invention, a separate post-processor is suggested responsive to an input comprising recognized constituents of said vocabulary. Said post-processor recombines said constituents into legal words of said language exploiting said constituent combination table.
Implementing the recombination of constituents in a separate post-processor has the advantage that the teaching of the current invention can be applied to any existing speech recognition system without further modification or enhancement. If recombination is done in a post-processor, the statistical correlation information of the language model has already been exploited when the post-processor becomes active. Thus, the reliability of the recognized constituents is already high when they are input to the post-processor and will be increased further by said post-processing.
A further embodiment of the proposed invention relates to details of the recombination. Several cases can be distinguished.
Said post-processor is responsive to two consecutive constituents representing a first prefix-constituent and a second prefix-constituent and recombines said first prefix-constituent and said second prefix-constituent into a third prefix-constituent if said prefix-prefix-matrix is indicating said first prefix-constituent and said second prefix-constituent as a legal combination in said language. If said prefix-prefix-matrix indicates said first prefix-constituent and said second prefix-constituent as an illegal combination in said language, said first prefix-constituent is dropped.
Said post-processor is responsive to two consecutive constituents representing a prefix-constituent and a core-constituent and recombines said prefix-constituent and said core-constituent into a second core-constituent if said core-prefix-matrix is indicating said prefix-constituent and said core-constituent as a legal combination in said language. If said core-prefix-matrix indicates said prefix-constituent and said core-constituent as an illegal combination in said language, it replaces said prefix-constituent with an alternative prefix-constituent and recombines said alternative prefix-constituent and said core-constituent if said core-prefix-matrix is indicating said alternative prefix-constituent and said core-constituent as a legal combination in said language.
Said post-processor is responsive to two consecutive constituents representing a prefix-constituent and a suffix-constituent and recombines said prefix-constituent and said suffix-constituent into a second prefix-constituent if said prefix-suffix-matrix is indicating said prefix-constituent and said suffix-constituent as a legal combination in said language.
Said post-processor is responsive to two consecutive constituents representing a core-constituent and a suffix-constituent and recombines said core-constituent and said suffix-constituent into a second core-constituent if said core-suffix-matrix is indicating said core-constituent and said suffix-constituent as a legal combination in said language.
Besides recombining constituents, these features offer the advantage of detecting and, to a certain extent, correcting recognition errors.
According to a further embodiment of the proposed invention, said prefix-constituent and said suffix-constituent are not recombined and said prefix-constituent is treated as a separate entry if said prefix-suffix-matrix is indicating said prefix-constituent and said suffix-constituent as an illegal combination in said language. Moreover, said core-constituent and said suffix-constituent are not recombined and said core-constituent is treated as a separate entry if said core-suffix-matrix is indicating said core-constituent and said suffix-constituent as an illegal combination in said language.
This feature of the invention allows word boundaries to be determined.
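The pairwise recombination rules described above can be sketched as a single left-to-right pass. This is an illustrative simplification under stated assumptions: constituents are given as (type, text) pairs, sets of legal text pairs stand in for the matrices, all names are hypothetical, and the substitution of an alternative prefix-constituent is omitted, so an illegal prefix-core pair simply leaves the prefix as a separate entry.

```python
PREFIX, CORE, SUFFIX = "prefix", "core", "suffix"

# Type of the constituent that results from a legal merge.
MERGED_TYPE = {
    (PREFIX, PREFIX): PREFIX,
    (PREFIX, CORE): CORE,
    (PREFIX, SUFFIX): PREFIX,
    (CORE, SUFFIX): CORE,
}


def recombine(constituents, legal):
    """Recombine (type, text) constituents into a list of output entries.

    `legal` maps one of the four type pairs in MERGED_TYPE to the set of
    legal (left_text, right_text) pairs for that type pair.
    """
    output = []
    pending = None                       # constituent still open for merging
    for item in constituents:
        if pending is None:
            pending = item
            continue
        (ltype, ltext), (rtype, rtext) = pending, item
        if (ltext, rtext) in legal.get((ltype, rtype), set()):
            # Legal pair: merge and keep the result open for further merges.
            pending = (MERGED_TYPE[(ltype, rtype)], ltext + rtext)
        elif (ltype, rtype) == (PREFIX, PREFIX):
            # Illegal prefix-prefix pair: the first prefix is dropped.
            pending = item
        else:
            # No merge: the left constituent is a separate entry,
            # which marks a word boundary.
            output.append(ltext)
            pending = item
    if pending is not None:
        output.append(pending[1])
    return output


legal = {
    (CORE, SUFFIX): {("Birne", "n")},
    (PREFIX, CORE): {("vor", "gehen")},
}
result = recombine(
    [(CORE, "Birne"), (SUFFIX, "n"), (PREFIX, "vor"), (CORE, "gehen")], legal
)
```

With the hypothetical tables above, the boundary between the two words falls where the merged core is followed by a prefix, a pair for which no legal combination is defined.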
According to a further embodiment of the proposed invention, said alternative prefix-constituent is retrieved from an alternative-list comprising alternative prefix-constituents to said prefix-constituents in decreasing matching probability.
Such an approach further increases recognition accuracy.
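The use of an alternative-list can be sketched as follows. The function name, the representation of the list as a Python sequence ordered by decreasing matching probability (with the originally recognized prefix first) and the example data are assumptions made for illustration.

```python
def pick_prefix(ranked_prefixes, core, legal_prefix_core):
    """Return the first prefix in the ranked list that legally combines
    with `core`, or None if no candidate is legal."""
    for prefix in ranked_prefixes:
        if (prefix, core) in legal_prefix_core:
            return prefix
    return None


# Hypothetical data: the decoder's best guess comes first, followed by
# alternatives in decreasing matching probability.
legal_prefix_core = {("vor", "gehen")}
chosen = pick_prefix(["für", "vor"], "gehen", legal_prefix_core)
word = chosen + "gehen" if chosen is not None else None
```

In this sketch the best-scoring prefix is rejected because it forms no legal combination with the core, and the next alternative is used instead.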
The objectives stated above are also achieved by the method of the invention. Further embodiments of the proposed invention are provided herein.
For details of the features, reference is made to the claims. The features correspond closely to those of the device claims. As far as the advantages are concerned, the above statements relating to the claimed device are also applicable.