1. Field of the Invention
The present invention relates generally to speech recognition and in particular to providing a dictionary for a speech recognition system.
2. Description of the Related Art
A considerable problem in computer-assisted speech recognition lies in the size of the stored vocabulary. Since computer-assisted speech recognition is implemented by comparing sequences of feature vectors, each of which represents a word in a feature space, recognition becomes increasingly inexact as the number of words to be compared grows. The cause lies in the greater number of sequences of feature vectors in the feature space: as the number of feature vectors increases, the distances between the feature vectors in the feature space become smaller, and a classification of the word to be recognized thus becomes increasingly problematical.
This gives rise to the need to reduce the vocabulary. A problem in reducing the vocabulary, however, is that words that are not stored with the corresponding sequences of feature vectors are also not recognized.
Further, the considerable memory requirement is a disadvantage of storing a large vocabulary. This high memory requirement leads to considerable costs that are unavoidable when a large vocabulary that is not adapted to the application is employed.
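The crowding effect described above, in which stored feature vectors move closer together as the vocabulary grows, can be illustrated with a minimal sketch. It assumes, purely for illustration, that each word is represented by a single random 12-dimensional vector rather than a sequence of feature vectors; the dimension, the vocabulary sizes and the function name are hypothetical and not taken from the specification.

```python
import math
import random


def min_pairwise_distance(vectors):
    """Smallest Euclidean distance between any two distinct stored vectors."""
    return min(
        math.dist(a, b)
        for i, a in enumerate(vectors)
        for b in vectors[i + 1:]
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical feature vectors, one per vocabulary word (an assumption).
vocabulary = [[rng.random() for _ in range(12)] for _ in range(400)]

# Each larger vocabulary contains the smaller one, so the closest pair
# can only stay the same or move closer together as words are added.
for size in (10, 50, 400):
    print(size, round(min_pairwise_distance(vocabulary[:size]), 3))
```

Because every larger vocabulary is a superset of the smaller one, the smallest distance between two stored entries can never increase, which is the sense in which a larger vocabulary makes classification harder.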
The structure of a text-phoneme converter is known, for example, from the publications K. Wothke, Morphologically Based Automatic Phonetic Transcription, IBM Systems Journal, Vol. 32, No. 3, pp. 486-511, 1993; S. Besling, Heuristical and Statistical Methods for Grapheme-to-Phoneme Conversion, Proceedings KONVENS 1994, Vienna, Austria, pp. 23-31, 1994; and W. Daelemans et al., Data-oriented Methods for Grapheme-to-Phoneme Conversion, Proceedings EACL 1993, Utrecht, pp. 45-53, 1993.
A survey of the fundamentals of expert systems is given in the publication by K. Bauer et al., Expertensysteme: Einführung in Technik und Anwendung, Siemens AG, Engineering und Kommunikation, D. Nebendahl (Ed.), Siemens AG (Publishing Department), ISBN 3-8009-1495-6, pp. 27-82, 1987.
The invention is based on the problem of specifying an arrangement for producing a digital dictionary with which flexible, economical and high-quality speech recognition can be realized on the basis of the digital dictionary. Another problem is to specify an arrangement for speech recognition that employs the digital dictionary. The invention is also based on the problem of specifying a method for producing a digital dictionary with the assistance of a computer that enables flexible, economical and high-quality speech recognition on the basis of the digital dictionary.
The first described arrangement comprises a means for selecting and reading in arbitrary electronic documents, a first memory for permanent storage of standard words and their phoneme sequences, a second memory for temporary storage of added words and their phoneme sequences, and a text-phoneme converter. As a result of the means for selecting and reading in arbitrary electronic documents, those documents that contain a suitable vocabulary for a prescribable application situation are automatically selected and read in from an arbitrary set of electronic documents on the basis of the greatest variety of criteria. Added words, for which the corresponding phoneme sequences are formed with the text-phoneme converter, are selected from the words of these documents. The added words and the corresponding phoneme sequences are temporarily stored in the second memory. At least the standard words and the added words and their respective phoneme sequences form the digital dictionary.
This arrangement makes it possible to respectively determine an application-specific vocabulary for the duration of a specific, characteristic application and to temporarily store these words together with the corresponding phoneme sequences and temporarily incorporate them in the digital dictionary. When the application changes, new electronic documents are again identified and added words are again selected from them, these then being again temporarily stored in the second memory with the corresponding phoneme sequences. The inventive arrangement thus achieves a considerable reduction of the memory space required.
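The interplay of the permanent first memory, the temporary second memory and the text-phoneme converter can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed arrangement: the class name `DigitalDictionary`, the method names and the letter-by-letter `toy_converter` are hypothetical stand-ins, and a real text-phoneme converter would follow one of the cited publications.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Phonemes = List[str]


def toy_converter(word: str) -> Phonemes:
    """Hypothetical stand-in for the text-phoneme converter: one symbol per letter."""
    return [ch for ch in word.lower() if ch.isalpha()]


@dataclass
class DigitalDictionary:
    standard: Dict[str, Phonemes]          # first memory: permanent standard words
    converter: Callable[[str], Phonemes]   # text-phoneme converter
    added: Dict[str, Phonemes] = field(default_factory=dict)  # second memory: temporary

    def load_application(self, added_words: Iterable[str]) -> None:
        """Discard the previous application's added words and store the new ones."""
        self.added = {
            w: self.converter(w) for w in added_words if w not in self.standard
        }

    def entries(self) -> Dict[str, Phonemes]:
        """Standard words plus added words together form the digital dictionary."""
        return {**self.standard, **self.added}


dictionary = DigitalDictionary(standard={"yes": ["j", "E", "s"]}, converter=toy_converter)
dictionary.load_application(["stent", "yes"])  # first application, e.g. medical
medical = dictionary.entries()
dictionary.load_application(["invoice"])       # the application changes
billing = dictionary.entries()
```

In this sketch the added words of the first application disappear from the dictionary once the second application is loaded, while the standard words remain available throughout.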
Compared to the first described arrangement, the second disclosed arrangement additionally comprises a means for speaker-independent speech recognition. By reducing the respectively investigated (compared) vocabulary in the speech recognition, it thereby becomes possible to achieve a better recognition performance than is possible with the known arrangements, since the plurality of words to be compared has been reduced without, however, leaving specific words characteristic of the application out of consideration. Further, the inventive arrangement can be very flexibly utilized under the greatest variety of application situations without requiring involved training phases of the vocabulary for each characteristic application situation. As a result thereof, a substantially lower requirement of memory space is achieved, this leading to a considerable saving of costs for the arrangement for speech recognition.
In the method according to patent claim 6, a digital dictionary that already comprises standard words and phoneme sequences allocated to the standard words at the beginning of the method is built up with the assistance of a computer. Electronic documents are identified in a first step and added words are selected from the electronic documents. A respective phoneme sequence that is allocated to the respective added word is formed for the added words with the assistance of the text-phoneme converter. The added words are temporarily stored and temporarily assigned to the digital dictionary. As a result of the inventive method, a digital dictionary is built up in a way that can be very flexibly adapted to changing applications. A speech recognition that uses the dictionary which has been built up according to the present invention is thus implemented rapidly and reliably with low costs since, first, the plurality of permanently stored words is reduced and, second, the density of the individual phoneme sequences in the feature space is likewise reduced, this leading to improved recognition performance in the speech recognition.
Advantageous developments of the inventive arrangements as well as advantageous developments of the inventive method are provided by a decision unit that is additionally provided for the selection of the added words from the selected electronic documents. A third memory may be provided for storing predetermined reserve words and phoneme sequences allocated to the reserve words that are temporarily stored for each application, whereby the phoneme sequences of the reserve words exhibit a higher quality than the phoneme sequences that are formed by the text-phoneme converter. In one embodiment, a fourth memory is provided for storing speech feature vectors of utterances and/or words that are prescribed by a user, whereby the speech feature vectors respectively characterize a part of the word. User feature vectors of a part of a digitalized voice signal that characterize the part of the digitalized voice signal are compared to stored phoneme feature vectors and/or to stored speech feature vectors, whereby the phoneme feature vectors respectively characterize the phoneme, and whereby the speech feature vectors respectively characterize a part of the word. As a preferred development, the determination of the electronic documents ensues according to at least one of the following rules: an unambiguous allocation of the electronic documents is predetermined; spoken words of a user are recognized by the arrangement for speech recognition and the determination ensues on the basis of the recognized words and on the basis of a predetermined, first rule base; spoken words of a user are recognized by the arrangement for speech recognition and the determination ensues on the basis of the recognized words and on the basis of a predetermined, first rule base in an interactive dialogue between the arrangement for speech recognition and the user.
Information characterizing the document type is stored for each document type of the electronic documents, and the electronic documents are determined on the basis of the information. The selection of the added words from the electronic documents ensues according to at least one of the following procedures: those words from the electronic documents that are marked in at least one prescribable way in the electronic documents are selected from the electronic documents as added words; the selection ensues on the basis of a predetermined, second rule base; those words from the electronic documents are selected as added words that are marked in at least one prescribable way in the electronic documents and that exhibit an importance factor derived from the marking that is greater than prescribable importance factors of other words; those words from the electronic documents that are marked in at least one prescribable way in the electronic documents are selected as added words, whereby the markings that are employed for the selection of the added words are determined from the document type of the respective electronic document. The method according to the present invention preferably has predetermined reserve words and phoneme sequences allocated to the reserve words which are temporarily stored for every application, dependent on the application, and are assigned to the dictionary, whereby the phoneme sequences of the reserve words exhibit a higher quality than the phoneme sequences that are formed by the text-phoneme converter. In one embodiment, a digitalized voice signal spoken by a user is divided into an arbitrary plurality of digital voice signal parts.
The following steps are implemented for each voice signal part of a prescribable set of voice signal parts: a user feature vector that characterizes the voice signal part is determined for the voice signal part; a respective similarity value of the user feature vector with the respective phoneme feature vector is determined from a comparison of the user feature vector to at least an arbitrary plurality of phoneme feature vectors that respectively characterize a stored phoneme; the similarity values are stored; a sequence of phonemes for the digitalized voice signal, with which the digitalized voice signal is described, is determined on the basis of the similarity values; and the digitalized voice signal and the sequence of phonemes allocated to the digitalized voice signal are stored in a fourth memory.
According to the present invention, speaker-independent speech recognition using a digital dictionary constructed according to the foregoing is provided. In particular, a digitalized voice signal spoken by a user is divided into sets of digital voice signal parts; a sequence of feature vectors with which the digitalized voice signal is characterized is formed from the digitalized voice signal parts; a sequence of phonemes that is most similar to the sequence of feature vectors with respect to a similarity criterion is determined on the basis of the feature vectors in the feature space; and a recognized word derives from the sequence of phonemes.
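The per-part steps listed above (compare each user feature vector with the stored phoneme feature vectors, keep the similarity values, and derive a phoneme sequence from them) can be sketched as follows. The sketch assumes that the user feature vectors have already been computed, uses negative Euclidean distance as the similarity value, and merges consecutive identical phonemes; all three choices, as well as the function names and the two-dimensional toy vectors, are assumptions made for illustration and are not prescribed by the specification.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def similarity(a: Vector, b: Vector) -> float:
    """Assumed similarity value: negative Euclidean distance (larger is more similar)."""
    return -math.dist(a, b)


def phoneme_sequence(
    user_vectors: List[Vector], phoneme_vectors: Dict[str, Vector]
) -> Tuple[List[str], List[Dict[str, float]]]:
    """Return the phoneme sequence and the stored similarity values per signal part."""
    stored_similarities: List[Dict[str, float]] = []
    sequence: List[str] = []
    for vec in user_vectors:
        # Compare the user feature vector of this part with every phoneme prototype.
        scores = {p: similarity(vec, proto) for p, proto in phoneme_vectors.items()}
        stored_similarities.append(scores)
        best = max(scores, key=scores.get)
        # Assumption: consecutive parts with the same best phoneme are merged.
        if not sequence or sequence[-1] != best:
            sequence.append(best)
    return sequence, stored_similarities


# Hypothetical phoneme feature vectors and user feature vectors.
prototypes = {"a": (0.0, 0.0), "b": (1.0, 1.0)}
parts = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.0)]
sequence, similarities = phoneme_sequence(parts, prototypes)
```

The resulting sequence, together with the voice signal it describes, is what would be written to the fourth memory in the development described above.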
In a development of the inventive arrangement, it is advantageous to provide a decision unit WE with which a selection of the added words from the electronic documents is implemented. The selection can ensue in the greatest variety of ways. A further flexibility in the use of the arrangement for speech recognition is thus achieved by this development. A further reduction of the requirement for memory space is also achieved by this development since, due to an intentional selection of the added words from the words of the electronic documents, not all words but only those that are in fact of great significance with respect to the respective application situation need be temporarily stored.
Further, it is advantageous in a development of the arrangement to provide a third memory in which reserve words and their phoneme sequences are temporarily stored, whereby the phoneme sequences exhibit a higher quality than the phoneme sequences that are formed by the text-phoneme converter. As a result of this development, the quality of the “phoneme prototypes” is enhanced, as a result whereof the results of the speech recognition are clearly improved.
It is also advantageous to provide an arrangement with which phoneme sequences are formed for words spoken by a user and these are likewise stored. The employment of the arrangement is thus no longer limited to only electronic documents; rather, the arrangement can, as a result of this development, also process words spoken by a user. This leads to an enhanced flexibility of the employment of the arrangement for speech recognition.
It is also advantageous in a development for the inventive method according to patent claim 6 to implement the determination, i.e. the selection, of the electronic documents according to different rules.
For example, it is thus advantageous to unambiguously allocate the electronic documents to specific applications, as a result whereof the speed of the implementation of the determination of the electronic documents is considerably increased.
A refinement of the selection of the electronic documents, and a further increase in its flexibility, is achieved in that spoken words of a user are recognized by the arrangement for speech recognition and the determination ensues on the basis of the recognized words as well as of a predetermined, first rule base. The actual selection of the electronic documents is thus matched considerably better to the application, and, thus, added words are selected that correspond very well to the respective application.
It is also advantageous to determine the selection with reference to the rule base in an interactive dialogue between the arrangement and the user in order to thus achieve a further refinement of the selection. A very simple development of the inventive method consists in making the selection on the basis of document types of the electronic documents.
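The determination of electronic documents from recognized words and a first rule base can be sketched as follows. The contents of `RULE_BASE`, the topic labels attached to each document and the file names are hypothetical; the specification leaves the form of the rule base open, so a simple lookup table is assumed here.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical first rule base: recognized word -> topics it points to.
RULE_BASE: Dict[str, Set[str]] = {
    "patient": {"medical"},
    "invoice": {"billing"},
    "payment": {"billing"},
}

# Hypothetical document store: document name -> topics it covers.
DOCUMENTS: Dict[str, Set[str]] = {
    "discharge_letter.txt": {"medical"},
    "price_list.txt": {"billing"},
    "newsletter.txt": {"general"},
}


def select_documents(
    recognized_words: Iterable[str],
    rule_base: Dict[str, Set[str]] = RULE_BASE,
    documents: Dict[str, Set[str]] = DOCUMENTS,
) -> List[str]:
    """Return the documents whose topics are triggered by the recognized words."""
    wanted: Set[str] = set()
    for word in recognized_words:
        wanted |= rule_base.get(word.lower(), set())
    return sorted(name for name, topics in documents.items() if topics & wanted)


selected = select_documents(["Invoice", "please"])
```

An unambiguous allocation of documents to applications corresponds to the degenerate case in which the application name itself is the only key consulted; the interactive dialogue variant would call such a function repeatedly as further words are recognized.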
It is also advantageous to implement the selection of the added words from the electronic documents in a way that words marked in a specific way are automatically selected as added words. The selection of the added words is thus substantially accelerated. It is also advantageous to make the selection of the added words on the basis of a predetermined, second rule base. The selection of the added words is thereby considerably refined without noticeably reducing the added words with substantial significance with respect to the speech recognition process in the respective application.
It is also advantageous to derive an importance factor for the individual markings in the electronic documents in order to indirectly select only those marked words therefrom that are in fact of significance. This leads to a further reduction of the memory requirement since a lower number of added words is stored. The recognition performance of the speech recognition that uses the digital dictionary is also enhanced further since only a lower plurality of added words need be compared to one another. The density of the feature space is thus reduced further.
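Selecting added words by an importance factor derived from their markings can be sketched as follows. The sketch assumes HTML-like markings and a fixed weight per marking, and it keeps a word when its importance factor reaches a prescribable threshold; the tag names, the weights in `MARKING_WEIGHTS` and the threshold rule are all hypothetical illustrations rather than the markings or factors of the specification.

```python
import re
from typing import Dict, List

# Hypothetical importance factors per marking (assumed HTML-like tags).
MARKING_WEIGHTS: Dict[str, float] = {"h1": 3.0, "b": 2.0, "i": 1.0}


def select_added_words(text: str, threshold: float) -> List[str]:
    """Return marked words whose importance factor is at least the threshold."""
    importance: Dict[str, float] = {}
    # Find every marked span; the backreference \1 matches the same closing tag.
    for tag, body in re.findall(r"<(h1|b|i)>(.*?)</\1>", text, flags=re.S):
        for word in re.findall(r"[A-Za-z]+", body):
            key = word.lower()
            # A word marked several times keeps its highest importance factor.
            importance[key] = max(importance.get(key, 0.0), MARKING_WEIGHTS[tag])
    return sorted(w for w, factor in importance.items() if factor >= threshold)


document = "<h1>Cardiology Report</h1> The <b>stent</b> was <i>probably</i> fine."
added_words = select_added_words(document, threshold=2.0)
```

Raising the threshold stores fewer added words, which is the mechanism by which the memory requirement and the density of the feature space are reduced in this development.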
It is also advantageous to determine markings on the basis of the document types of the respective electronic document, these markings being used for the selection of the added words.
The quality of the “prototypes” of the phoneme sequences is enhanced in that predetermined reserve words with corresponding phoneme sequences are taken into consideration. The phoneme sequences of the reserve words have a higher quality than the phoneme sequences that are formed by the text-phoneme converter, as a result whereof an enhancement of the speech recognition implemented later upon employment of the digital dictionary is achieved.
So that it is also possible to consider words that are spoken by a user, it is advantageous in a development of the method to identify and store a phoneme sequence for the respectively spoken voice signal.