The invention relates to a method of recognizing a spoken text. The spoken text uttered by a speaker is converted into first digital data which represent the spoken text. The first digital data are subjected to a speech recognition process which depends on: available lexicon data which represent a lexicon; available language model data which represent a language model; and available reference data which represent phonemes. Second digital data which represent a recognized text are generated by this process. The recognized text is displayed using the second digital data. Third digital data are generated to correct the recognized text represented by the second digital data, and a part of the second digital data is replaced by the third digital data; as a result, fourth digital data which represent a corrected text are obtained. Adaptation data for adapting the available reference data to the speaker of the spoken text are generated with the aid of the first digital data and the fourth digital data. Finally, the available reference data are adapted to the speaker of the spoken text with the aid of the adaptation data and the first digital data so as to obtain adapted reference data.
The invention further relates to a system for recognizing a spoken text. The system includes a conversion device by which the spoken text uttered by a speaker can be converted into first digital data which represent the spoken text. The system also includes a lexicon data device in which available lexicon data, which represent a lexicon, are stored. The system has a language model data device in which available language model data, which represent a language model, are stored. The system also has a reference data device in which available reference data, which represent phonemes, are stored. The system includes a speech recognition device, with which the lexicon data device, the language model data device and the reference data device cooperate, to which the first digital data are supplied, and which supplies second digital data which represent a recognized text and which are generated during a speech recognition process carried out on the basis of the first digital data. The system has a display device to which the second digital data are applied in order to display the recognized text. The system has an error correction device for the correction of the text represented by the second digital data and by which third digital data can be entered. A part of the second digital data is replaced by the third digital data, thereby generating fourth digital data which represent a corrected text. Finally, the system includes adaptation means to which the first digital data and the fourth digital data are applied and by which adaptation data for adapting the available reference data to the speaker of the spoken text can be generated.
The adaptation data and the first digital data are applied to the reference data device to adapt the available reference data to the speaker of the spoken text, and the reference data adapted to the speaker of the spoken text are stored in the reference data device.
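The flow of the four data sets named above can be illustrated with a minimal, hypothetical sketch. All names (`run_session`, `recognize`, `user_corrections`) are invented for illustration and are not part of the method or system described here; `recognize` stands in for the speech recognition device.

```python
# Hypothetical sketch of the data flow summarized above:
# first data (digitized speech) -> second data (recognized text)
# -> third data (user corrections) -> fourth data (corrected text).

def run_session(spoken_audio, recognize, user_corrections):
    first = list(spoken_audio)               # first digital data
    second = recognize(first)                # second digital data (recognized text)
    fourth = list(second)                    # start from the recognized text
    for position, word in user_corrections:  # third digital data entered by the user
        fourth[position] = word              # partial replacement of the second data
    return first, second, fourth
```

The point of the sketch is only that the fourth data are obtained by partially replacing the second data, while the first data remain available unchanged for the later adaptation step.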
A method of the type defined in the opening paragraph and a system of the type defined in the second paragraph are known from a so-termed speech recognition system which is commercially available from the Applicant under the type designation SP 6000. This known method will be described hereinafter with reference to FIG. 1, in which the various steps of the method which are relevant in the present context are represented diagrammatically as blocks.
In the known method of recognizing a text spoken into a microphone 1, shown diagrammatically in FIG. 1, by a speaker, the spoken text, in the form of analog electric signals supplied by the microphone 1, is converted in block 2 into first digital data by an analog-to-digital conversion process performed by an analog-to-digital converter. The resulting first digital data representing the spoken text are stored in memory block 3.
Moreover, the first digital data representing the spoken text are subjected to a speech recognition process performed by a speech recognition device in block 4. This speech recognition process depends on: lexicon data representing a lexicon and available in a lexicon data device in block 5; language model data representing a language model and available in a language model data device in block 6; and reference data representing phonemes and available in a reference data device in block 7. The lexicon data represent not only words of a lexicon but also the phoneme sequences associated with the words, i.e. the phonetic script. The language model data represent the frequency of occurrence of words as well as the frequency of occurrence of given sequences of words in texts. The reference data represent digital reference patterns for phonemes, i.e. for a given number of phonemes. These phonemes are pronounced differently by different speakers in a speaker-specific manner, as a result of which there are a multitude of speaker-specific reference patterns which form a speaker-specific reference data set for each phoneme. The quality of a speech recognition process improves as the reference data sets improve, i.e. as the reference patterns contained therein become better adapted to a speaker. For this reason, the known method adapts the reference data to each speaker, as will be explained hereinafter. The better the corrected text obtained by correcting the recognized text matches the spoken text, the better this adaptation performs.
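The three data stores described above can be pictured with a minimal, hypothetical sketch; every entry and structure below is invented for illustration and does not come from the SP 6000 system.

```python
# Lexicon data: words together with their associated phoneme
# sequences (the phonetic script).
lexicon = {
    "speech": ["s", "p", "iy", "ch"],
    "text": ["t", "eh", "k", "s", "t"],
}

# Language model data: frequency of occurrence of single words and of
# given word sequences (here, bigrams).
unigram_counts = {"speech": 120, "text": 300}
bigram_counts = {("speech", "text"): 45}

# Reference data: for each phoneme, a set of speaker-specific reference
# patterns (represented here as plain feature vectors).
reference_data = {
    "s": [[0.1, 0.4], [0.2, 0.3]],
    "iy": [[0.7, 0.9]],
}

def phonetic_script(word):
    # Look up the phoneme sequence associated with a word in the lexicon.
    return lexicon[word]
```

The sketch only fixes the shape of the data: the lexicon maps words to phoneme sequences, the language model counts word occurrences and word-sequence occurrences, and the reference data hold one set of reference patterns per phoneme.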
In the speech recognition process of block 4, phonemes and phoneme sequences are recognized on the basis of the first digital data representing the spoken text, with the aid of the reference data representing the phonemes; finally, words and word sequences are recognized on the basis of the recognized phonemes and phoneme sequences, with the aid of the lexicon data and the language model data.
In block 4, second digital data are generated which represent recognized text. These second digital data are loaded into memory block 8.
In block 9, the recognized text is displayed on display device 10 using the second digital data. The display device is preferably a monitor, shown diagrammatically in FIG. 1. The purpose of displaying the recognized text is to give a speaker or user, such as a typist, the opportunity to check the recognized text and to correct errors in it; preferably, likely errors are detected during the speech recognition process and pointed out by the system for correction.
In order to enable the recognized text to be checked in a simple manner, the first digital data representing the spoken text, stored in memory, are re-converted into analog electric signals in a digital-to-analog conversion process performed by a digital-to-analog converter in block 11. The signals are subsequently applied to loudspeaker 12, shown diagrammatically in FIG. 1, for acoustic reproduction of the spoken text. By listening to the acoustically reproduced spoken text and by simultaneously reading the displayed recognized text, the recognized text can be checked very simply for exactness or errors.
When the user detects an error in the text recognized in the speech recognition process of block 4, the user can carry out a correction process using an error correction device in block 13. Using a keyboard 14, shown diagrammatically in FIG. 1, the user generates third digital data to correct the recognized text represented by the second digital data. The second digital data are partly replaced by the third digital data in order to correct the recognized text in block 13, i.e. text portions, words or letters recognized as incorrect by the user are replaced by the correct text portions, words or letters entered using keyboard 14. This partial replacement of the second digital data with the entered third digital data results in fourth digital data representing a corrected text. The fourth digital data representing the corrected text are loaded into memory block 15. The stored fourth digital data are displayed in block 16 on display device 10, as shown diagrammatically in FIG. 1. This concludes the actual speech recognition process in the known method.
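The partial replacement described above can be sketched as a simple span substitution. The function name and the word-level granularity are assumptions made for illustration; the known system may well operate on other units.

```python
# Illustrative sketch: a word span of the recognized text (second data) marked
# as wrong by the user is replaced by words entered at the keyboard (third
# data), yielding the corrected text (fourth data).

def correct_recognized_text(recognized_words, start, end, replacement_words):
    # Words outside [start, end) are kept; the span itself is replaced.
    return recognized_words[:start] + replacement_words + recognized_words[end:]
```

Replacing a span with an empty list deletes it, and replacing with a longer list inserts text, so the same operation covers substitution, deletion, and insertion.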
However, as already stated hereinbefore, it is very effective in such a speech recognition process to adapt the reference data available in a reference data device in block 7 to the relevant speaker. This results in improved recognition quality during a subsequent speech recognition process of a further spoken text. In order to adapt the available reference data in the known method, adaptation data for the adaptation of the available reference data to the speaker of the spoken text are generated using the first digital data and the fourth digital data. The available reference data representing the phonemes are adapted to the speaker of the spoken text using the generated adaptation data and the first digital data, so that reference data adapted to the speaker of the spoken text are obtained. To generate the adaptation data, the known method carries out a verification process using a verification device in block 17. To carry out this verification process, the verification device receives the first digital data representing the spoken text as indicated by arrow 18, the second digital data representing the recognized text as indicated by arrow 19, the fourth digital data representing the corrected text as indicated by arrow 20, the lexicon data as indicated by arrow 21, and the reference data as indicated by arrow 22. Using all the data applied to it and complex heuristic methods in the verification process in block 17, in which inter alia a new speech recognition process is carried out, the verification device determines those text parts in the corrected text which best match corresponding text parts in the spoken text. The verification device uses the text recognized by the speech recognition device during the speech recognition process of a spoken text in block 4, taking into account the corrected text subsequently obtained by correction. 
These best matching text parts of the spoken text and the corrected text are represented by digital data, which form the afore-mentioned adaptation data. These adaptation data are loaded into memory block 23.
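The heuristic verification of the known system is not disclosed in detail. As a rough stand-in, the following sketch aligns the recognized text with the corrected text and keeps the word spans on which the two agree: where the user left the recognized text unchanged, the recognition was presumably correct, so those spans (paired with the corresponding portions of the spoken text) are candidates for adaptation data. The use of `difflib` and word-level matching are assumptions, not the system's actual method.

```python
from difflib import SequenceMatcher

def matching_text_parts(recognized_words, corrected_words):
    # Align the recognized text against the corrected text and collect the
    # word spans that the user did not change.
    sm = SequenceMatcher(a=recognized_words, b=corrected_words, autojunk=False)
    parts = []
    for block in sm.get_matching_blocks():  # last block is a zero-size sentinel
        if block.size:
            parts.append(recognized_words[block.a:block.a + block.size])
    return parts
```

In this toy alignment, a misrecognized word splits the text into the surrounding spans that were recognized correctly.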
Furthermore, the adaptation data stored in memory block 23 and the first digital data stored in memory block 3 are used to adapt the reference data stored in the reference data device in block 7, as indicated by arrows 24 and 25. As a result of this adaptation, the reference data (i.e. the reference patterns for the various phonemes) are better adapted to the speaker, which leads to better recognition quality during a subsequent speech recognition process of a further text uttered by this speaker.
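One simple way such an adaptation could work is to pull each stored reference pattern a fraction of the way toward a feature vector observed in the speaker's audio. This interpolation, its rate, and the function name are purely illustrative assumptions; the actual adaptation procedure of the known system is not disclosed here.

```python
# Illustrative sketch: interpolate a stored reference pattern toward a feature
# vector observed in the speaker's audio. The rate 0.2 is an arbitrary choice.

def adapt_reference_pattern(pattern, observation, rate=0.2):
    # Move every component a fraction `rate` of the way toward the observation.
    return [p + rate * (o - p) for p, o in zip(pattern, observation)]
```

Repeating this update over many matched text parts gradually shifts the speaker-specific reference patterns toward the speaker's actual pronunciation.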
As is apparent from the above description of the known method, the known speech recognition system SP 6000 includes a separate verification device which forms the adaptation means for generating adaptation data by which the reference data available in the reference data device are adapted, using the first digital data, to a speaker of a spoken text. The first digital data, the second digital data, the fourth digital data, the lexicon data, and the reference data are applied to this verification device. Using all the data applied to it and complex heuristic methods in a verification process in which, as already stated, a new speech recognition process is carried out, the verification device determines those text parts in the corrected text which best match corresponding text parts in the spoken text, and generates the adaptation data corresponding to the best matching text parts thus determined.