This invention relates to a speech recognition apparatus and a method thereof for recognizing words of a specific foreign language contained in a speech spoken by a speaker who has a specific native language, for example, a speech recognition apparatus and a method thereof for recognizing an English speech spoken by a Japanese speaker to output data (text data) indicating a string of English words contained in the speech.
This invention also relates to a pronunciation correcting apparatus and method for teaching a correct pronunciation to a speaker to correct the pronunciation utilizing data (candidate word data) obtained in said speech recognition apparatus and in the course of practicing said method.
Speech recognition apparatuses have so far been used for recognizing words contained in a speech spoken by an unspecified speaker to output the words as text data.
PUPA 06-12483, PUPA 08-50493 and PUPA 09-22297 (references 1-3), for example, disclose such speech recognition methods.
For example, when an English speech recognition apparatus using a conventional speech recognition method generates English text data from an English speech spoken by a Japanese speaker, the recognition rate is low. This is because the English language contains sounds which do not exist in the Japanese language (th, etc.) or which are difficult to distinguish in the Japanese language (l, r, etc.), and Japanese speakers are generally not capable of pronouncing such English sounds correctly, so that the English speech recognition apparatus translates an incorrect pronunciation into a word as it is. For example, even when a Japanese speaker intends to pronounce “rice” in English, the English speech recognition apparatus may recognize this pronunciation as “lice” or “louse”.
Such inconveniences may occur in various situations: conversely to the above, when an American speaker whose native language is English uses a speech recognition apparatus for generating a Japanese text from a speech in Japanese; when a British speaker whose native language is British English uses a speech recognition apparatus tuned for American English; or when a particular person has difficulty pronouncing correctly for some reason.
The speech recognition apparatuses disclosed in the above references, however, are unable to solve such inconveniences.
If the speaker's English pronunciation improves and approaches that of a native speaker, the recognition rate of the speech recognition apparatus naturally improves, and it is in fact desirable for the speaker to improve his or her English conversation.
For example, PEPA4-54965 discloses a learning apparatus which recognizes the English speech of a speaker and causes the speaker to affirm the recognized English speech (reference 4).
Also, PUPA60-123884, for example, discloses an English learning machine which lets the speaker listen to a speech to be learned by using a speech synthesizer LSI (reference 5).
A learning apparatus for learning pronunciation of foreign language is disclosed in many other publications including PEPA44-7162, PEPA H7-117807, PEPA61-18068, PEPA8-27588, PUPA62-111278, PUPA62-299985, PUPA3-75869, PEPA6-27971, PEPA8-12535, and PUPA3-226785 (references 6 to 14).
However, the speaker cannot necessarily attain a sufficient learning effect using the learning apparatuses disclosed in these references, because the speaker has to compare his or her own pronunciation with a presented pronunciation, or fails to find which part of his or her pronunciation is wrong.
This invention is conceived in view of the above described problems of the conventional technology and aims at providing a speech recognition apparatus and a method thereof for recognizing words contained in a speech of a predetermined language spoken by a speaker whose native language is other than the predetermined language (non native) and translating the words into the words of the predetermined language intended by the speaker to generate correct text data.
It is also an object of this invention to provide a speech recognition apparatus and a method thereof for translating a speech spoken by a speaker in any region into a word intended by the speaker to enable correct text data to be generated even when pronunciation of a same language varies due to the difference of the regions where the language is spoken.
It is also an object of this invention to provide a speech recognition apparatus and a method thereof which compensates for the difference of pronunciation by individuals to maintain a consistently high recognition rate.
It is another object of this invention to provide a pronunciation correcting apparatus and method for pointing out a problem of a speaker's pronunciation, and letting the speaker learn a native speaker's pronunciation to correct the speaker's pronunciation by utilizing data obtained from said speech recognition apparatus and in the course of practicing said method.
It is still another object of this invention to provide a speech correcting apparatus and method for correcting pronunciation which is capable of automatically comparing a speaker's pronunciation with a correct pronunciation to point out an error and presenting detailed information indicating how the speaker should correct the pronunciation.
In order to achieve the above objectives, this invention provides a first speech recognition apparatus for recognizing words from speech data representing one or more words contained in a speech comprising; candidate word correlating means for correlating each of one or more of said speech data items of words to one or more sets of candidates (candidate words) comprising a combination of one or more of said words obtained by recognizing each of one or more of said speech data items, analogous word correlating means for correlating each of said candidate words correlated to each of one or more of the speech data items of said words to null or more sets of a combination of one or more of said words (analogous words) which may correspond to pronunciation of each of said candidate words, and speech data recognition means for selecting either said candidate word correlated to each of one or more of said speech data items of words or said analogous word correlated to each of said candidate word as a recognition result of each of said speech data items of words.
Preferably, said speech data represents one or more words contained in a speech of a predetermined language, said candidate correlating means correlates each of one or more speech data items of said words to one or more sets of candidate words of said predetermined language obtained by recognizing each of the one or more speech data items, said analogous word correlating means correlates each of said candidate words correlated to each of the one or more speech data items of said words to null or more sets of analogous words of said predetermined language which may correspond to the pronunciation of each of said candidate words, and said speech data recognition means selects either said candidate word correlated to each of one or more of said speech data items of words or said analogous word correlated to each of said candidate word as a recognition result of each of one or more of speech data items of said words.
Preferably, the speech of said predetermined language is pronounced by a speaker who mainly speaks a language other than said predetermined language, the speech recognition apparatus is provided with analogous word storage means for storing null or more sets of words of said predetermined language which may correspond to each of one or more speech data items of the words contained in the speech of said predetermined language in correlation to each of one or more words of said predetermined language as said analogous word of each of one or more words of said predetermined language when each of one or more words of said predetermined language is pronounced by said speaker, and said analogous word storage means correlates null or more of said analogous words which were stored beforehand in correlation to each of one or more words of said predetermined language to each of said candidate words.
Preferably, said candidate word correlating means associates each of said candidate words correlated to the speech data with probability data indicating a likelihood of each of said candidate word correlated to the speech data, and said speech data recognition means selects only said candidate word having a value of said probability data within a predetermined range as the result of the recognition of the speech data of said words.
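The selection by probability range can be illustrated with a minimal Python sketch. The function name filter_by_probability, the representation of candidates as (word, probability) pairs, and the example values are assumptions made for illustration only; the specification does not prescribe them.

```python
def filter_by_probability(candidates, low, high=1.0):
    """Keep only candidate words whose probability lies within [low, high].

    `candidates` is a list of (word, probability) pairs (hypothetical layout).
    """
    return [word for word, probability in candidates if low <= probability <= high]


# Hypothetical recognition output for one speech data item.
example = [("lead", 0.6), ("lied", 0.1)]
kept = filter_by_probability(example, low=0.3)
```

With the assumed values, only the candidate whose probability falls inside the range remains available as a recognition result.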
Preferably, said candidate word correlating means associates each of said candidate words correlated to the speech data with error information indicating an error of pronunciation corresponding to each of said candidate words.
The speech recognition apparatus of this invention recognizes a word contained in an English speech (voice) pronounced by a speaker (a Japanese speaker, for example) whose native language (Japanese language, for example) is other than a predetermined language (English language, for example) and who mainly speaks the native language and translates it to an English word to generate text data.
In the speech recognition apparatus of this invention, an English speech (speech data) spoken by a Japanese speaker, inputted from a microphone, etc., and converted to digital data is converted to quantized vector data according to features of the sound (loudness, intensity and intonation, etc., of the sound) and further converted to sound data which is analogous to a phonetic symbol and called a label for output to the candidate word correlating means.
The candidate word correlating means processes the speech data converted to a label word by word or by a series of a plurality of words and correlates the speech data to a single English word or a combination of a plurality of English words (collectively called a candidate word) as a candidate of the result of recognizing the speech data.
The analogous word storage means stores dictionary data for retrieval, for example, in which a single English word or a combination of a plurality of English words which can be a candidate word is correlated beforehand to a single English word or a combination of a plurality of English words (collectively called an analogous word) which may correspond to the speech data when a Japanese speaker pronounces English, though not with a correct English pronunciation.
For example, in order to deal with an inaccurate English pronunciation by a Japanese speaker, a single English word “lead” which may be a candidate word is correlated to an analogous word “read” (in consideration of “l” and “r” which are difficult for Japanese speakers to discriminate since Japanese speakers generally cannot correctly pronounce “r”) in the dictionary data. Occasionally, there is no analogous word to an English word. In such a case, an analogous word is not correlated to the English word.
The analogous word correlating means searches the dictionary data stored in the analogous word storage means to read out an analogous word correlated to a candidate word and correlates the analogous word to the candidate word. In the above example, speech data corresponding to an English word “read” pronounced by a Japanese speaker is correlated to an English word “lead” and an analogous word “read”.
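The dictionary lookup described above can be sketched in Python as follows. The dictionary contents, the name ANALOGOUS_WORDS, and the function correlate_analogous_words are hypothetical and serve only as an illustration; the specification does not prescribe a particular data structure.

```python
# Hypothetical dictionary data: candidate word -> analogous words.
# A word with no entry has no analogous word.
ANALOGOUS_WORDS = {
    "lead": ["read"],
    "lice": ["rice"],
}


def correlate_analogous_words(candidates):
    """Map each candidate word to its (possibly empty) list of analogous words."""
    return {word: list(ANALOGOUS_WORDS.get(word, [])) for word in candidates}


# Candidate words obtained for one utterance (assumed values).
correlated = correlate_analogous_words(["lead", "table"])
```

A candidate word without an entry in the dictionary is simply correlated to an empty list, which corresponds to the case in which no analogous word exists.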
The speech data recognition means selects either a candidate word correlated to speech data or an analogous word as a result of recognition, based on a syntactic parsing of a string of English words so far recognized or in response to a selection by a user, for example.
The components of the speech recognition apparatus of this invention process speech data inputted one after another in the manner described above to recognize English words contained in the speech data and generate text data concatenating the recognized English words.
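The selection step can be sketched in Python as follows. The scoring function stands in for the syntactic parsing or the user selection mentioned above; the names select_word and PLAUSIBLE_PAIRS and the word-pair table are assumptions for illustration and are not part of the specification.

```python
def select_word(candidate, analogous, score):
    """Choose the best-scoring word among a candidate and its analogous words.

    `score` is a caller-supplied function (a stand-in for syntactic parsing).
    On a tie the candidate word is kept, because it is listed first.
    """
    options = [candidate] + list(analogous)
    return max(options, key=score)


# Toy stand-in for syntactic parsing: word pairs assumed to be plausible.
PLAUSIBLE_PAIRS = {("eat", "rice")}


def score_after(previous_word):
    """Build a scoring function that favors words plausible after `previous_word`."""
    return lambda word: 1 if (previous_word, word) in PLAUSIBLE_PAIRS else 0


chosen = select_word("lice", ["rice"], score_after("eat"))
```

In this sketch the analogous word replaces the candidate word only when the context scores it strictly higher; otherwise the candidate word is retained.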
While an English speech by a Japanese speaker has been so far described as an example, the speech recognition apparatus of this invention can recognize both an English speech in a British pronunciation and one in an American pronunciation to generate text data by modifying the analogous word storage means such that it stores dictionary data which correlates an analogous word which may correspond to speech data to a candidate word when the speech is pronounced in a British English pronunciation which is different from an American English pronunciation.
In this way, the scope of the above “predetermined language” is defined as a scope in which speech data can be correlated to a word with a sufficient recognition rate. Therefore, dialects (English languages in the US, England, Australia, and South Africa, etc., and Spanish languages in Spain and South American countries, for example) for which a sufficient recognition rate is not obtained only by a candidate word correlating means adjusted for one of the dialects are not included in a same scope of said “predetermined language”, even if they are normally said to be a same language, because they are pronounced differently due to a geographical separation. The same is true when the pronunciation of a particular person becomes obscure for some reason or when a sufficient recognition rate is not obtained only with a candidate word correlating means which is adjusted to the native language (the language mainly spoken).
The second speech recognition apparatus of this invention recognizes one or more words of said predetermined language from speech data representing one or more words of said predetermined language contained in a speech of said predetermined language spoken by a speaker who mainly speaks a language other than the predetermined language, and comprises word correlating means for correlating each of one or more speech data items of words of said predetermined language to a word of said predetermined language obtained by recognizing each of the one or more speech data items and/or one or more words of said predetermined language which are possibly spoken by said speaker, and speech data recognition means for selecting one of words each correlated to each of one or more speech data items of said words as a result of recognition of each of one or more speech data items of said words.
The first speech recognition method of this invention recognizes words from speech data representing one or more words contained in a speech and comprises the steps of; correlating each of one or more of said speech data items of words to one or more sets of candidates (candidate words) comprising a combination of one or more of said words obtained by recognizing each of one or more of said speech data items, correlating each of said candidate words correlated to each of one or more of the speech data items of said words to null or more sets of a combination of one or more of said words (analogous words) which may correspond to pronunciation of each of said candidate words, and selecting either said candidate word correlated to each of one or more of said speech data items of words or said analogous word correlated to each of said candidate word as a recognition result of each of said speech data items of words.
The second speech recognition method of this invention recognizes one or more words of said predetermined language contained in a speech of said predetermined language spoken by a speaker who mainly speaks a language other than the predetermined language, and comprises the steps of; correlating each of one or more speech data items of words of said predetermined language to a word of said predetermined language obtained by recognizing each of the one or more speech data items and/or one or more words of said predetermined language which are possibly spoken by said speaker, and selecting one of words each correlated to each of one or more speech data items of said words as a result of recognition of each of one or more speech data items of said words.
The speech correcting apparatus of this invention comprises; candidate word correlating means for correlating each of one or more of said speech data items of words to one or more candidates of words (candidate words) obtained by recognizing said speech data items indicating the words, analogous word correlating means for correlating each of said candidate words correlated to the speech data items to null or more words (analogous words) which may correspond to pronunciation of each of said candidate words, and pronunciation correcting data output means for outputting pronunciation correcting data corresponding to the same analogous word indicated by said speech data item and correcting the pronunciation of the word indicated by said speech data item when the word indicated by said speech data item matches said analogous word correlated to each of said candidate words which are correlated to said speech data item.
In the speech correcting apparatus of this invention, the candidate word correlating means and analogous word correlating means correlate the speech data items to the candidate words and the analogous words in a manner similar to the speech recognition apparatus of this invention described above.
When the speaker pronounces as correctly as a native speaker, the word intended by the speaker and the result of recognizing the speech data will be included in the candidate word. On the other hand, if the speaker's pronunciation is wrong or obscure, the result of recognizing the speech data item is included in the analogous word though the word intended by the speaker is included in the candidate word. Therefore, when a speaker is given a word to pronounce and pronounces that word, and if that word matches an analogous word as a result of recognizing the speech data item, this means that the pronunciation by the user (speaker) contains some error or that the pronunciation is obscure.
When the word given to the speaker matches an analogous word, the pronunciation correcting data output means displays information correcting the error or obscurity of the pronunciation (for example, image data showing the movement of the mouth and tongue of a native speaker in pronouncing correctly, and text data showing a sentence telling which part of the speaker's pronunciation is wrong when compared to a native speaker) on a monitor, prompting the speaker to correct the pronunciation and assisting the learning so that the speaker's pronunciation approaches a native speaker's pronunciation.
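The check that triggers the output of pronunciation correcting data can be sketched in Python as follows. The function name pronunciation_feedback, the dictionary layouts, and the correction text are hypothetical; the sketch additionally assumes that no correction is needed when the given word itself appears among the candidate words.

```python
def pronunciation_feedback(target, candidates, analogous_words, corrections):
    """Return correction text when the given word matches only an analogous word.

    target          -- the word the speaker was asked to pronounce
    candidates      -- candidate words obtained by recognizing the speech data
    analogous_words -- dict: candidate word -> list of analogous words
    corrections     -- dict: (candidate, target) -> correction text (hypothetical)
    """
    if target in candidates:
        return None  # assumed: recognized directly, so no correction is output
    for candidate in candidates:
        if target in analogous_words.get(candidate, []):
            return corrections.get(
                (candidate, target),
                "Pronunciation differs from the expected word.",
            )
    return None


# Hypothetical data for the "lead" / "read" example.
ANALOGOUS_WORDS = {"lead": ["read"]}
CORRECTIONS = {("lead", "read"): 'The "r" in "read" was heard as "l".'}

message = pronunciation_feedback("read", ["lead"], ANALOGOUS_WORDS, CORRECTIONS)
```

In an actual apparatus the returned text would be accompanied by, or replaced with, the image data described above; the sketch returns text only to keep the example short.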
The speech correcting method of this invention comprises the steps of, correlating each of one or more of said speech data items of words to one or more candidates of words (candidate words) obtained by recognizing said speech data items indicating the words, correlating each of said candidate words correlated to the speech data items to null or more words (analogous words) which may correspond to pronunciation of each of said candidate words, and outputting pronunciation correcting data corresponding to the same analogous word indicated by said speech data item and correcting the pronunciation of the word indicated by said speech data item when the word indicated by said speech data item matches said analogous word correlated to each of said candidate words which are correlated to said speech data item.