A phone is the minimum unit of sound articulated during speech. A phoneme is the minimum unit of speech that distinguishes one word from another. A phoneme may consist of one or more phones, and variations of phones (i.e., allophones) may be used without changing the meaning of the corresponding word. The exact number of phonemes in English depends on the speaker, but it is accepted that English contains between 40 and 45 phonemes, which is about average. At the extremes, one language contains only 10 while another contains 141.
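The relationship between phones, allophones, and phonemes can be illustrated with a minimal sketch. The table `ALLOPHONE_TO_PHONEME` and the function `to_phonemes` are hypothetical names introduced here for illustration only. The sketch relies on the well-known English example that aspirated [pʰ] (as in "pin") and unaspirated [p] (as in "spin") are allophones of the single phoneme /p/.

```python
# Hypothetical allophone-to-phoneme table covering a handful of English sounds.
# It is an illustrative assumption, not a complete inventory.
ALLOPHONE_TO_PHONEME = {
    "pʰ": "p",  # aspirated p, as in "pin"
    "p": "p",   # unaspirated p, as in "spin"
    "ɪ": "ɪ",
    "n": "n",
    "s": "s",
}


def to_phonemes(phones):
    """Collapse a sequence of phones to the phonemes they realize."""
    return [ALLOPHONE_TO_PHONEME[phone] for phone in phones]


# Two pronunciations of "pin" that differ only in the allophone of /p/.
aspirated = ["pʰ", "ɪ", "n"]
unaspirated = ["p", "ɪ", "n"]

# Both collapse to the same phoneme sequence, so the word is unchanged.
same_word = to_phonemes(aspirated) == to_phonemes(unaspirated)
```

Swapping one allophone for another leaves the phoneme sequence, and therefore the word, the same.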
Phonetic writing, or transcription, is the representation of the sounds of speech with a set of distinct symbols. For example, the letters used to spell the word “call” are intended to represent the meaning of the word, whereas the phonetic representation of the word (i.e., “kO1”) is meant to represent how the word should be pronounced. The primary script used for phonetic writing is the International Phonetic Alphabet.
Speech processing applications often require a database of writings (corpus) or a number of such databases (corpora). Corpora have been generated with and without phonetic information. Phonetic information is required for any speech processing application that performs an interpretation of, or conversion to, a spoken sound.
Speech data is phonetically labeled either manually or by an automatic method commonly referred to as brute-force alignment, or force alignment for short. In force alignment, each phone in question is associated with the closest phonetic sound available. For example, suppose a book written in English were force aligned using two different databases of phones (e.g., English and Italian), each with a wide enough selection of phones to cover all of the words used in the book. Both results would be understandable to an English-speaking person, but the result that used the Italian phone database would have a distinctive Italian sound to it, because the closest-sounding Italian phone would not exactly match the sound of the corresponding English phone.
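The "closest phonetic sound available" idea can be sketched as a nearest-neighbor lookup. This is only an illustration under stated assumptions, not a full aligner: the two-dimensional feature values, the inventories `ENGLISH_PHONES` and `ITALIAN_PHONES`, and the helper names are all hypothetical, and a practical system would use richer acoustic models and would also align the labels in time.

```python
import math

# Hypothetical phone inventories. Each phone is described by two made-up
# acoustic features; the numbers are placeholders, not measurements.
ENGLISH_PHONES = {"i": (280.0, 2250.0), "ae": (690.0, 1660.0), "u": (310.0, 870.0)}
ITALIAN_PHONES = {"i": (290.0, 2300.0), "a": (750.0, 1300.0), "u": (320.0, 800.0)}


def closest_phone(features, inventory):
    """Return the label in `inventory` nearest to `features` (Euclidean distance)."""
    return min(inventory, key=lambda label: math.dist(features, inventory[label]))


def label_segments(segments, inventory):
    """Label each speech segment with its closest phone from `inventory`."""
    return [closest_phone(features, inventory) for features in segments]


# Three hypothetical segments taken from English speech.
segments = [(285.0, 2260.0), (700.0, 1600.0), (315.0, 850.0)]

english_labels = label_segments(segments, ENGLISH_PHONES)
italian_labels = label_segments(segments, ITALIAN_PHONES)
```

With these placeholder values, the second segment is labeled "ae" against the English inventory but "a" against the Italian one, because the Italian inventory has no closer phone. This mirrors the mismatch described above.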
With the ever-increasing globalization of activities, it is becoming more important for speech processing applications to process more than one language and, therefore, recognize the phones of more than one language.
One approach to adding multi-lingual capability to speech processing applications is to combine corpora which have been tailored to a specific language. However, blindly combining corpora that have been phonetically transcribed by different methods or people often produces worse results than using just one corpus. The reason for this is that inaccuracies may be introduced by a method or person having its own threshold for determining when a textual unit matches a phone and vice versa. Since the difference in sound between one word and a word of another meaning can be very slight, any inaccuracy in the interpretation or conversion of a sound could result in something that is totally unintelligible. For example, it has been reported that voice recognition systems trained on corpora of American English speakers do a poor job of interpreting the words of British English speakers.
In an article entitled “Learning Name Pronunciations in Automatic Speech Recognition Systems,” Francoise Beaufays et al. disclose a method of learning proper name pronunciation by finding the phone sequence that best matches a sample speech waveform. The method employs linguistic knowledge to determine if the resulting pronunciation is linguistically reasonable.
U.S. Pat. No. 5,758,023, entitled “MULTI-LANGUAGE SPEECH RECOGNITION SYSTEM,” discloses a device for and method of transcribing speech into one of many pre-determined spoken languages by identifying phones, combining the phones into phonemes, and translating the phonemes into the desired foreign language. U.S. Pat. No. 5,758,023 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,950,159, entitled “WORD SPOTTING USING BOTH FILLER AND PHONE RECOGNITION,” discloses a device for and method of word spotting by processing acoustic data to identify phones, generate temporal delimiters, generate likelihood scores, identify sequences of phones, and use the temporal delimiters and likelihood scores to recognize keywords. U.S. Pat. No. 5,950,159 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,073,095, entitled “FAST VOCABULARY INDEPENDENT METHOD AND APPARATUS FOR SPOTTING WORDS IN SPEECH,” discloses a device for and method of spotting words by using Viterbi-beam phone level decoding with a tree-based phone language model. U.S. Pat. No. 6,073,095 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,385,579 B1, entitled “METHODS AND APPARATUS FOR FORMING COMPOUND WORDS FOR USE IN A CONTINUOUS SPEECH RECOGNITION SYSTEM,” discloses a device for and method of identifying consecutive word pairs and replacing the same with a corresponding compound word. U.S. Pat. No. 6,385,579 B1 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. Appl. No. 2003/0135356 A1, entitled “METHOD AND APPARATUS FOR DETECTING PROSODIC PHRASE BREAK IN A TEXT TO SPEECH (TTS) SYSTEM,” discloses a device for and method of processing speech by receiving text, identifying parts of speech, generating a part of speech sequence, detecting prosodic phrase break using a neural network, and generating a prosodic phrase boundary based on the prosodic breaks. U.S. Pat. Appl. No. 2003/0135356 A1 is hereby incorporated by reference into the specification of the present invention.
There is a need to add multi-lingual capability to speech processing applications. To do this, one needs to be able to recognize phones from multiple languages.