This invention relates to identification of the language of a speaker using a voice system. In particular, it relates to extraction of articulatory factors from an acoustic signal to distinguish between different languages and further identify the original accent of a foreign speaker.
In a multilingual environment, IVR (interactive voice response) services need to enter into an initial negotiation with the caller to establish which language should be used for outgoing prompts. To provide switching of this kind automatically would be an advantage. One current method requires the caller to say a key word, which is either recognized directly from a multilingual recognition vocabulary or presented to several language-specific models, with response time and confidence value used to determine the language. Calling line identification (CLID) can be used for previously identified telephone lines, but it is not robust when a different caller uses the same line. Another method is to request, via DTMF selection, that the caller make an explicit choice.
One language recognition method uses phoneme analysis on whole utterances. U.S. Pat. No. 5,636,325, assigned to IBM Corporation, discloses a system for speech synthesis and analysis of dialects. A set of intonation intervals for a chosen dialect is applied to the intonational contour of a phoneme string derived from a single set of stored linguistic units, e.g., phonemes. Sets of intonational intervals are stored to simulate or recognize different dialects or languages from a single set of stored phonemes. The interval rules preferably use a prosodic analysis of the phoneme string or other cues to apply a given interval to the phoneme string. A second set of interval data is provided for semantic information. The speech system is based on the observation that each dialect and language possesses its own set of musical relationships or intonation intervals. These musical relationships are used by a human listener to identify the particular dialect or language. The speech system may be either a speech synthesis or speech analysis tool or may be a combined speech synthesis/analysis system.
Another known language recognition method uses phonetic analysis of vowel sounds. U.S. Pat. No. 5,689,616 discloses a language identification and verification system whereby language is determined by finding the closest match of a speech utterance to multiple speaker sets. It is implemented using speaker baseline references in a plurality of languages and comparing unknown speech input with the references to find the closest fit. The system uses phonetic speech features derived from vocalic or syllabic nuclei using Hidden Markov Model analysis and compares them with stored phonetic references.
The segment-based and syllabic nuclei approaches require segment identification of the individual phonemes, so they are not ideal for applications where there is no speech recognition capability. IVR services which do not support speech recognition do not have the resources to perform phoneme recognition, and there is a need to perform language identification with a lower resource requirement.
In one aspect of the invention there is provided a method of determining a language set for use in an interactive voice response system comprising the steps of providing a plurality of samples from a voice signal, calculating a non-phonetic characteristic of each sample, and selecting a corresponding language set based on the non-phonetic characteristic.
In one embodiment, the non-phonetic characteristic is based on a first and second formant frequency for each sample. In another embodiment, it may be based on the fundamental frequency contour. In another embodiment, it may be based on the duration of voicing, and in another embodiment on the bandwidth characteristics in the spectral sections.
In an embodiment, the non-phonetic characteristic is based on the average first and second formant frequency for the plurality of samples. In this way, a determination of the language category can be made without phonetic analysis and the resources associated with it.
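By way of illustration only, the averaging and selection described above might be sketched as follows. The function names, the use of squared Euclidean distance, and the reference values are assumptions made for this sketch and are not taken from the specification; each sample is assumed to be an (F1, F2) pair in hertz.

```python
from statistics import mean


def average_formants(samples):
    """Return the mean (F1, F2) in Hz over a sequence of (F1, F2) pairs."""
    f1 = mean(f1 for f1, _ in samples)
    f2 = mean(f2 for _, f2 in samples)
    return f1, f2


def nearest_language(avg, references):
    """Return the name of the reference whose stored (F1, F2) is closest to avg.

    `references` maps a language-set name to a stored (F1, F2) average.
    Squared Euclidean distance is an assumed similarity measure.
    """
    def dist2(name):
        r1, r2 = references[name]
        return (r1 - avg[0]) ** 2 + (r2 - avg[1]) ** 2

    return min(references, key=dist2)


# Hypothetical reference values, for illustration only.
REFERENCES = {"language_a": (620, 1380), "language_b": (400, 2000)}

avg = average_formants([(500, 1500), (700, 1300)])
selected = nearest_language(avg, REFERENCES)
```

No phoneme segmentation or recognition is involved in this sketch; only per-sample formant values are needed.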
Advantageously, there is provided a further step of calculating the displacement of each sample from the averaged first and second formant frequency and calculating a second factor based on the average displacement of the samples, wherein the nearest matching reference is compared against the first and second factors. The formants are normalized to a theoretical ratio of 3F1=F2. The second formant frequency is a weighted combination of the second and further formant frequencies.
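As a minimal sketch of the displacement calculation, assuming that displacement is measured as Euclidean distance in the (F1, F2) plane, the two factors might be computed as below. The distance measure and the function name are assumptions. The normalization to 3F1=F2 and the weighting of higher formants are omitted because their exact form is not given here.

```python
import math


def formant_factors(samples):
    """Return ((mean F1, mean F2), average displacement) for (F1, F2) pairs.

    The first element is the averaged formant position; the second is the
    mean distance of the samples from that position (assumed Euclidean).
    """
    n = len(samples)
    c1 = sum(f1 for f1, _ in samples) / n
    c2 = sum(f2 for _, f2 in samples) / n
    displacement = sum(math.hypot(f1 - c1, f2 - c2) for f1, f2 in samples) / n
    return (c1, c2), displacement


centre, spread = formant_factors([(500, 1500), (700, 1500)])
```

Under these assumptions, a small average displacement indicates samples clustered near the mean formant position, and a large one indicates samples spread more widely around it.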
The first and second formants are only acquired for fully voiced samples where the fundamental frequency is not substantially zero.
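The voicing condition might be applied as a simple filter before any averaging, for example as sketched below. The frame layout (F0, F1, F2) and the 50 Hz threshold standing in for "not substantially zero" are assumptions made for illustration.

```python
def voiced_only(frames, min_f0_hz=50.0):
    """Keep (F1, F2) only for frames whose fundamental frequency exceeds a threshold.

    `frames` is a sequence of (F0, F1, F2) tuples in Hz. The threshold is an
    assumed stand-in for "not substantially zero".
    """
    return [(f1, f2) for f0, f1, f2 in frames if f0 > min_f0_hz]


frames = [(0.0, 480, 1500), (120.0, 500, 1500), (3.0, 900, 1100)]
voiced = voiced_only(frames)
```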
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention.