In addition to providing printed telephone directories, telephone companies provide information services to their subscribers. The services may include stock quotes, directory assistance and many others. In most of these applications, when the information requested can be expressed as a number or number sequence, the user is required to enter his request via a touch tone telephone. This is often aggravating for the user since he is usually obliged to make repetitive entries in order to obtain a single answer. This situation becomes even more difficult when the input information is a word or phrase. In these situations, the involvement of a human operator may be required to complete the desired task.
Because telephone companies are likely to handle a very large number of calls per year, the associated labour costs are very significant. Consequently, telephone companies and telephone equipment manufacturers have devoted considerable efforts to the development of systems which reduce the labour costs associated with providing information services on the telephone network. These efforts comprise the development of sophisticated speech processing and recognition systems that can be used in the context of telephone networks.
In a typical speech recognition system, the user enters his request using isolated-word, connected-word or continuous speech via a microphone or telephone set. The request may be a name, a city or any other type of information for which either a function is to be performed or information is to be supplied. If valid speech is detected, the speech recognition layer of the system is invoked in an attempt to recognise the unknown utterance. The speech recognition process can be split into two steps, namely a pre-processing step and a search step. The pre-processing step, also called the acoustic processor, performs the segmentation, the normalisation and the parameterisation of the input signal waveform. Its purpose is traditionally to transform the incoming utterance into a form that facilitates speech recognition. Typically, feature vectors are generated at this step. Feature vectors are used to identify speech characteristics such as formant frequencies, fricatives, silence, voicing and so on. Therefore, these feature vectors can be used to identify the spoken utterance. In the second step of the speech recognition process, the search step, the entries of a speech recognition dictionary are scored in order to find possible matches to the spoken utterance based on the feature vectors generated in the pre-processing step. The search may be done in several passes in order to maximise the probability of obtaining the correct result in the shortest possible time, most preferably in real time. Typically, in a first-pass search, a fast match algorithm is used to select the top N orthographies from the speech recognition dictionary. In a second-pass search, the individual orthographies are re-scored using more precise likelihood calculations.
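The two-pass search described above can be illustrated with a minimal sketch. It is not the scoring method of any particular recogniser: the function names, the toy dictionary, the use of a mean-vector distance as the "fast match", and the use of dynamic time warping in place of the "more precise likelihood calculations" are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(frames):
    """Average of a sequence of feature vectors (a very coarse summary)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def fast_match(frames, dictionary, n):
    """First pass (assumed): rank orthographies by mean-vector distance, keep top n."""
    query = mean_vector(frames)
    ranked = sorted(
        dictionary,
        key=lambda orth: euclidean(query, mean_vector(dictionary[orth])),
    )
    return ranked[:n]

def dtw_distance(a, b):
    """Second-pass score (assumed): dynamic time warping cost between two sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = euclidean(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def two_pass_search(frames, dictionary, n=2):
    """Select the top n candidates cheaply, then re-score only those candidates."""
    candidates = fast_match(frames, dictionary, n)
    return sorted(candidates, key=lambda orth: dtw_distance(frames, dictionary[orth]))

# Hypothetical dictionary: orthography -> template sequence of 2-D feature vectors.
toy_dictionary = {
    "montreal": [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    "toronto": [[2.0, 2.0], [1.0, 1.0], [0.0, 0.0]],
    "ottawa": [[5.0, 5.0], [5.0, 5.0]],
}
utterance = [[0.0, 0.1], [1.0, 1.0], [2.0, 1.9]]
best_first = two_pass_search(utterance, toy_dictionary, n=2)
```

In this toy case the coarse first pass cannot separate "montreal" from "toronto", because their mean vectors are identical, but it does discard "ottawa". Only the more expensive second pass, which takes the order of the frames into account, ranks the two remaining candidates.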
The performance of automatic speech recognisers depends significantly on various environmental factors such as additive noise, echoes, transmission and transducer characteristics, as well as the level of background speech and speech-like sounds. These distortions are carried in the parameters and features of the input speech and can significantly alter the recognition results. In order to build speech recognisers that are robust under adverse operating conditions, data representing these conditions must be used to train the speech recognition models. However, in certain applications, such as telephone applications, the speech-input conditions are not known in advance and cannot be controlled. For example, the caller may be calling from a restaurant or from a busy conference hall, where the noise may significantly affect the speech recognition process.
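As an illustration of what "data representing these conditions" can mean for additive noise, the sketch below mixes a noise recording into clean speech at a chosen signal-to-noise ratio. This is a common way of simulating a known condition and is offered only as an assumption-laden example: the function names and the synthetic signals are hypothetical, and the sketch does not address the case raised above in which the conditions are unknown in advance.

```python
import math

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio equals snr_db.

    Assumes both inputs are non-empty lists of float samples and that the
    noise is not all zeros. The noise is repeated to cover the whole utterance.
    """
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    # SNR (dB) = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(tiled))
    return [s + gain * n for s, n in zip(speech, tiled)]

# Synthetic stand-ins for real recordings (hypothetical data).
clean = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
babble = [1.0 if t % 2 == 0 else -1.0 for t in range(50)]
noisy = mix_at_snr(clean, babble, snr_db=10)
```

Lower values of `snr_db` correspond to harsher conditions, such as the busy conference hall in the example above.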
Thus, there exists a need in the industry to refine the speech recognition process so as to obtain a speech recognition apparatus that is more robust in the presence of noise and background speech.