In speech recognition systems, the main objective is to make a machine understand an utterance made by a human speaker. Thus, speech recognition is used to facilitate a man-machine interface (MMI) by allowing commands, text and data to be entered into the machine directly by speech.
In speech recognition, the task for the computer is to transform the acoustic input signal into text, a so-called transcription. The characteristics of the input signal vary within a broad range for the same word depending on the sex, age, dialect, etc. of the speaker. Furthermore, if several words are entered into the system at the same time, for example if a whole sentence is given to the speech recognition system, the pronunciation of a given word may differ depending on the words preceding and/or succeeding it.
Furthermore, the presence of noise and echoing effects may distort the original signal before it enters the speech recognition system.
In general, speech recognition systems can be divided into two main groups:
        i) Speaker independent systems and
        ii) Speaker dependent systems
Speaker independent systems, in particular those designed for a large vocabulary and for accepting speech without pauses between the different words, i.e. sentences or parts thereof, require large speech databases and make use of different statistical properties of speech and words. Grammatical rules and predictions of what is likely to be said can also be incorporated in such systems.
Speaker dependent systems, on the other hand, and in particular those using a limited vocabulary (typically a few hundred words) and where only one word is spoken at a time, do not require any large databases. Instead, such systems require training by the particular speaker or, in some cases, speakers using the system.
A speaker dependent speech recognition system generally provides much better performance than a speaker independent system, for a number of reasons. For example, the number of words is limited, and the system has exact knowledge of how a particular word should sound, since it has been trained by the particular person using the system.
However, a speaker dependent system can only be used for a limited range of applications. One application in which a speaker dependent system is preferable to a speaker independent system is, for example, the entry of commands to a machine.
In such a case the task for the speech recognition system is to transcribe the command given orally into a form which can be understood by the machine, i.e. usually a binary word, which is used for controlling the machine. For example, commands such as “Go”, “Stop”, “Left”, “Right”, “Yes”, “No”, etc. can be given orally to a machine, which then executes the corresponding actions.
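The mapping from a recognised command to a binary control word can be illustrated with a minimal sketch. The particular bit patterns and the name `command_to_code` are assumptions made purely for illustration; the text does not specify any encoding.

```python
# Hypothetical 3-bit control words for the example commands in the text.
# The specific codes are illustrative assumptions, not taken from any real system.
COMMAND_CODES = {
    "go": 0b000,
    "stop": 0b001,
    "left": 0b010,
    "right": 0b011,
    "yes": 0b100,
    "no": 0b101,
}


def command_to_code(word):
    """Return the binary control word for a recognised command, or None if unknown."""
    return COMMAND_CODES.get(word.strip().lower())


if __name__ == "__main__":
    for spoken in ("Go", "Stop", "Left", "Jump"):
        code = command_to_code(spoken)
        label = "no such command" if code is None else format(code, "03b")
        print(spoken, "->", label)
```

Returning `None` for an unknown word keeps the lookup separate from whatever action the machine takes when no valid command is recognised.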
Nevertheless, even though the number of possible words that the machine has to recognise is limited, typically to a few hundred words, and even though the speech recognition system of the machine has been trained by the voice of the user and therefore has exact knowledge of how a particular word sounds when spoken by that particular user, a number of possible sources of wrong decisions still exist.
For example, noise and echoing effects in the environment will distort the signal entering the speech recognition system. Also, the frequency spectrum of the same word will vary slightly from time to time, in particular if the speaker has a cold or the like.
Another problem is that processing the words, even though their number is limited to, typically, a few hundred, requires a very large amount of processing power. In a typical speech recognition system the sampling rate is 8000 samples per second, and each sample consists of about 13 bits. As a result, a typical word, which lasts for about one second, consists of about 100 000 bits.
Thus, in a system where real time constraints exist, for example requiring a response time of 1 second or less, the speech recognition system has to be able to process the large amount of information contained in each word very quickly.
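The figure of about 100 000 bits follows directly from the sampling parameters given above. The short sketch below reproduces the arithmetic; the function name `bits_per_word` is introduced here only for illustration.

```python
# Values taken from the text: 8000 samples per second, about 13 bits per sample,
# and a typical word duration of about one second.
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 13
WORD_DURATION_S = 1.0


def bits_per_word(sample_rate_hz, bits_per_sample, duration_s):
    """Number of raw bits produced while one word is being spoken."""
    return int(sample_rate_hz * bits_per_sample * duration_s)


if __name__ == "__main__":
    total = bits_per_word(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, WORD_DURATION_S)
    print(total)  # 104000, i.e. roughly 100 000 bits
```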
Furthermore, the computational load on the system increases heavily when the number of words increases. This is due to a number of different reasons. First, the system has to search a greater number of words when trying to determine which word or command has been spoken. Also, when the number of words/commands increases, the risk that a given command has characteristics which resemble those of another command increases. In order to avoid a faulty decision, the system then has to extract more features from the different words in order to make a correct decision with the required probability. Finally, the possibility that the system interprets a non-existing command word as a command increases if the number of words increases, i.e. the performance of the out of vocabulary rejection (OVR) function decreases.
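The dependence of search cost and rejection behaviour on vocabulary size can be seen in a deliberately simplified sketch. It assumes that each word is represented by a single fixed-length feature vector and that matching is done by Euclidean distance with a rejection threshold; these are illustrative assumptions, not the method of any system discussed here, and the names `recognise`, `TEMPLATES` and `reject_threshold` are hypothetical.

```python
import math

# Hypothetical stored templates: one feature vector per trained word.
TEMPLATES = {
    "go": (1.0, 0.0),
    "stop": (0.0, 1.0),
    "left": (1.0, 1.0),
}


def recognise(features, templates, reject_threshold):
    """Return the closest vocabulary word, or None if nothing is close enough."""
    best_word, best_dist = None, math.inf
    # Every stored word is compared, so the work grows with the vocabulary size.
    for word, template in templates.items():
        dist = math.dist(features, template)
        if dist < best_dist:
            best_word, best_dist = word, dist
    # Out of vocabulary rejection: refuse to answer when the best match is poor.
    if best_dist > reject_threshold:
        return None
    return best_word


if __name__ == "__main__":
    print(recognise((0.9, 0.1), TEMPLATES, 0.5))  # close to the "go" template
    print(recognise((5.0, 5.0), TEMPLATES, 0.5))  # far from every template
```

In this toy setting, adding more templates both lengthens the loop and places stored words closer to one another, so a fixed threshold accepts more out-of-vocabulary inputs unless more features are used.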
In a system designed to operate under difficult conditions, such as a mobile telephone comprising a voice controlled dialling (VCD) system, i.e. having means for receiving commands orally, which may be used in a car, the accuracy of existing speech recognition systems is in most cases too low.
A system using speech recognition for entering commands is described in U.S. Pat. No. 5,386,494. The system described therein displays a number of different icons on a screen. By selecting a certain icon, a user can limit the possible commands to those associated with the selected icon shown on the screen. However, it is difficult to use such a system in a mobile telephone, which usually lacks a suitable graphical display.
Also, U.S. Pat. No. 5,515,475 describes a speech recognition system designed to build word models starting from phonemes or allophones.