Automatic voice recognition has been in practical use for some time and serves to convert spoken language into written text by machine.
Voice recognition systems can be divided into the following two categories according to the spatial and temporal relationship between voice recording and voice processing:
    “Online recognizers” are voice recognition systems that convert spoken comments directly into written text. This category includes most office dictation machines; and
    “Offline recognition systems” perform time-delayed voice recognition on a recording, for example a dictation made by the user with a digital recording device.
The state-of-the-art voice processing systems known to date are not able to understand language contents, i.e., unlike human language comprehension, they cannot establish intelligent a priori hypotheses about what was said. Instead, the acoustic recognition process is supported by text-specific or application-specific hypotheses. The following hypotheses or recognition modes have been widely used to date:
    Dictation and/or vocabulary recognition links domain-specific word statistics with a vocabulary. Dictation and/or vocabulary recognition is used in office dictation systems;
    Grammar recognition is based on a system of rules designed for a specific application and integrates expected sentence construction plans with the use of variables; and
    Single word recognition and/or keyword spotting is used when voice data to support recognition are lacking and when particular key words are anticipated within longer voice passages.
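The difference between grammar recognition and keyword spotting can be illustrated with a minimal sketch. It rests on a strong simplifying assumption: it operates on text that has already been transcribed, whereas the recognition modes described above constrain the acoustic recognition process itself. The rule `GRAMMAR_RULE`, the keyword set `KEYWORDS`, and the function names are hypothetical and chosen for illustration only.

```python
import re
from typing import Dict, List, Optional

# Hypothetical application-specific rule: one expected sentence
# construction plan with two variables (origin and destination).
GRAMMAR_RULE = re.compile(
    r"^i would like to book a flight from (?P<origin>\w+) to (?P<destination>\w+)$"
)

# Hypothetical key words anticipated within a longer voice passage.
KEYWORDS = {"invoice", "cancel", "complaint"}


def match_grammar(utterance: str) -> Optional[Dict[str, str]]:
    """Return the variable values if the utterance fits the sentence plan."""
    match = GRAMMAR_RULE.match(utterance.lower().strip())
    return match.groupdict() if match else None


def spot_keywords(utterance: str) -> List[str]:
    """Return the anticipated key words in their order of appearance."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    return [token for token in tokens if token in KEYWORDS]
```

In this sketch, grammar recognition succeeds only when the whole utterance follows the expected plan, while keyword spotting tolerates arbitrary surrounding words and reports only the anticipated terms.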
A voice recognition system for handling spoken information exchanged between a human party and an automated attendant system is known, for example, from the document “Spoken Language Systems—Beyond Prompt and Response” (BT Technol. J., Vol. 14, No. 1, January 1996). The document discloses a method and a system for interactive communication between a human party and an automated attendant system. The system has a voice recognition capability that converts a spoken comment into a single word, several words, or phrases. A meaning extraction step then attributes a meaning to the recognized word sequence, and the automated attendant system forwards the call to a next step based on that meaning. Additional information on a recognized word can be obtained by means of a database search. Based on the recognized and determined information, a response is generated, transformed into spoken language by a voice synthesizer, and forwarded to the human party. If the human party communicates with the automated attendant system through a multi-modal system (e.g., an Internet personal computer with voice connection), the information determined by the automated attendant system can be presented visually on the screen and/or acoustically through the loudspeakers of the personal computer and/or headsets. For further details, reference is made to the aforementioned document and the secondary literature cited therein.
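The sequence of steps described above (recognition, meaning extraction, database search, response generation, voice synthesis) can be sketched as a simple pipeline. This is not the implementation of the cited document; every name, table entry, and return value below is a hypothetical placeholder, and the recognizer and synthesizer are stand-ins that work on plain text rather than audio.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical mapping from recognized words to a meaning (next step).
MEANINGS = {"balance": "account_inquiry", "operator": "transfer_to_agent"}

# Hypothetical table standing in for the database search.
DATABASE = {"balance": "Your balance is available in the account overview."}


@dataclass
class Turn:
    words: List[str]
    meaning: str
    response: str


def recognize(audio: str) -> List[str]:
    """Stand-in for voice recognition: the 'audio' is already text here."""
    return audio.lower().split()


def extract_meaning(words: List[str]) -> Tuple[str, Optional[str]]:
    """Attribute a meaning to the first recognized word that has one."""
    for word in words:
        if word in MEANINGS:
            return MEANINGS[word], word
    return "unknown", None


def generate_response(meaning: str, keyword: Optional[str]) -> str:
    """Build a reply, using the database entry for the word if one exists."""
    if meaning == "unknown" or keyword is None:
        return "Sorry, I did not understand. Please repeat."
    return DATABASE.get(keyword, "One moment, please.")


def synthesize(text: str) -> str:
    """Stand-in for the voice synthesizer: wraps the text in a marker."""
    return f"<speech>{text}</speech>"


def handle(audio: str) -> Turn:
    """Run one exchange through all steps in order."""
    words = recognize(audio)
    meaning, keyword = extract_meaning(words)
    return Turn(words, meaning, synthesize(generate_response(meaning, keyword)))
```

Calling `handle("What is my balance")` in this sketch yields the meaning `account_inquiry` and a synthesized reply taken from the lookup table; an utterance with no known word falls through to the request to repeat.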
Despite this high degree of automation, such voice recognition systems remain problematic, especially with respect to the recognition of the voice information. Because pronunciation differs from person to person, recognition is unreliable unless the system has been adjusted to the specific pronunciation of a person during a learning phase. Automated attendant systems in particular, where one party requests or provides information, are not yet practicable because of the high error rate of the voice recognition process and the varied reactions of the individual parties. Thus, many applications still require a second party rather than an automated attendant system to take the information provided by the first party or to give out information. If the second party receives information, the information, regardless of form, usually must be recorded, written down, or entered into a computer.
Furthermore, it is often necessary to follow up on such calls, for example, to reconstruct what was said by whom and in what context in the case of sales talks or contract negotiations. A follow-up from memory or from scribbled notes is often incomplete, and the timeline is difficult to reconstruct. Although recordings on voice recorders are possible, they are difficult to integrate into the current data processing landscape. Further, digital recordings of the acoustic data require a comparatively large memory capacity.
These procedures not only require a high personnel effort, but they are also time-consuming, thus making the call throughput as well as the follow-up less than optimal.
Another problem arises when a great number of calls are made and, if they are stored in any form, have to be found quickly and easily. Easy access to the call data is desired in particular for statistical purposes, for example.
In addition, it would be advantageous if it were possible to identify a party automatically.