The present invention relates to the processing, storage, transmission, and playback of spoken information.
A speech recognition engine is a computer program that converts a digital audio input signal into recognized speech in a text or equivalent form. Speech recognition is also referred to as automatic speech recognition (ASR). In general, speech recognition engines analyze a digitized audio input signal, generally by characterizing the frequency spectrum of the incoming signal; recognize phonemes in the characterized input signal; recognize words or groups of words, generally using a vocabulary, a grammar, or both; and return results to a calling application. Some engines provide interim results during the recognition process and provide, in addition to a best estimate, alternative estimates of what the speaker said.
A speech synthesis engine (a speech synthesizer) is a computer program that converts text into a digital audio signal that can be played to produce spoken language. Speech synthesis is also referred to as text-to-speech (TTS) conversion. TTS conversion involves structure analysis, determining where paragraphs, sentences and other structures start and end; analysis of special forms, such as abbreviations, acronyms, dates, times, numbers, currency, Web sites and so on; conversion of text to phonemes; analysis of prosody, such as pitch, rhythm, emphasis, and so on; and production of an audio waveform.
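The first two TTS stages named above, structure analysis and analysis of special forms, can be illustrated with a minimal sketch. It is not taken from any particular engine: the function names, the two-entry abbreviation table, and the restriction to numbers below 100 are all assumptions made for illustration. The sketch expands special forms before splitting sentences so that the period in an abbreviation is not mistaken for the end of a sentence.

```python
import re

# Hypothetical abbreviation table; a real engine would use a far larger lexicon.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99 (the sketch handles no other range)."""
    if not 0 <= n < 100:
        raise ValueError("sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else "-" + ONES[ones])


def expand_special_forms(text):
    """Replace known abbreviations and one- or two-digit numbers with words."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = re.sub(r"\b" + re.escape(abbreviation), expansion, text)
    return re.sub(r"\b\d{1,2}\b",
                  lambda match: number_to_words(int(match.group())), text)


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def normalize(text):
    """Expand special forms first, then find sentence boundaries."""
    return split_sentences(expand_special_forms(text))


print(normalize("Dr. Smith has 42 cats. He lives on Main St. now."))
```

The later stages (conversion to phonemes, prosody analysis, and waveform production) depend on engine-specific models and are outside the scope of this sketch.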
Application programs communicate with speech recognition and speech synthesis engines through each engine's application program interface (API). A number of standard application program interfaces exist. These include the Java Speech API, the Speech Recognition API (SRAPI), Microsoft Corporation's Speech API (SAPI) and IBM Corporation's Speech Manager API (SMAPI). The Java Speech API, for example, with the Java Speech API Markup Language (JSML), provides the ability to mark the start and end of paragraphs and sentences for a synthesis engine; to specify pronunciations for any word, acronym, abbreviation or other special text representation; and to explicitly control pauses, boundaries, emphasis, pitch, speaking rate and loudness to improve the output prosody.
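The kind of markup described above, with explicit paragraph and sentence boundaries and an explicit pause, might be generated along the following lines. This is a sketch only: the element and attribute names (`jsml`, `div`, `type`, `break`, `msecs`) are modeled loosely on JSML and have not been checked against the JSML specification, and the `build_markup` function is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_markup(sentences, pause_ms=300):
    """Wrap sentences in one paragraph, with a pause between sentences.

    Element and attribute names are illustrative assumptions, not a
    verified rendering of the JSML element set.
    """
    root = ET.Element("jsml")
    paragraph = ET.SubElement(root, "div", {"type": "paragraph"})
    for index, sentence in enumerate(sentences):
        if index > 0:
            # Explicit pause between consecutive sentences.
            ET.SubElement(paragraph, "break", {"msecs": str(pause_ms)})
        element = ET.SubElement(paragraph, "div", {"type": "sentence"})
        element.text = sentence
    return ET.tostring(root, encoding="unicode")


print(build_markup(["The meeting starts at noon.", "Please be on time."]))
```

Building the document through an XML library rather than by string concatenation keeps the output well-formed and escapes characters such as `&` and `<` in the sentence text.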
Speech recognition and speech synthesis engines are available from a number of vendors. These include AT&T, which offers WATSON; IBM Corporation, which offers ViaVoice; and Lernout & Hauspie Speech Products, which offers ASR and TTS Development Tools. Speech recognition engines, including engines that perform recognition using hidden Markov models and neural networks, can generally provide a reliability measure for recognized words through a low-level interface. The reliability measure indicates the level of confidence the engine has that the associated word was recognized correctly. In a higher-level interface, a recognition threshold parameter can be used to set a normalized value that determines whether an engine reports success or failure in recognition.
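The relationship between per-word reliability measures and a recognition threshold can be sketched as follows. The `RecognizedWord` type, the 0-to-1 confidence scale, and the rule that every word must meet the threshold are assumptions made for illustration; actual engines define their own scales and acceptance rules.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    text: str
    confidence: float  # assumed to be normalized to the range 0.0 to 1.0


def recognition_succeeded(words, threshold=0.5):
    """Report success only if at least one word was recognized and every
    word meets the threshold (an assumed acceptance rule)."""
    return bool(words) and all(w.confidence >= threshold for w in words)


def low_confidence_words(words, threshold=0.5):
    """Return the text of each word whose confidence is below the threshold."""
    return [w.text for w in words if w.confidence < threshold]


result = [RecognizedWord("call", 0.9), RecognizedWord("home", 0.4)]
print(recognition_succeeded(result, threshold=0.5))
print(low_confidence_words(result, threshold=0.5))
```

A low-level interface corresponds to inspecting each `confidence` value directly, while a higher-level interface corresponds to supplying only the `threshold` and receiving a success or failure result.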