1. Technical Field
The present invention relates in general to data processing systems, and in particular to a method and system for an enhanced speech recognition environment on a data processing system. Still more particularly, the present invention relates to a method and system for providing voice dynamics in a speech-to-text application within a speech recognition environment on a data processing system.
2. Description of the Related Art
Human speech recognition technology has existed for several years, is well known in the art, and is commercially available. Speech analysis and speech recognition algorithms, machines, and devices are becoming more and more common. Such systems have become increasingly powerful and less expensive. Those familiar with the technology are aware that various applications exist which recognize human speech and store it in various forms on a data processing system. One extension of this technology is the speech-to-text application, which provides a textual representation of human speech on a data processing system. Speech recognition software is utilized every day by hundreds of thousands of people.
Speech-to-text applications have evolved as one of the ultimate goals of speech recognition technology. Many current applications utilize this technology to convert spoken language into text form which is then made accessible to a user of the data processing system.
In recent years, an explosion in the utilization of voice recognition systems has occurred. One goal of voice recognition systems is to provide a more humanistic interface for operating a data processing system. Voice recognition systems are typically utilized with other input/output (I/O) devices, such as a mouse, keyboard, or printer, which supplement the I/O processes of the voice recognition system.
Some common examples of the implementation of voice recognition technology are Dragon™ (a product of COREL) and ViaVoice™ and IBM Voicetype™, both products of International Business Machines Corporation (IBM).
ViaVoice Executive Edition is IBM's most powerful continuous speech software. ViaVoice Executive offers direct dictation into most popular Windows applications, voice navigation of your desktop and applications and the use of intuitive "natural language commands" for editing and formatting Microsoft Word documents.
In order for voice recognition to be useful to a user of a data processing system, various means of outputting the human speech signal to a user interface are required. This aspect of human speech recognition is developing quickly and is well known in the art.
Standard Generalized Markup Language (SGML) has been developed to provide additional information when outputting text, giving a recipient a more detailed output. The Java Speech Markup Language (JSML) was developed specifically for marking up text that will be spoken on devices incorporating the Java Speech API (Java is a trademark of Sun Microsystems, Inc.).
The Java Speech Markup Language is utilized by applications to annotate text input to Java Speech Application Programming Interface (JSAPI) speech synthesizers. The JSML elements provide a speech synthesizer equipped with the JSAPI with detailed information on how to say the text. JSML includes elements that describe the structure of a document, provide pronunciations of words and phrases, and place markers in the text. JSML also provides prosodic elements that control phrasing, emphasis, pitch, and speaking rate, which improve the quality and naturalness of the synthesized voice. JSML utilizes the Unicode character set, so JSML can be utilized to mark up text in most languages.
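The kinds of elements described above (document structure, emphasis, phrasing, prosody, and markers) can be illustrated with a small marked-up fragment. The following is a minimal sketch in Python that builds such a fragment with the standard library XML tools. The element and attribute names used here (jsml, div, emphasis, break, prosody, marker, and their attributes) are assumptions modeled on the categories named above and should be checked against the JSML specification; the function name build_jsml_example and the sample sentence are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_jsml_example() -> str:
    """Build a small JSML-style document and return it as a string.

    Element and attribute names are illustrative assumptions, not a
    verified rendering of any particular JSML version.
    """
    # Structural elements: a root and one paragraph-level division.
    root = ET.Element("jsml")
    para = ET.SubElement(root, "div", {"type": "paragraph"})
    para.text = "The meeting starts at "

    # Emphasis on a single word.
    emp = ET.SubElement(para, "emphasis")
    emp.text = "nine"
    emp.tail = " sharp. "

    # Phrasing: an explicit pause between the two sentences.
    ET.SubElement(para, "break", {"size": "medium"})

    # Prosody: speak the second sentence more slowly and more loudly.
    pros = ET.SubElement(para, "prosody", {"rate": "-10%", "volume": "+20%"})
    pros.text = "Please do not be late."

    # Marker: a named position the application can be notified about.
    ET.SubElement(para, "marker", {"mark": "end_of_notice"})

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_jsml_example())
```

The plain text remains fully recoverable from the marked-up form; the markup only adds instructions on how the text is to be spoken.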
The current market consists of different forms of voice recognition. These different forms are: Speaker Dependent, Speaker Independent, Command & Control, Discrete Speech Input, Continuous Speech Input and Natural Speech Input.
Natural Speech Input is the ultimate goal in voice recognition technology: a user should be able to talk to the computer in no specific manner and have the computer understand what the user wants and then apply the commands or words. One aspect of natural speech input is the ability to capture speaker voice dynamics to convey additional meaning to the text created. Currently, no application exists which can capture speech dynamics and convert them to a text document representing the spoken text.
As voice recognition technology evolves, there will be a need to facilitate the retention of subtleties often lost in the process. Much of a verbal message's value is in the tone, emphasis, inflection, volume, etc., which is mostly or entirely lost today. If all or part of this information content could be captured and passed along with the text message created through speech-to-text software, the information content delivered to the recipient would be greatly enhanced.
Further, although speech capture is well known, no current method or application bridges the gap between speech recognition and speech-to-text technology, on the one hand, and the creation of marked-up text that exhibits speech dynamics such as volume, pitch, range, and rate, on the other. Currently, most Extensible Markup Language (XML) is prepared by hand, without JSAPI-specific editors.
It would therefore be desirable to have a method and system for enhanced recognition of speech, including recognition of its dynamics such as volume, pitch and tone. It would further be desirable to allow the real-time representation of such voice dynamics with speech in its textual form. It would further be desirable if such captured voice dynamics were capable of being transmitted along with the text representation to an audible output as a marked-up document.
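To make the desired output more concrete, the following minimal sketch shows one way measured voice dynamics might be attached to recognized text as markup. It is an illustration only and is not presented as the method of the invention. Everything in it is an assumption: the VoiceDynamics fields, the idea of comparing each phrase against a per-speaker baseline, the 10 percent threshold, and the prosody element with its volume, pitch, and rate attributes.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class VoiceDynamics:
    """Hypothetical measurements for a speaker baseline or a single phrase."""
    volume: float             # relative loudness on an arbitrary linear scale
    pitch_hz: float           # average fundamental frequency in hertz
    words_per_minute: float   # speaking rate


def phrase_to_markup(text: str, measured: VoiceDynamics,
                     baseline: VoiceDynamics, threshold: int = 10) -> str:
    """Wrap text in a prosody-style element describing how it was spoken.

    Each dynamic is expressed as a percent change from the baseline, and
    only changes of at least `threshold` percent are recorded.
    """
    attrs = []
    for name, value, base in (
        ("volume", measured.volume, baseline.volume),
        ("pitch", measured.pitch_hz, baseline.pitch_hz),
        ("rate", measured.words_per_minute, baseline.words_per_minute),
    ):
        change = round((value - base) / base * 100)
        if abs(change) >= threshold:
            attrs.append(f'{name}="{change:+d}%"')

    safe_text = escape(text)  # protect &, <, > inside the marked-up output
    if not attrs:
        return safe_text      # spoken at baseline: no markup needed
    return f"<prosody {' '.join(attrs)}>{safe_text}</prosody>"


if __name__ == "__main__":
    baseline = VoiceDynamics(volume=0.5, pitch_hz=120.0, words_per_minute=150.0)
    loud_and_slow = VoiceDynamics(volume=0.75, pitch_hz=126.0, words_per_minute=120.0)
    print(phrase_to_markup("Stop & listen", loud_and_slow, baseline))
```

In this sketch a phrase spoken at the speaker's usual volume, pitch, and rate is emitted as plain text, so the markup appears only where the dynamics carry additional meaning.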