A typical way of transforming speech into text is to dictate a document, which is temporarily recorded by a recording apparatus such as a tape recorder. A secretary, typist, or the like then plays back the dictated contents and transcribes them using a documentation apparatus such as a typewriter or word processor.
With recent breakthroughs in speech recognition technology and improvements in the performance of personal computers, a technology has been developed in which voice input through a microphone connected to a personal computer is recognized by application software running on the personal computer, converted into a document, and displayed. However, it is difficult for a speech recognition system to carry out practical processing on an existing computer, especially a personal computer, because the data size of the language models becomes enormous.
Inconveniently, such an approach necessitates either training a computer to respond to a single user, whose voice profile is learned through that training, or restricting the system to a very small recognisable vocabulary. For example, trained systems are excellent for speech recognition applications, but they fail when another user dictates or when the trained user has a cold or a sore throat. Further, the process takes time and occupies a large amount of disk space, since it relies on dictionaries of words and on spelling and grammar checking to form accurate sentences from dictated speech.
Approaches to speech synthesis rely on text provided in the form of recognisable words. These words are then converted into a known pronunciation, either through the application of rules or through a pronunciation dictionary. For example, one approach to human speech synthesis is known as concatenative synthesis. Concatenative synthesis of human speech is based on recording waveform data samples of real human speech for predetermined text. The pre-recorded human speech is then broken down into segments, and speech utterances are generated by linking these segments to build syllables, words, or phrases. Various approaches to segmenting the recorded human voice have been used in concatenative speech synthesis. One approach is to break the real human voice down into basic units of contrastive sound, commonly known as phones or phonemes.
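By way of illustration only, the dictionary-based conversion and segment linking described above might be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular system: the names `PRONUNCIATIONS`, `UNIT_INVENTORY`, `to_phonemes`, and `synthesize` are hypothetical, the phoneme labels are informal, and short sine tones stand in for recorded segments of real human speech.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for illustration)

# Hypothetical pronunciation dictionary: word -> sequence of phoneme labels.
PRONUNCIATIONS = {
    "cat": ["k", "ae", "t"],
    "tack": ["t", "ae", "k"],
}


def placeholder_segment(freq_hz, duration_s=0.05):
    """Return a short sine tone as a stand-in for a recorded speech segment."""
    n = round(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit inventory: phoneme label -> waveform samples.
# A real system would hold segments cut from recorded human speech.
UNIT_INVENTORY = {
    "k": placeholder_segment(300),
    "ae": placeholder_segment(600),
    "t": placeholder_segment(900),
}


def to_phonemes(text):
    """Convert text to phoneme labels by dictionary lookup."""
    phonemes = []
    for word in text.lower().split():
        if word not in PRONUNCIATIONS:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


def synthesize(text):
    """Link the stored segment for each phoneme into one waveform."""
    samples = []
    for phoneme in to_phonemes(text):
        samples.extend(UNIT_INVENTORY[phoneme])
    return samples
```

In this sketch, `synthesize("cat tack")` simply joins six stored segments end to end. Practical concatenative systems typically also smooth the joins between segments and select among several recorded variants of each unit, which is omitted here.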
Because of the way speech-to-text and text-to-speech systems are designed, they function adequately with each other and with text-based processes. Unfortunately, such a design renders both systems cumbersome and overly complex. A simpler speech-to-text and text-to-speech implementation would be highly advantageous.
It would also be advantageous to provide a system that requires reduced bandwidth to support voice communication.