This invention relates generally to sound reproduction and speech synthesis on a data processing system. More particularly, it relates to a method, program and system for speech synthesis in which spatial information is added to a synthesized voice.
While the visual images presented by personal computers compatible with those built by the IBM Corporation have undergone continual improvement, the typical speaker system of such a computer remains a single, inexpensive speaker buried somewhere in the system unit. The sound emanating from the speaker is of poor quality, being unidirectional, fuzzy and difficult to discern. The personal computer has been regarded as an important agent of change in many areas of society, including education. Nonetheless, repetitive tasks such as language drills, which are not regarded with universal enthusiasm by students even in the best of classroom situations, become even less appealing in the acoustically impoverished environment generated by a typical computer.
Yet high quality sound reproduction for a personal computer has only recently been regarded as particularly important with the advent of multimedia. Although not yet equal to even inexpensive stereo systems, some multimedia computer systems use two external speakers for two channel "stereo" sound. While stereo sound will help add excitement and intelligibility to multimedia applications, further improvements in sound quality from the personal computer and its application programming are necessary to exploit the full potential of multimedia.
The stereo art teaches some lessons which have application to generating high quality sound from a computer. Indeed, many multimedia applications store conventionally recorded audio such as a sound track on a tape or CD. This is not surprising, as considerable effort has already been devoted to stereo and there is little need to reinvent the wheel. Researchers have been steadily refining stereo technology since the 1930s, when Alan Blumlein in U.S. Pat. No. 2,093,540 taught the basic precepts upon which much of the audio art is built. Despite the vast body of improvements to the stereophonic art, it remains true that a conventional recording does not faithfully reproduce the spatial sound field of the original sound space and tends to produce a less satisfying listening experience than a live performance.
An appropriately programmed computer differs in many important respects from, and possesses many capabilities beyond those of, even the most elaborate stereo systems. One of the more important differences is that the user's interaction with a computer is much greater than with a stereo system. Thus, the actions taken by the computer will tend to vary much more depending upon the actions of the user. It is difficult to anticipate all the actions which a user might take and to record all of the appropriate responses, although some of the interactive CD technologies appear to be taking this route. Further, unless a user has access to sophisticated sound recording equipment, he will be unable to modify the stored program to include audio at the same fidelity as the original.
Speech synthesis, or text-to-speech programming, is well known. It can provide a flexible means of entering new information into a program, as a user merely needs to type alphanumeric text via the system keyboard. In addition, storage of the alphanumeric information requires much less storage than the audio waveform of conventional stereo technology. To date, however, speech synthesis has not been entirely acceptable in terms of the audio quality generated and, because of this poor quality, is not generally regarded as suitable for inclusion in a multimedia presentation. Whatever the shortcomings of conventional audio with regard to the accuracy with which directionality and spatial information are reproduced, synthesized speech has no spatial attributes and is especially dull and lifeless. The poor sound generated by present day speech synthesis is almost antithetical to a multimedia presentation. Thus, improvements in speech synthesis are necessary before it can be truly integrated with multimedia.
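The storage advantage of text over a recorded waveform can be made concrete with a rough calculation. The following is only an illustrative sketch: the speaking rate, the average word length and the CD-style audio format (44.1 kHz, 16-bit, two channels) are assumptions chosen for the example and are not figures taken from this specification.

```python
# Rough comparison of the storage needed for one minute of speech held as
# alphanumeric text versus as an uncompressed stereo waveform.
# All constants below are illustrative assumptions, not values from the source.

WORDS_PER_MINUTE = 150    # assumed speaking rate
CHARS_PER_WORD = 6        # assumed average word length, including a space
SAMPLE_RATE_HZ = 44_100   # assumed CD-style sampling rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit PCM samples
CHANNELS = 2              # assumed two-channel stereo


def text_bytes(minutes: float) -> int:
    """Bytes needed to store the spoken text, at one byte per character."""
    return round(minutes * WORDS_PER_MINUTE * CHARS_PER_WORD)


def pcm_bytes(minutes: float) -> int:
    """Bytes needed to store the same duration as uncompressed PCM audio."""
    seconds = minutes * 60
    return round(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


if __name__ == "__main__":
    minutes = 1.0
    t = text_bytes(minutes)
    a = pcm_bytes(minutes)
    print(f"text: {t} bytes, audio: {a} bytes, ratio: {a / t:.0f}x")
```

Under these assumed figures, one minute of speech occupies roughly nine hundred bytes as text and roughly ten megabytes as a waveform, a difference of about four orders of magnitude. Different assumptions would change the exact ratio but not the general conclusion.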
The present invention provides one such improvement: a means for producing a more exciting multimedia application using synthesized voices, in which each of a plurality of voices appears to originate from a different location in three dimensional space.