This invention relates generally to speech and more particularly to speech recognition, compression, and transmission.
It has long been recognized that analog speech signals contain numerous redundant sounds, which makes such signals unsuitable for efficient data transmission. In direct human interaction this inefficiency is tolerable. In machine transmission and storage, however, the technical requirements for coping with inefficient speech become infeasible because of the cost, the time, and the additional memory storage that the inefficiency makes necessary.
A need exists for a system which can take an analog speech signal and translate it into a digital form which is reconstructable after transmission or storage. This type of device is generally referred to as a "vocoder".
A vocoder was discussed by Richard Schwartz et al. in their paper entitled "A Preliminary Design of a Phonetic Vocoder Based on a Diphone Model" published in the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 80) proceedings of Apr. 9, 10, 11, 1980 in Denver, Colo. (ICASSP 80 vol. 1, pg. 32-35). The diphone model of Schwartz et al. entails a phonetic vocoder operating at 100 b/s. For each phoneme of the speech, the vocoder generates a duration and a single pitch value. An inventory of diphone templates is used to synthesize the phoneme string. Additionally, the diphone templates are used to establish initially which phonemes are being transmitted in the analog speech. A diphone extends from the middle of one phoneme to the middle of the next phoneme. Because of the structure of a diphone and the manner in which diphones must be strung together, the diphone is highly cumbersome in use and is generally ineffective in speech synthesis.
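By way of illustration only, the relationship between a phoneme string and its diphone sequence may be sketched as follows. The sketch is not taken from Schwartz et al.; the function name `phonemes_to_diphones`, the boundary marker `"#"` standing for silence, and the phoneme labels are hypothetical and are assumed here solely to show that each diphone spans the transition between two adjacent phonemes.

```python
def phonemes_to_diphones(phonemes, boundary="#"):
    """Pair each phoneme with its successor.

    Each pair stands for one diphone, i.e. the stretch from the middle of
    one phoneme to the middle of the next. The boundary marker (assumed
    here to mean silence) pads both ends of the utterance.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [(padded[i], padded[i + 1]) for i in range(len(padded) - 1)]


if __name__ == "__main__":
    # Hypothetical phoneme labels for a three-phoneme utterance.
    for left, right in phonemes_to_diphones(["k", "ae", "t"]):
        print(f"{left}-{right}")
```

Under these assumptions a string of N phonemes requires N + 1 diphones, and the template inventory must cover every phoneme pair that can occur, which suggests why the approach is cumbersome.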
Diphone synthesis requires the use of an elaborate acoustic-to-phonetic rule algorithm so as to create intelligible speech. This extensive acoustic-to-phonetic rule algorithm requires a great deal of time and hardware to be effective.
Intrinsic to the recognition of analog speech is the use of a methodology which breaks the analog speech into its component parts, which may then be compared to a library for identification. Numerous methods and apparatuses have evolved to approximate and model human speech. These modeling techniques include the voder, linear predictive filters, and other devices.
One such method of analyzing analog speech was discussed by James L. Flanagan in the article "Automatic Extraction of Formant Frequencies from Continuous Speech" first printed in J. Acoust. Soc. Am., Vol. 28, pp. 110-118, January 1956, incorporated hereinto by reference.
In the article, Flanagan discusses two electronic devices which automatically extract the first three formant frequencies from continuous speech. These devices yield continuous DC output voltages whose magnitudes as functions of time represent the formant frequencies of the speech. Although the formant frequencies are in an analog form, use of an analog-to-digital (A/D) converter readily transforms these formant frequencies into digital form, which is more suitable for use in an electronic environment.
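By way of illustration only, the conversion of such an analog formant voltage into a digital code may be sketched as a uniform quantizer. The voltage range of 0 to 5 volts, the 8-bit resolution, and the function name `quantize` are assumptions made for this sketch; none of them is taken from the Flanagan article.

```python
def quantize(voltage, v_min=0.0, v_max=5.0, bits=8):
    """Map an analog voltage to an integer code, as an idealized A/D converter would.

    The input range and bit depth are assumed values. Voltages outside the
    range are clamped to the nearest end of the range.
    """
    levels = (1 << bits) - 1                      # highest code, 255 for 8 bits
    clamped = min(max(voltage, v_min), v_max)     # keep the input within range
    fraction = (clamped - v_min) / (v_max - v_min)
    return int(fraction * levels + 0.5)           # round to the nearest code


if __name__ == "__main__":
    # Hypothetical samples of a DC voltage tracking one formant over time.
    samples = [0.0, 1.2, 2.4, 3.1, 5.0]
    print([quantize(v) for v in samples])
```

Each formant track would be sampled at regular intervals and passed through such a converter, yielding one integer code per formant per sample.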
Another method was discussed by H. K. Dunn in his article "Methods of Measuring Vowel Formant Bandwidths," J. Acoust. Soc. Am., Vol. 33, pp. 1737-1746, December 1961, incorporated hereinto by reference. In the article, Dunn discloses the use of spectra of real speech and the use of an artificial larynx applied to real subjects.
It is clear, therefore, that an efficient methodology and apparatus for transforming an analog speech signal into an approximating digital form does not exist. The mere recognition of formants, or the use of diphones in the synthesis of the perceived speech, is inaccurate and does not allow for quality recordation and transmission of a data representation of the original speech signal.