1. Field of the Invention
The present invention relates to a sound synthesizing apparatus. More specifically, the present invention relates to a sound synthesizing apparatus wherein a sound element is extracted from an analog sound waveform the time axis of which is compressed and a portion of the waveform of the sound element is subjected to expansion of the time axis, whereby a sound is synthesized that has substantially the same frequency component distribution but has the time duration which is different from the original time duration.
2. Description of the Prior Art
An exchange of information by means of a sound signal, i.e. a conversation, has an emotional characteristic that reduces the efficiency of information transmission. More specifically, a human being speaks at 110 to 170 words per minute at the most, although a listener is able to follow speech at two to three times the normal speaking speed. Therefore, it would be very convenient if sound information such as a human voice recorded on a magnetic tape by means of a tape recorder could be reproduced at such a higher speed while remaining comprehensible. If this could be achieved, the contents of a conference, lecture or the like lasting, say, one hour could be listened to within half an hour or less, other sound information such as a recorded curriculum could be retrieved at high speed, and other applications could be developed.
When a recorded sound is reproduced at a speed higher than the recording speed, i.e. on the occasion of high speed reproduction, the reproduction time period is shortened in inverse proportion to the reproduction speed, but the frequency of the reproduced sound increases in proportion to the reproduction speed. A change of the frequency of the reproduced sound that occurs at a higher reproduction speed is readily perceived by a listener. Nevertheless, the contents of the reproduced sound can be understood if the reproduction speed does not exceed about 1.5 times the normal speed. However, the contents of the reproduced sound can hardly be understood when the reproduction speed exceeds two times the normal speed.
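The proportionality described above can be illustrated with a short sketch. The function name and the example figures (a one hour recording, a 200 Hz component) are illustrative assumptions and do not come from the specification.

```python
def fast_playback(duration_s, frequency_hz, speed_ratio):
    """Duration and frequency heard when a recording is simply played faster.

    speed_ratio is the reproduction speed divided by the recording speed.
    """
    if speed_ratio <= 0:
        raise ValueError("speed_ratio must be positive")
    # Duration shrinks in inverse proportion to the speed;
    # every frequency component rises in direct proportion to it.
    return duration_s / speed_ratio, frequency_hz * speed_ratio


# Hypothetical example: a one hour recording containing a 200 Hz component,
# reproduced at twice the recording speed.
duration, frequency = fast_playback(3600.0, 200.0, 2.0)
print(duration, frequency)
```

The shortened duration is the desired effect, while the raised frequency is the distortion that the time axis conversion described below is intended to undo.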
In order to correct the distortion of the waveform caused by reproduction at an increased speed, it is necessary to restore the original waveform of the reproduced sound in terms of the time axis. To that end, a variety of research and development has been carried out in the past. One approach is to analyze the spectrum of the sound signal on a real time basis, perform frequency conversion in the Fourier domain, and then carry out an inverse synthesis. Although this approach yields a reproduced sound of good quality, it requires a large scale system, which is extremely expensive and hence of little practical use.
Apparatus employing a relatively simple electronic circuit for time axis conversion of the sound have been proposed and put into practical use. The principle of such sound time axis conversion is shown in FIG. 1. Referring to FIG. 1, an analog sound signal the time axis of which has been compressed is divided at very short time intervals into a succession of sound elements, a portion of each sound element is discarded, the remaining portion of each sound element is expanded in terms of the time axis, and the expanded remaining portions are then joined in the order in which they were sampled, whereby a reproduced sound of the same frequency as the original sound is obtained, with the contents of the reproduced sound condensed in terms of the time axis by the discarding of a portion of each sound element. Briefly described, this sound processing approach is equivalent to a process wherein a recorded magnetic tape is cut into pieces of a predetermined length and every second piece is picked up and spliced into one magnetic tape. Since the spliced magnetic tape is shorter than the original magnetic tape, reproduction of the spliced magnetic tape at a normal speed provides a reproduced sound without alteration of the frequency components of the sound, but within a time period shorter than that required for reproduction of the original magnetic tape at a normal speed by an amount corresponding to the length of the discarded tape portions. Fortunately, the fundamental syllables constituting human speech have much redundancy and a duration, say 160 ms on the average, sufficient to keep the speech comprehensible even if a portion of the sound is intermittently dropped.
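The tape splicing analogy above can be sketched in a few lines. This is only an illustration under stated assumptions: the "tape" is modelled as a list of samples, and the names splice_tape, piece_len and keep_every are hypothetical.

```python
def splice_tape(tape, piece_len, keep_every=2):
    """Cut `tape` into pieces of `piece_len` samples and keep one piece in
    every `keep_every`, joining the kept pieces in their original order."""
    if piece_len <= 0 or keep_every <= 0:
        raise ValueError("piece_len and keep_every must be positive")
    spliced = []
    for index, start in enumerate(range(0, len(tape), piece_len)):
        if index % keep_every == 0:      # keep this piece, discard the others
            spliced.extend(tape[start:start + piece_len])
    return spliced


# Eight samples cut into pieces of two; every second piece is kept.
print(splice_tape(list(range(8)), piece_len=2))   # [0, 1, 4, 5]
```

Played back at the normal speed, the spliced list lasts half as long as the original while each kept piece retains its original waveform, and therefore its original frequency content.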
Specific schemes for expanding a sound waveform whose time axis has been compressed through high speed reproduction, as shown in FIG. 1, will now be described.
One such approach is a digital memory system, which is fully described in Lee, F. F., "Time Compression and Expansion of Speech by the Sampling Method", Audio Engineering Society Preprint, presented at the AES 42nd Convention, May, 1972. Another such approach is an analog memory system, which is fully described in Iwamura and Ono, "Capacitor Memory Apparatus", Electronic Communication Society, Conference Text No. 817, September, 1969 and Koshigawa and Tanizoe, "TSC Functioned Cassette Tape Recorder", "Electric Wave Science", February, 1974. A further such approach is a variable delay system, which is fully described in Schiffman, M. M., "Playback Control Speeds or Slows Taped Speech without Distortion", Electronics, Vol. 47, No. 17, Aug. 22, 1974. Still another approach is an analog shift register switching system, which is fully described in U.S. Pat. No. 3,936,610, issued Feb. 3, 1976 to Murray M. Schiffman, Newton, Mass. and entitled "Dual Delay Line Storage Sound Signal Processor".
The present invention is directed to an improvement in such an analog shift register switching system. Therefore, the prior art analog shift register switching system, previously proposed, will be first described in detail in the following.
FIG. 2 is a block diagram showing an example of a sound synthesizing apparatus in accordance with a prior art analog shift register switching system that constitutes the background of the invention. Referring to FIG. 2, an input terminal 1 is connected to receive an analog sound signal obtained through high speed reproduction. The analog sound signal from the input terminal 1 is applied through analog switches 6 and 8 to analog shift registers 3 and 4, respectively, each comprising a bucket brigade device of N bits. The outputs of these analog shift registers 3 and 4 are withdrawn through analog switches 7 and 9, respectively, and further through a low pass filter 5 to an output terminal 2. The output terminal 2 provides a recovered analog sound signal obtained by time axis expansion and by synthesis, i.e. by joining the expanded sound elements, as will be more fully described subsequently. The analog switches 6 and 9 are controlled by the Q output of a frequency divider 11 and the analog switches 8 and 7 are controlled by the Q̄ output of the frequency divider 11, so that these analog switches are turned on and off in response to the outputs of the frequency divider 11. The frequency divider 11 is structured to divide the frequency of the clock pulses obtainable from a write clock generator 10 by the factor 1/mN, where m and N are integers, m being described subsequently, whereby the logic one is obtained alternately at the output Q and the output Q̄. The output of the write clock generator 10 and the Q̄ output of the frequency divider 11 are applied to an AND gate 12. The output of the write clock generator 10 and the Q output of the frequency divider 11 are applied to an AND gate 13. On the other hand, the clock pulse from a read clock generator 16 is applied to an AND gate 17, which is also connected to receive the Q̄ output of the frequency divider 11. The clock pulse from the read clock generator 16 is also applied to an AND gate 18, which is also connected to receive the Q output of the frequency divider 11. The outputs of the AND gates 12 and 18 are applied through an OR gate 14 to the analog shift register 4 as a write clock pulse and a read clock pulse, respectively. Similarly, the outputs of the AND gates 13 and 17 are applied through an OR gate 15 to the analog shift register 3 as a write clock pulse and a read clock pulse, respectively.
FIG. 3 is a timing chart for use in explaining the operation of the FIG. 2 system. Referring to FIG. 3, the operation of the FIG. 2 system will be described in the following. During a time period n where the Q̄ output of the frequency divider 11 assumes the logic one, the analog switches 8 and 7 are enabled. At that time, the write clock pulse having a frequency f1 obtainable from the write clock generator 10 is applied through the OR gate 14 to the analog shift register 4, while the read clock pulse having a frequency f2 obtainable from the read clock generator 16 is applied through the OR gate 15 to the analog shift register 3. Accordingly, the analog sound signal applied to the input terminal 1, the time axis of which has been compressed by the factor m, is successively loaded into the analog shift register 4 in response to the write clock pulses in the form of a train of mN samples. However, the analog shift register 4 has a capacity of only N bits. Therefore, the leading (mN-N) samples are shifted out from the output terminal of the analog shift register 4 during this period of time t1, and only the last N samples remain stored. Since the analog switch 9 connected to the output terminal of the analog shift register 4 is disabled at that time, the signal thus shifted out from the analog shift register 4 is blocked by the analog switch 9.
Then the state of the frequency divider 11 is reversed, whereby the Q output becomes the logic one during the following period n+1. During this period n+1, the analog switches 6 and 9 are enabled, while the analog switches 8 and 7 are disabled. As a result, the write clock pulse having the frequency f1 is applied through the OR gate 15 to the analog shift register 3, while the read clock pulse having the frequency f2 is applied through the OR gate 14 to the analog shift register 4. Accordingly, the N samples previously loaded in the analog shift register 4 are read out in succession through the analog switch 9 in response to the read clock pulses of the frequency f2. The analog shift register 3 operates in a reverse manner, such that a read operation is performed during the period n and a write operation is performed during the period n+1. The frequency f1 of the write clock pulse and the frequency f2 of the read clock pulse are selected to satisfy the following equation.
f1/f2 = m (1)
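The alternating write and read operation of the two analog shift registers can be imitated in software as a rough sketch. The model below makes several simplifying assumptions that are not part of the specification: samples are held in Python lists rather than in a bucket brigade device, a bounded deque stands in for each N-stage register, and one extra period is run at the end so that the last block written is also read out. All names are hypothetical.

```python
from collections import deque


def ping_pong_expand(compressed, n_stages, m):
    """Model two N-stage registers that alternately write mN input samples
    (clocked at f1) and read out the N retained samples (clocked at f2 = f1/m)."""
    block_len = m * n_stages
    # deque(maxlen=N) drops its oldest entry on overflow, like samples
    # shifted out of the far end of a full N-stage shift register.
    registers = (deque(maxlen=n_stages), deque(maxlen=n_stages))
    output = []
    n_periods = len(compressed) // block_len
    for period in range(n_periods + 1):          # +1: read out the final block
        write_reg = registers[period % 2]
        read_reg = registers[(period + 1) % 2]
        # Read side: the N samples stored during the previous period.
        output.extend(read_reg)
        read_reg.clear()
        # Write side: mN samples arrive, only the last N remain stored.
        if period < n_periods:
            start = period * block_len
            for sample in compressed[start:start + block_len]:
                write_reg.append(sample)
    return output


# Two blocks of mN = 8 samples with N = 4 and m = 2.
print(ping_pong_expand(list(range(16)), n_stages=4, m=2))
# [4, 5, 6, 7, 12, 13, 14, 15]
```

The output contains 1/m as many samples as the input; since it is clocked out at f2 = f1/m, it occupies the same time as the input, with each retained element stretched m times on the time axis.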
Thus, if the frequencies f1 and f2 of the clock pulses are determined as described above, the time axis of the output sound signal is expanded m times, and the compressed analog sound signal applied to the input terminal 1 is withdrawn from the output terminal 2 as a reproduced sound signal the time axis of which is restored to that of the original sound signal. Meanwhile, the frequency f2 of the read clock pulse should be determined to satisfy the sampling theorem with respect to the necessary output sound frequency band.
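As a worked illustration of equation (1) and of the sampling theorem constraint on f2, the following sketch derives the write clock, the switching period of the frequency divider and the highest reproducible frequency from a chosen register length and read clock. The numerical values (N = 512 stages, f2 = 16 kHz, m = 2) are assumptions chosen only for the arithmetic; they are not taken from the specification.

```python
def clock_plan(n_stages, read_clock_hz, m):
    """Return (f1, switching period in seconds, highest representable frequency)."""
    write_clock_hz = m * read_clock_hz            # equation (1): f1/f2 = m
    # The divider changes state every mN write clock pulses: mN/f1 = N/f2.
    switching_period_s = n_stages / read_clock_hz
    # Sampling theorem: the output band must lie below half the read clock.
    max_output_band_hz = read_clock_hz / 2
    return write_clock_hz, switching_period_s, max_output_band_hz


f1, period_s, band_hz = clock_plan(n_stages=512, read_clock_hz=16000.0, m=2)
print(f1, period_s, band_hz)
```

With these assumed values the write clock is 32 kHz, each switching period lasts 32 msec, and output components up to 8 kHz can be represented.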
The sound quality of the reproduced sound obtained from such a sound synthesizing apparatus should be good enough not only to allow the contents of the speech to be understood but also to sound natural. As a criterion of the accuracy with which linguistic contents are transmitted by a sound, the concept of articulation or intelligibility has been proposed and utilized. Articulation is the percentage of the fundamental constituent elements of a sound for linguistic representation, such as a monotone, syllable and the like, that are understood correctly by a listener in a communication system. The word "articulation" is customarily used when the contextual relationships among the units of speech material are thought to play an unimportant role. On the other hand, the word "intelligibility" is customarily used when the context is thought to play an important role in determining the listener's perception. Either of them is tested by the use of an articulation test table or an intelligibility test table adopted by the Acoustical Society of Japan or the International Telegraph and Telephone Consultative Committee (CCITT). It is required that the articulation or the intelligibility of high speed reproduction be 100 percent at the reproduction speed ratio most often used, say a ratio m of approximately 2. As far as the articulation or the intelligibility is concerned, any of the above described approaches provides a satisfactory result.
The naturalness, with respect to the original sound, of a synthesized sound obtained by joining short sound elements also depends on the length of each sound element and on the processing at the junctions. The length of the sound elements, i.e. the repetition period, shown in FIG. 1 was investigated by changing the length to various values and actually comparing the reproduced sounds, and the result shown in FIG. 4 was obtained. More specifically, FIG. 4 is a graph showing a relation between sound quality and repetition period, wherein the abscissa indicates the repetition period and the ordinate indicates the sound quality. The graph was obtained in the following manner. The voice of a male announcer was recorded on a magnetic tape and the sound was reproduced at a reproduction speed ratio of m=2. The reproduced sound was listened to by a plurality of persons and the quality of the sound as listened to was graded in five grades, E standing for excellent, G for good, F for fair, P for poor, and B for bad. The curve shown in FIG. 4 was plotted by allotting 4, 3, 2, 1 and 0 to the grades E, G, F, P and B, respectively, and taking the average. Generally, it is difficult to represent the naturalness or audibility of a sound in a quantitative manner, and at present such representation is an unsolved problem in acoustic phonetics. Thus, in most cases, such a psychometric approach based on subjective judgement has been employed for the sake of convenience. According to the data shown in FIG. 4, it can be concluded that a proper length of the sound element is 25 to 45 msec. If the repetition period becomes smaller than 25 msec, the number of junctions between adjacent sound elements appearing in the waveform increases, which degrades the sound quality. On the other hand, a sound is also constituted by a time transition of a frequency spectrum, as will be more fully described subsequently, and therefore an increase in the repetition period, i.e. the length of the sound element, increases the unnaturalness caused by the discontinuity at the junction of adjacent sound elements.
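The grading procedure used for FIG. 4 amounts to averaging numeric scores assigned to the five grades. A minimal sketch follows; the listener responses in the example are invented purely for illustration and are not data from FIG. 4.

```python
# Scores allotted to the grades as described above.
GRADE_SCORES = {"E": 4, "G": 3, "F": 2, "P": 1, "B": 0}


def average_grade(grades):
    """Average the scores of the grades given by a panel of listeners."""
    if not grades:
        raise ValueError("at least one grade is required")
    return sum(GRADE_SCORES[g] for g in grades) / len(grades)


# Hypothetical panel of four listeners for one repetition period.
print(average_grade(["E", "G", "G", "F"]))   # 3.0
```

Repeating the calculation for each repetition period tested gives one point per period on a curve of the kind shown in FIG. 4.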
The method of joining adjacent sound elements, i.e. of processing the junction between adjacent sound elements, considerably influences the quality of the sound obtained by this type of sound synthesizing apparatus. Firstly, a discontinuity of the waveform of the sound signal occurring at the junction of adjacent sound elements causes harmonic noise, which reduces the signal to noise ratio of the reproduced and synthesized sound, whereby articulation is degraded. Secondly, the auditory sensation of a human being is extremely sensitive to a variation of the pitch frequency, which is the fundamental frequency of the vocal cord vibration. Thus, when the pitch frequency components are discontinuous at the junctions, the sound is unnatural and disagreeable to hear, as if phlegm were obstructing the speaker's throat.
None of the above described approaches can essentially avoid the occurrence of the harmonics and the discontinuity of the pitch frequency components at the junctions. The harmonic noise caused by the discontinuity of the waveforms at the junctions between adjacent sound elements can be removed by filters to some extent. As described previously, the sampling repetition period is selected to be about 25 to 45 msec. Therefore, assuming that the sampling repetition period is selected to be 25 msec, the fundamental component of the noise caused by the repetition is about 40 Hz. Since a frequency spectrum above 100 Hz is sufficient for an ordinary sound, this fundamental component can be removed by using a high pass filter that cuts off such lower frequency components. Similarly, noise components of frequencies higher than the necessary sound frequency region can be removed by using a low pass filter of a proper frequency characteristic. Nevertheless, noise components occurring within the necessary sound frequency region cannot be removed by any conventional means. Moreover, no proper countermeasure has been provided against the discontinuity of the pitch frequency components.
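The arithmetic behind the 40 Hz figure, and the reason filtering alone is insufficient, can be sketched as follows. The 100 Hz lower edge is taken from the text above; the 300 Hz upper limit used in the example is an arbitrary assumption chosen only to keep the printed list short, and the function names are hypothetical.

```python
def repetition_fundamental_hz(period_ms):
    """Fundamental frequency of noise that recurs once per repetition period."""
    return 1000.0 / period_ms


def in_band_harmonics(period_ms, low_hz, high_hz):
    """Harmonics of the repetition frequency lying within [low_hz, high_hz],
    i.e. those that the high pass and low pass filters leave untouched."""
    fundamental = repetition_fundamental_hz(period_ms)
    harmonics = []
    k = 1
    while k * fundamental <= high_hz:
        if k * fundamental >= low_hz:
            harmonics.append(k * fundamental)
        k += 1
    return harmonics


print(repetition_fundamental_hz(25))        # 40.0
print(in_band_harmonics(25, 100.0, 300.0))  # [120.0, 160.0, 200.0, 240.0, 280.0]
```

The 40 Hz fundamental and its second harmonic fall below 100 Hz and can be filtered out, but the third and higher harmonics lie inside the band needed for the sound itself.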
Although a reproducing apparatus such as a tape recorder capable of high speed reproduction could find wide application and has therefore been eagerly awaited, such apparatus has not been widely used. It is not too much to say that the reason is that the naturalness of the synthesized sound is not yet sufficient, even if the contents of the reproduced sound signal are perceptible.