1. Field of the Invention
The present invention relates to control structures for computer-controlled sound synthesis.
2. State of the Art
The application of computers to sound synthesis has been studied and practiced for many years. Whereas the computer synthesis of simple sounds is straightforward, the problem of synthesizing complex, realistic sounds such as the human voice, the sound of a piano chord being played, a bird call, etc., has posed a continuing challenge.
One well-known technique of synthesizing complex sounds is that of additive synthesis. In conventional additive synthesis, a collection of sinusoidal partials is added together to produce a complex sound. Producing a complex, realistic sound may require adding together as many as 1000 sinusoidal partials. Each sinusoidal partial must be specified by at least frequency and amplitude, and possibly phase. Clearly, the computational challenge posed in producing complex, realistic sounds by additive synthesis is considerable.
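By way of illustration only, conventional additive synthesis as described above may be sketched as follows. The function name `additive_synth`, the sample rate, and the partial values are assumptions chosen for the example and do not appear in any cited reference.

```python
import math

def additive_synth(partials, sample_rate, num_samples):
    """Sum sinusoidal partials given as (frequency_hz, amplitude, phase_rad) tuples."""
    samples = []
    for n in range(num_samples):
        t = n / sample_rate  # time of sample n in seconds
        samples.append(sum(
            amp * math.sin(2.0 * math.pi * freq * t + phase)
            for freq, amp, phase in partials
        ))
    return samples

# Three harmonically related partials of a 220 Hz tone (illustrative values only).
partials = [(220.0, 1.0, 0.0), (440.0, 0.5, 0.0), (660.0, 0.25, 0.0)]
tone = additive_synth(partials, sample_rate=8000, num_samples=80)
```

The cost of this direct form grows with the product of the number of partials and the number of samples, which is the computational burden noted above.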
Furthermore, the greatest benefit is obtained when additive synthesis is used to produce complex, realistic sounds in real time. That is, the synthesis system should be able to accept a series of records each specifying the parameters for a large number of partials and to produce from those records a complex, interesting, realistic sound without any user-perceptible delay.
Two approaches to additive synthesis have been followed. In the first approach (the time-domain, or wavetable, approach), the equivalent of a bank of oscillators has been used to directly generate sinusoidal partials. The frequency and amplitude values of all of the partials have been applied to the oscillators in the oscillator bank, and the resulting partials have been added together to produce the final sound. The requirement of directly computing each partial individually has limited the number of partials that may be included in a sound so as to allow the sound to be produced in a reasonable period of time.
In the second approach (the frequency-domain approach), partials have been specified and added in the frequency domain to produce a spectrum, or frequency-domain representation, of the final sound. The inverse Fourier transform is then used to compute the time-domain representation of the final sound, from which the sound is then produced.
An IFFT additive synthesis technique is described in U.S. Pat. No. 5,401,897, incorporated herein by reference. In the described additive sound synthesis process, sample blocks are determined by carrying out the inverse Fourier transform of successive frequency spectra. The sample blocks are time-superimposed and added to form a sequence of samples representing a sound wave. The latter procedure is known as overlap-add.
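The overlap-add procedure may be sketched, under simplifying assumptions, as follows. The sketch uses a naive inverse discrete Fourier transform in place of an IFFT and omits the windowing that a practical system would apply; the names `inverse_dft` and `overlap_add` and the hop size are hypothetical and are not taken from the referenced patent.

```python
import cmath
import math

def inverse_dft(spectrum):
    """Naive O(N^2) inverse DFT; returns the real part of each time-domain sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]

def overlap_add(spectra, hop):
    """Inverse-transform each spectrum and sum the blocks at offsets of `hop` samples."""
    n = len(spectra[0])
    out = [0.0] * (hop * (len(spectra) - 1) + n)
    for i, spectrum in enumerate(spectra):
        block = inverse_dft(spectrum)
        for t, sample in enumerate(block):
            out[i * hop + t] += sample  # time-superimpose and add
    return out

# Two identical 8-point spectra, each holding one cosine cycle (bins 1 and 7).
frame_spectrum = [0, 4, 0, 0, 0, 0, 0, 4]
signal = overlap_add([frame_spectrum, frame_spectrum], hop=4)
```

Because each block here is unwindowed and the second block starts half a cycle after the first, the overlapping region cancels; a real implementation would choose the window and hop so that successive blocks cross-fade smoothly.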
Other patents relating to additive sound synthesis include the following: U.S. Pat. No. 4,856,068; U.S. Pat. No. 4,885,790; U.S. Pat. No. 4,937,873; U.S. Pat. No. 5,029,509; U.S. Pat. No. 5,054,072; and U.S. Pat. No. 5,327,518; all of which are incorporated herein by reference.
Prior art additive synthesis methods of the type described, however, have remained limited in several respects. Many of these limitations are addressed and overcome in copending U.S. patent application Ser. No. 08/551,889 (Attorney's Docket No. 028726-008), entitled Inverse Transform Narrow Band/Broad Band Additive Synthesis, filed on even date herewith and incorporated herein by reference. Not addressed in the foregoing patent application is the problem of constructing a suitable control structure that may be used to control additive sound synthesis in real time. Prior art methods have typically been limited to generating and playing sound described by pre-stored, analyzed parameters rather than values that change in real time during synthesis.
As recognized by the present inventors, the problem of constructing a suitable control structure that may be used to control additive sound synthesis in real time involves two sub-problems. One problem is to provide a user interface that may be readily understood and that requires only a minimum of control input signals. In other words, the user interface must offer simplicity to the user. Another problem is to translate this simplicity seen by the user into the complexity often required by the synthesizer and to do so in a time-efficient and hardware-efficient manner.
An important contribution to the user interface problem is found in Wessel, Timbre Space as a Musical Control Structure, Computer Music Journal 3 (2): 45-52, 1979, incorporated herein by reference. A fundamental musical property is that of timbre, i.e., the tone and quality of sound produced by a particular instrument. For example, a violin and a saxophone each have distinctively different timbres that are readily recognizable. The foregoing paper describes how to construct a perceptually uniform timbre space.
A timbre space is a geometric representation wherein particular sounds with certain qualities or timbres are represented as points. The timbre space is said to be perceptually uniform if sounds of similar timbre or quality are proximate in the space and sounds with marked difference in timbre or quality are distant. In such a perceptually uniform timbre space, perceptual similarity of timbres is inversely related to distance.
The basic idea is that by specifying coordinates in a particular timbre space, one is able to hear the timbre represented by those coordinates (e.g., a violin). If these coordinates should fall between existing tones in the space (e.g., in between a violin and a saxophone), an interpolated timbre results that relates to the other sounds in a manner consistent with the structure of the space. Smooth, finely graded timbral transitions can thus be formed, with the distance moved within the timbre space bearing a uniform relationship to the audible change in timbre.
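As a minimal sketch of the interpolation idea, one may assume that each timbre is represented by a list of partial amplitudes and that a position between two timbres is expressed as a single value from 0 to 1. Both assumptions, the function name `interpolate_timbre`, and the amplitude values are hypothetical simplifications; the cited paper derives its space from perceptual data, and a plain linear blend of amplitudes is not guaranteed to be perceptually uniform.

```python
def interpolate_timbre(timbre_a, timbre_b, position):
    """Linearly blend two partial-amplitude lists; position 0.0 gives A, 1.0 gives B."""
    if len(timbre_a) != len(timbre_b):
        raise ValueError("timbres must describe the same number of partials")
    return [(1.0 - position) * a + position * b for a, b in zip(timbre_a, timbre_b)]

# Hypothetical amplitude profiles for the first four partials of two instruments.
violin = [1.0, 0.6, 0.4, 0.2]
saxophone = [1.0, 0.2, 0.8, 0.0]
halfway = interpolate_timbre(violin, saxophone, 0.5)
```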
Also discussed in the paper is the need to reduce the considerable quantity of data required by a general synthesis technique such as additive synthesis without sacrificing richness in the sonic result. The approach suggested is the use of straight-line-segment approximations to approximate curvilinear envelope functions.
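A straight-line-segment approximation of an envelope may be sketched as follows. Each envelope is stored as a short list of (time, value) breakpoints rather than as one value per sample, which is the source of the data reduction. The function name `envelope_value` and the breakpoint values are assumptions made for the example.

```python
def envelope_value(breakpoints, t):
    """Evaluate a piecewise-linear envelope given sorted (time, value) breakpoints."""
    if t <= breakpoints[0][0]:
        return breakpoints[0][1]  # hold the first value before the envelope starts
    for (t0, v0), (t1, v1) in zip(breakpoints, breakpoints[1:]):
        if t <= t1:
            # Linear interpolation within the segment that contains t.
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return breakpoints[-1][1]  # hold the last value after the envelope ends

# Hypothetical attack/decay/release shape for one partial's amplitude.
amp_env = [(0.0, 0.0), (0.05, 1.0), (0.30, 0.6), (1.00, 0.0)]
mid_attack = envelope_value(amp_env, 0.025)
```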
More recently, advances in machine learning techniques such as neural networks have been applied to the second sub-problem, that is, translating the simplicity seen by the user into the complexity often required by the synthesizer and doing so in a time-efficient and hardware-efficient manner. Neural networks may be considered to be representative of a broader class of adaptive function mappers that map musical control parameters to the parameters of a synthesis algorithm. The synthesis algorithm typically has a large number of input parameters. The user interface, also referred to as the gestural interface, typically supplies fewer parameters. The adaptive function mapper is therefore required to map from a low dimensional space to a high dimensional space.
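The low-to-high dimensional mapping may be illustrated by the forward pass of a small feedforward network. This is a sketch only: the layer sizes, the tanh activation, and the random weights (standing in for trained parameters) are all assumptions, and the names `make_layer` and `forward` are hypothetical.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights; a stand-in for trained values."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(layers, inputs):
    """Propagate inputs through each layer, applying tanh on hidden layers only."""
    activations = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        sums = [sum(w * a for w, a in zip(row, activations)) + b
                for row, b in zip(weights, biases)]
        is_last = index == len(layers) - 1
        activations = sums if is_last else [math.tanh(s) for s in sums]
    return activations

rng = random.Random(0)
num_controls, num_hidden, num_partials = 2, 8, 16
layers = [make_layer(num_controls, num_hidden, rng),
          make_layer(num_hidden, 2 * num_partials, rng)]
# Two control values in; one frequency value and one amplitude value per partial out.
synth_params = forward(layers, [0.3, 0.7])
```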
The use of a neural network in an electronic musical instrument is described in U.S. Pat. No. 5,138,924, incorporated herein by reference. Referring to FIG. 1, in accordance with the foregoing patent, a neural network 134 is used to translate user inputs from a wind controller 135 to outputs used by a synthesizer 137 of an electronic musical instrument. The synthesizer 137 is shown as being an oscillator bank. In operation, the player blows in breath from the mouthpiece 140, and controls the key system 141 with the fingers of both hands to play the instrument. Each key composing the key system 141 is an electronic switch. The ON/OFF signals caused by operation are input to the input layer 142 of the neural network 134. The neural network 134 is a hierarchical neural network having four layers, namely an input layer 142, a first intermediate layer 143, a second intermediate layer 144, and an output layer 145.
The number of neurons of the output layer 145 is equal to the number of oscillators 146 and attenuators 147. Each pair of neurons of the output layer 145 outputs the frequency control signal of the sine wave to be generated to the respective oscillator 146 and an amplitude control signal to the corresponding attenuator 147. The sine wave generated by the oscillator is attenuated to the specified amplitude value and input to an adding circuit 148. In the adding circuit 148 all the sine waves are added together with the resulting synthesis signal being input to the D/A converter 149. In the D/A converter 149 the synthesis signal is shaped to obtain a smooth envelope and is then output as a musical sound, which is amplified by a sound system (not shown).
In the foregoing arrangement, because additive synthesis is used, it is possible to use the results of analysis by FFT as training patterns for the neural network. That is, a musical tone of a specific pitch of the musical instrument to be learned is FFT-analyzed, and the results of the FFT (to which the ON/OFF pattern used to generate the tone corresponds) are input to the neural network as a training pattern. This process is performed for the entire range of tones to be produced.
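The construction of one such training pattern may be sketched as follows, assuming a naive discrete Fourier transform in place of an FFT and a single analysis frame per tone. The function names, the five-key ON/OFF pattern, and the synthetic test tone are hypothetical and serve only to show how a key pattern is paired with analyzed magnitudes.

```python
import cmath
import math

def partial_magnitudes(samples, num_bins):
    """Return DFT magnitudes for bins 0..num_bins-1 (num_bins must be below len(samples)/2).

    Bins above zero are scaled by 2/N so that a sinusoid of amplitude A reads A.
    """
    n = len(samples)
    mags = []
    for k in range(num_bins):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        scale = 1.0 / n if k == 0 else 2.0 / n
        mags.append(abs(coeff) * scale)
    return mags

def make_training_pattern(key_states, samples, num_bins):
    """Pair a key ON/OFF pattern (network input) with analyzed magnitudes (network target)."""
    return list(key_states), partial_magnitudes(samples, num_bins)

# Hypothetical recorded tone: one 64-sample frame holding exactly 3 cycles at amplitude 0.5.
frame = [0.5 * math.cos(2.0 * math.pi * 3 * t / 64) for t in range(64)]
inputs, targets = make_training_pattern([1, 0, 1, 1, 0], frame, num_bins=8)
```

Repeating this for each tone in the instrument's range would yield the full set of training patterns.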
Many of the techniques employed in additive music synthesis have been adopted from work in the area of speech analysis and synthesis. Further information regarding the application of neural networks and machine learning techniques to music synthesis can be found in Rahim, Artificial Neural Networks for Speech Analysis/Synthesis, Chapman & Hall, 199?.
Despite the known use of adaptive function mappers that map musical control parameters to the parameters of a synthesis algorithm, there remains a need for an improved control structure for music synthesis in which: 1) the sound representation provided to the adaptive function mapper allows for a greatly increased degree of control over the sound produced; and 2) training of the adaptive function mapper is performed using an error measure, or error norm, that greatly facilitates learning while ensuring perceptual identity of the produced sound with the training example. The present invention addresses this need.