A goal for high-fidelity reproduction of recorded or transmitted sounds is the presentation, at another time or location, of as faithful a representation of an "original" sound field as possible given the limitations of the presentation or reproduction system. A sound field is defined as a collection of sound pressures which are a function of time and space. Thus, high-fidelity reproduction attempts to recreate the acoustic pressures which existed in the original sound field in a region about a listener.
Ideally, differences between the original sound field and the reproduced sound field are inaudible, or if not inaudible at least relatively unnoticeable to most listeners. Two general measures of fidelity are "sound quality" and "sound field localization."
Sound quality includes characteristics of reproduction such as frequency range (bandwidth), accuracy of relative amplitude levels throughout the frequency range (timbre), range of sound amplitude level (dynamic range), accuracy of harmonic amplitude and phase (distortion level), and amplitude level and frequency of spurious sounds and artifacts not present in the original sound (noise). Although most aspects of sound quality are susceptible to measurement by instruments, in practical systems characteristics of the human hearing system (psychoacoustic effects) render inaudible or relatively unnoticeable certain measurable deviations from the "original" sounds.
Sound field localization is one measure of spatial fidelity. The preservation of the apparent direction (both azimuth and elevation) and distance of a sound source is sometimes known as angular and depth localization, respectively. In the case of certain orchestral and other recordings, such localization is intended to convey to the listener the actual physical placement of the musicians and their instruments. With respect to other recordings, particularly multiple track recordings produced in a studio, the angular directionality and depth may bear no relationship to any "real-life" arrangement of sound sources and the localization is merely a part of the overall artistic impression intended to be conveyed to the listener. For example, speech seeming to originate from a specific point in space may be added to a pre-recorded sound field. In any case, one purpose of high-fidelity multi-channel reproduction systems is to reproduce spatial aspects of an on-going sound field, whether real or synthesized. As with sound quality, in practical systems measurable changes in localization are, under certain conditions, inaudible or relatively unnoticeable because of characteristics of human hearing.
It is sufficient to recognize that a sound-field producer may develop recorded or transmitted signals which, in conjunction with a reproduction system, will present to a human listener a sound field possessing specific characteristics in sound quality and sound field localization. The sound field presented to the listener may closely approximate the ideal sound field intended by the producer or it may deviate from it depending on many factors including the reproduction equipment and acoustic reproduction environment.
A sound field captured for transmission or reproduction is usually represented at some point by one or more electrical signals. Such signals usually constitute one or more channels at the point of sound field capture ("capture channels"), at the point of sound field transmission or recording ("transmission channels"), and at the point of sound field presentation ("presentation channels"). Although, within some limits, the ability to reproduce complex sound fields increases as the number of these sound channels increases, practical considerations impose limits on the number of such channels.
In most, if not all, cases, the sound field producer works in a relatively well defined system in which there are known presentation channel configurations and environments. For example, a two-channel stereophonic recording is generally expected to be presented through either two presentation channels ("stereophonic") or one presentation channel ("monophonic"). The recording is usually optimized to sound good to most listeners having either stereophonic or monophonic playback equipment. As another example, a multiple-channel recording in stereo with surround sound for motion pictures is made with the expectation that motion picture theaters will have either a known, generally standardized arrangement for presenting the left, center, right, bass and surround channels or, alternatively, a classic "Academy" monophonic playback. Such recordings are also made with the expectation that they will be played by home playback equipment ranging from single presentation-channel systems such as a small loudspeaker in a television set to relatively sophisticated multiple presentation-channel surround-sound systems.
Various techniques attempt to reduce the number of transmission channels required to carry signals representing multiple-dimensional sound fields. One example is a 4-2-4 matrix system which combines four channels into two transmission channels for transmission or storage, from which four presentation channels are extracted for playback. Another more sophisticated technique is subband steering which exploits psychoacoustic principles to reduce the number of transmission channels without degrading the subjective quality of the sound field. An encoder/decoder system utilizing subband steering is disclosed in U.S. patent application Ser. No. 07/638,896.
Such techniques may be used without departing from the scope of the present invention; however, it may not always be desirable to do so. The use of these techniques makes it necessary to develop the concept of a "delivery channel." A delivery channel represents a discrete encoder channel, or a set of information which is independently encoded. A delivery channel corresponds to a transmission channel in systems which do not use techniques to reduce the number of transmission channels. For example, a 4-2-4 matrix system carries four delivery channels over two transmission channels, ostensibly for playback using four presentation channels. The present invention is directed toward selecting a number of presentation channels which differs from the number of delivery channels.
An example of a simple prior art technique which generates one presentation channel in response to two delivery channels is the summing of the two delivery channels to form one presentation channel. If the signal is sampled and digitally encoded using Pulse Code Modulation (PCM), the summation of the two delivery channels may be performed in the digital domain by adding PCM samples representing each channel and converting the summed samples into an analog signal using a digital-to-analog converter (DAC). The summation of two PCM coded signals may also be performed in the analog domain by converting the PCM samples for each delivery channel into an analog signal using two DACs and summing the two analog signals. Performing the summation in the digital domain is usually preferred because a digital adder is generally more accurate and less expensive to implement than a high-precision DAC.
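As a minimal sketch of the digital-domain summation described above, the following Python fragment adds two channels of linear PCM sample by sample. The 16-bit word length, the function name sum_pcm_to_mono, and the use of saturation to handle overflow are assumptions made for illustration; the text above does not specify them.

```python
from typing import List

# Assumed word length: signed 16-bit linear PCM.
INT16_MIN, INT16_MAX = -32768, 32767


def sum_pcm_to_mono(left: List[int], right: List[int]) -> List[int]:
    """Add two linear PCM channels sample by sample, clipping to 16 bits."""
    if len(left) != len(right):
        raise ValueError("channels must have the same number of samples")
    mono = []
    for a, b in zip(left, right):
        s = a + b  # plain integer addition is valid because the PCM is linear
        # Saturate rather than wrap around (an illustrative design choice).
        mono.append(max(INT16_MIN, min(INT16_MAX, s)))
    return mono


if __name__ == "__main__":
    print(sum_pcm_to_mono([1000, -2000, 30000], [500, -500, 10000]))
```

The summed samples would then be passed to a single DAC. A practical system might instead scale each channel before adding so that the sum cannot overflow; that choice is outside the scope of this sketch.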
This technique becomes much more complex, however, if signal samples are digitally encoded in a nonlinear form rather than encoded in linear PCM. Nonlinear forms may be generated by encoding methods such as logarithmic quantizing, normalizing floating-point representations, and adaptively allocating bits to represent each sample.
Nonlinear representations are frequently used in encoder/decoder systems to reduce the amount of information required to represent the coded signal. Such representations may be conveyed by transmission channels with reduced informational capacity, such as lower bandwidth or noisy transmission paths, or by recording media with lower storage capacity.
Nonlinear representations need not reduce informational requirements. Various forms of information packing may be used only to facilitate transmission error detection and correction. The broader terms "formatted" and "formatting" will be used herein, therefore, to refer to nonlinear representations and to obtaining such representations, respectively. The terms "deformatted" and "deformatting" will refer to reconstructed linear representations and to obtaining such reconstructed linear representations, respectively.
It should be mentioned that what constitutes a "linear" representation depends upon the signal processing methods employed. For example, floating-point representation is linear for a Digital Signal Processor (DSP) which can perform arithmetic with floating-point operands, but such representation is not linear for a DSP which can only perform integer arithmetic. The significance of "linear" will be discussed further in connection with the DETAILED DESCRIPTION OF THE INVENTION, below.
A decoder must use deformatting techniques inverse to the formatting techniques used to format the information to obtain a representation like PCM which can be summed as described above.
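The need for deformatting before summation can be illustrated with a small Python sketch. It assumes a continuous logarithmic (mu-law style) compression curve as a stand-in for a formatted representation; the constant MU and the function names are hypothetical, and the quantizing step of a real logarithmic quantizer is omitted for brevity.

```python
import math

MU = 255.0  # assumed companding constant, chosen only for illustration


def format_sample(x: float) -> float:
    """Logarithmically compress a linear sample in the range [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def deformat_sample(y: float) -> float:
    """Inverse of format_sample: recover the linear sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def sum_formatted(ya: float, yb: float) -> float:
    """Sum two formatted samples correctly: deformat each, then add."""
    return deformat_sample(ya) + deformat_sample(yb)


if __name__ == "__main__":
    ya, yb = format_sample(0.1), format_sample(0.2)
    print("deformat, then add:", sum_formatted(ya, yb))
    print("add formatted values directly:", deformat_sample(ya + yb))
```

In this sketch, deformatting each sample before adding recovers the linear sum, whereas adding the formatted values directly and deformatting the result yields a value that bears no useful relationship to that sum.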
Two encoding techniques which utilize formatting to reduce informational requirements are subband coding and transform coding. Subband and transform coders attempt to reduce the amount of information transmitted in particular frequency bands where the resulting coding inaccuracy or coding noise is psychoacoustically masked by neighboring spectral components. Psychoacoustic masking effects usually may be more efficiently exploited if the bandwidths of the frequency bands are chosen commensurate with the bandwidths of the human ear's "critical bands." See generally, the Audio Engineering Handbook, K. Blair Benson ed., McGraw-Hill, San Francisco, 1988, pages 1.40-1.42 and 4.8-4.10. Throughout the following discussion, the term "subband" shall refer to portions of the useful signal bandwidth, whether implemented by a true subband coder, a transform coder, or other technique. The term "subband coder" shall refer to true subband coders, transform coders, and other coding techniques which operate upon such "subbands."
Signals in a formatted form cannot be summed directly; therefore, each of the two delivery channels must be decoded before they can be combined by summation. Generally, decoding techniques such as subband decoding are relatively expensive to implement. Therefore, monophonic presentation of a two-channel signal is approximately twice as costly as monophonic presentation of a one-channel signal. The cost is approximately double because an expensive decoder is needed for each delivery channel.
One prior art technique which avoids burdening the cost of monophonic presentation of two-channel signals is matrixing. It is important to distinguish matrixing used to reduce the number of presentation channels from matrixing used to reduce the number of transmission channels. Although they are mathematically similar, the two techniques are directed to very different aspects of signal transmission and reproduction.
One simple example of matrixing encodes two channels, A and B, into SUM and DIFFERENCE delivery channels according to
\[ \mathrm{SUM} = A + B \]
and
\[ \mathrm{DIFFERENCE} = A - B. \]
For two-channel stereophonic playback, a presentation system can obtain the original two-channel signal by using two decoders to decode each delivery channel and de-matrixing the decoded channels according to
\[ A' = \tfrac{1}{2} \cdot (\mathrm{SUM} + \mathrm{DIFFERENCE}) \]
and
\[ B' = \tfrac{1}{2} \cdot (\mathrm{SUM} - \mathrm{DIFFERENCE}). \]
The notation A' and B' is used to represent the fact that in practical systems, the signals recovered by de-matrixing generally do not exactly correspond to the original matrixed signals.
For monophonic playback, a presentation system can obtain a summation of the original two-channel signal by using only one decoder to decode the SUM delivery channel.
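The matrixing and de-matrixing relationships above can be sketched in Python as follows. The function names are hypothetical, and the sketch deliberately omits any formatting or coding of the delivery channels, so here the recovered channels equal the originals exactly, which a practical system generally would not achieve.

```python
from typing import List, Tuple


def matrix_encode(a: List[float], b: List[float]) -> Tuple[List[float], List[float]]:
    """Form the SUM and DIFFERENCE delivery channels from channels A and B."""
    sum_ch = [x + y for x, y in zip(a, b)]
    diff_ch = [x - y for x, y in zip(a, b)]
    return sum_ch, diff_ch


def matrix_decode(sum_ch: List[float], diff_ch: List[float]) -> Tuple[List[float], List[float]]:
    """Recover A' and B' for two-channel playback; both delivery channels are needed."""
    a_rec = [0.5 * (s + d) for s, d in zip(sum_ch, diff_ch)]
    b_rec = [0.5 * (s - d) for s, d in zip(sum_ch, diff_ch)]
    return a_rec, b_rec


def mono_from_sum(sum_ch: List[float]) -> List[float]:
    """Monophonic playback uses only the SUM channel, hence only one decoder."""
    return list(sum_ch)


if __name__ == "__main__":
    a, b = [1.0, 2.0, -3.0], [0.5, -2.0, 1.0]
    s, d = matrix_encode(a, b)
    print(matrix_decode(s, d))
    print(mono_from_sum(s))
```

The point of the arrangement is visible in mono_from_sum: the monophonic path never touches the DIFFERENCE channel, so the second decoder can be omitted.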
Although matrixing solves the problem of disproportionate cost for monophonic presentation of two delivery channels, it suffers from what may be perceived as cross-channel noise modulation when it is used in conjunction with encoding techniques which reduce the informational requirements of the encoded signal. For example, "companding" may be used for analog signals, and various bit-rate reduction methods may be used for digital signals. The application of such techniques stimulates noise in the output signal of the decoder. The intent and expectation is that this noise is masked by the audio signal which stimulated it, thus making it inaudible. When such techniques are applied to matrixed signals, the de-matrixed signal may be incapable of masking the noise.
Assume that a matrix encoder encodes channels A and B where only channel B contains an audio signal. The SUM and DIFFERENCE signals are coded for transmission with an analog compander or a digital bit-rate reduction technique. During decoding, the A' presentation channel will be obtained from the sum of the SUM and DIFFERENCE delivery channels. Although the A' presentation channel will not contain any audio signal, it will contain the sum of the analog modulation noise or the digital coding noise independently injected into each of the SUM and DIFFERENCE delivery channels. The A' presentation channel will not contain any audio signal to psychoacoustically mask the noise. Furthermore, the noise in channel A' may not be masked by the audio signal in channel B' because the ear can usually discern noise and audio signals with different angular localization.
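The example above can be written out as a short derivation. The symbols b (the audio signal in channel B), n_S and n_D (the noise independently injected into the decoded SUM and DIFFERENCE channels), and the hatted quantities (the decoded delivery channels) are notation assumed here for illustration; the additive noise model is a simplification.

```latex
\begin{align*}
A &= 0, \qquad B = b \\
\mathrm{SUM} &= A + B = b, \qquad \mathrm{DIFFERENCE} = A - B = -b \\
\widehat{\mathrm{SUM}} &= b + n_S, \qquad \widehat{\mathrm{DIFFERENCE}} = -b + n_D \\
A' &= \tfrac{1}{2}\bigl(\widehat{\mathrm{SUM}} + \widehat{\mathrm{DIFFERENCE}}\bigr)
    = \tfrac{1}{2}\,(n_S + n_D) \\
B' &= \tfrac{1}{2}\bigl(\widehat{\mathrm{SUM}} - \widehat{\mathrm{DIFFERENCE}}\bigr)
    = b + \tfrac{1}{2}\,(n_S - n_D)
\end{align*}
```

The audio signal cancels in A', leaving only noise, while B' carries both the signal and noise. Because n_S and n_D are stimulated by b, the noise in A' rises and falls with the signal in the other channel.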
Techniques used to control the number of presentation channels become even more of a problem when more than two delivery channels are involved. For example, motion picture soundtracks typically contain four channels: Left, Center, Right, and Surround. Some current proposals for future motion picture and advanced television applications suggest five channels plus a sixth limited bandwidth subwoofer channel. When multiple-channel signals in a formatted form are delivered to consumers for playback on monophonic and two-channel home equipment, the question arises how to economically obtain a signal suitable for one- and two-channel presentation while avoiding the cross-channel noise modulation effect described above.