1. Technical Field
The invention relates generally to detecting and localizing sound. More particularly, this invention relates to detecting and/or localizing sound that includes sound events in a complex sound field.
2. Related Art
Many sound-based applications, such as audio reproduction systems, audio and speech coding systems, speech recognition systems, and audio amplification systems, require the ability to distinguish and detect certain types of sound and to determine the directions from which the sound emanates or originates. The ability to detect certain types of sound is important to applications such as sound amplification, while the ability to detect and localize sound is crucial to applications involving sound reproduction. Unfortunately, the detection and localization of sound can be very complicated because, whether live or reproduced, sound generally consists of a complicated combination of many different sounds, which rarely occur by themselves.
These many different sounds may occur over time to form a complex sound field, in which the sounds can overlap, occur one after the other, or occur in any combination of the two. One way in which the individual sounds in a sound field are classified is according to whether an individual sound has emanated or originated from a particular direction. Sounds that can be detected as emanating or originating from a particular direction are referred to as directional sounds, while sounds that can be detected as emanating or originating from no particular direction at all are referred to as non-directional sounds. Another way of classifying individual sounds is according to whether an individual sound is a transient or a steady-state sound. Steady-state sounds are those that have a generally constant level of power over time, such as a sustained musical note. Steady-state sounds can be directional or non-directional sounds. Transient sounds (or “transients”) are sounds that have an initial energy spike, such as a shout or a drum hit. Transients can also be directional or non-directional sounds. An example of a non-directional transient sound is speech in a reverberant space where the direct speech is blocked by an object. In this case, if the reverberation time of the speech is less than one second, the time characteristics of the signal are preserved, but information about its direction is lost.
Directional transients are referred to in this application collectively as “sound events.” Two types of sound events are syllables and impulsive sounds. Syllables include phonemes and notes. Phonemes are transient sounds that are characteristic of phones in human speech and can be particularly useful in detecting and localizing syllables in human speech. Notes are the individual notes created, for example, by a musical instrument. Syllables, including notes and phonemes, generally have the following characteristics: a finite duration of at least about 50 ms up to about 200 ms, but typically about 150 ms; rise times of about 33 ms; generally occur no more frequently than about once every 0.2 seconds to about once every 0.5 seconds; and may have low or high volume (amplitude). In contrast, impulsive sounds are transients of very short duration, such as a drum hit or fricatives and plosives in speech. Impulsive sounds generally have the following characteristics: a short duration of about 5 ms to about 50 ms, rise times of about 1 ms to about 10 ms, and high volume.
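The duration and rise-time ranges above suggest a simple rule-of-thumb distinction between the two types of sound events. The following sketch is purely illustrative: the function name and exact thresholds are assumptions for the purpose of the example, not part of the description above.

```python
def classify_transient(duration_ms, rise_time_ms):
    """Rough rule-of-thumb classification of a directional transient
    using the duration and rise-time ranges described above.
    Thresholds are illustrative assumptions only."""
    if 5 <= duration_ms <= 50 and 1 <= rise_time_ms <= 10:
        return "impulsive"      # e.g. a drum hit or a plosive in speech
    if 50 <= duration_ms <= 200 and rise_time_ms >= 20:
        return "syllable"       # e.g. a phoneme or a musical note
    return "unclassified"

classify_transient(150, 33)     # a typical syllable
classify_transient(8, 2)        # a typical impulsive sound
```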
To detect sounds in a sound field, whether generated live or as a reproduction, generally the sound field need only be generated in one input or “input channel.” However, to localize sounds, generally the sound field needs to be generated in at least two inputs or input channels. The archetype for sound localization is natural hearing, where the azimuth of the sound is detected primarily by the arrival time difference between the two input channels represented by the two ears. When localizing sounds electronically, the azimuth of a sound source is determined primarily by the amplitude and phase relationships between the signals generated in two or more input channels. Generally, in order to describe the azimuth of directional sounds from these input channels, the direction of the source of these sounds is described in terms of an angle between each corresponding pair of input channels (each an “input channel pair”). If sounds are generated in only two channels, the directions of the sounds are given in terms of an angle for that input channel pair, generally a left/right angle “lr.” In this case, the value for lr ranges from about −45 degrees to about 45 degrees, with −45 degrees indicating that the sound field originates from the left input channel, 45 degrees indicating that the sound field originates from the right input channel, and 0 degrees indicating that the sound field originates from a position in the middle, precisely between the right and left input channels (a position often referred to as “center”).
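One common convention for mapping the relative magnitudes of a left/right input channel pair onto the −45 to +45 degree lr range described above is an arctangent of the rectified channel amplitudes. The sketch below assumes that convention; it is an illustration of the lr angle's endpoints, not the method of any particular decoder.

```python
import math

def lr_angle(left, right):
    """Map the relative magnitudes of a left/right input channel pair
    onto a left/right steering angle lr in degrees:
    -45 for a left-only signal, +45 for a right-only signal, and
    0 when both channels carry equal amplitude ("center").
    The arctangent mapping is an illustrative assumption."""
    l, r = abs(left), abs(right)        # rectify the channel voltages
    if l == 0 and r == 0:
        return 0.0                      # silent field: no direction
    return math.degrees(math.atan2(r, l)) - 45.0
```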
However, when the sound field is generated in two channel pairs, such as in a surround sound system, a second directional component is specified. Even if the sound field is generated in only one channel pair, a second directional component may also be specified because it is often possible to derive an additional channel pair from the one channel pair. The second directional component may include a front/back or center/surround angle “cs.” The value for cs also ranges from about −45 degrees to about 45 degrees, with lr=0 and cs=45 degrees indicating that the sound field originates from the center input channel only, and lr=0 and cs=−45 degrees indicating that the sound field originates from the rear input channel only. Similarly, lr=−45 and cs=0 degrees indicates that the sound field originates from the left, and lr=45 and cs=0 degrees indicates that the sound field originates from the right. Additionally, lr=−22.5 degrees and cs=−22.5 degrees indicates that the sound field originates from the left rear, and lr=22.5 and cs=−22.5 indicates that the sound field originates from the right rear.
One known technique for determining these angles is used in reproducing recorded sound. In general, this known technique determines the intended direction of sounds by comparing the amplitudes of the signals in one input channel of an input channel pair with the amplitudes of the signals in the other input channel of that pair (generally, the left with the right, and the center with the surround). More specifically, the ratio of these amplitudes is used to determine what is generally referred to as an “ordinary steering angle” or “OSA” for each input channel pair. To obtain the OSA, the voltage signals in each input channel of an input channel pair are rectified and the logarithms of the rectified voltages are taken. By subtracting the logarithm of the rectified voltage of one input channel from the logarithm of the rectified voltage of the other input channel in the input channel pair, a signal is produced that equals the logarithm of the ratio of the voltages in the input channel pair which, when converted back into the magnitude domain, is the ordinary steering angle. In surround reproduction systems, this determination is often made by a device called a matrix decoder.
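The log-domain procedure described above can be sketched as follows. The variable names are illustrative, the small epsilon guarding the logarithms is an implementation assumption, and the final arctangent mapping of the magnitude-domain ratio onto a −45 to +45 degree angle is one common convention rather than a step mandated by the description.

```python
import math

def ordinary_steering_angle(v1, v2, eps=1e-12):
    """Sketch of the ordinary steering angle (OSA) computation:
    rectify each channel voltage, take logarithms, subtract, and
    convert the resulting log-ratio back to the magnitude domain.
    The epsilon and the final arctangent mapping onto -45..+45
    degrees are illustrative assumptions."""
    log_ratio = math.log(abs(v1) + eps) - math.log(abs(v2) + eps)
    ratio = math.exp(log_ratio)          # back to the magnitude domain
    return math.degrees(math.atan(ratio)) - 45.0
```

Equal voltages in the two channels yield a ratio of one and hence an angle of zero; a strongly dominant channel drives the angle toward +45 or −45 degrees.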
Unfortunately, this known technique treats the entire sound field as if it contains only a single sound because it determines the direction of the entire sound field according to the relative voltage strength in each input channel. Therefore, many directional individual sounds will not be properly localized. In order to treat the sound field as a complex combination of many sounds, attempts have been made to devise filters that will separate the directional transient sounds (sound events) so that their directions can be independently determined. However, a fundamental problem is encountered when designing such a filter. If the filter is made fast enough to distinguish the fluctuations of all directional transient signals, it will also distinguish fluctuations characteristic of non-directional transient signals such as reverberation and noise. As a result, the rapid fluctuations of reverberation and noise are reproduced as directional changes in the sound, which severely degrades the quality of the reproduced sound. On the other hand, if the filter is made slow enough not to distinguish the fluctuations characteristic of the non-directional signals, the filter is generally too slow to distinguish the fluctuations of certain sound events, particularly impulsive sounds. As a result, many sound events are not properly localized. No matter how these filters are designed, they generally work well on only one type of music but not on all. For example, the fast filter will work well on complex popular music, which is full of rapid changes, but will reflect false directional changes (steer too greatly) when a highly-reverberant classical piece is reproduced.
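The fast-versus-slow filter trade-off described above can be illustrated with two one-pole smoothing filters applied to the same steering signal. The smoothing coefficients and the toy signal below are illustrative assumptions, chosen only to show that a fast filter tracks a brief event (and hence also noise), while a slow filter suppresses it.

```python
def smooth(signal, alpha):
    """One-pole low-pass smoother. alpha near 1 gives a fast filter
    that follows every fluctuation (including reverberation and
    noise); alpha near 0 gives a slow filter that misses brief
    sound events. An illustrative sketch of the trade-off."""
    out, y = [], signal[0]
    for x in signal:
        y = alpha * x + (1 - alpha) * y
        out.append(y)
    return out

# A brief impulsive event embedded in an otherwise steady steering signal:
steering = [0, 0, 40, 0, 0, 0, 0, 0]
fast = smooth(steering, 0.9)   # tracks the event, but also any noise
slow = smooth(steering, 0.1)   # rejects noise, but barely registers it
```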
Additional problems arise when sounds are recorded in a given number of input channels and then reproduced over a different number of channels. For example, two common classes of sound recording and reproduction techniques are stereo and surround. Sounds recorded for reproduction in stereo (two channels) are intended to be perceived as originating only from the front. Sounds recorded for reproduction in surround (any number of input channels greater than two, but generally five or seven channels) are intended to be perceived as originating from all around, generally with one or two input channels used to reproduce sounds from the rear. The techniques used to record sounds intended for reproduction in stereo are generally different from those used to record sounds intended for reproduction in surround. However, because surround systems are not universally used, sounds recorded for reproduction in surround generally need to be capable of high-quality reproduction in stereo. For example, in a typical five channel surround system, the sounds in the center channel are encoded into the right and left input channels so that the sounds included in the center channel “c” equal the sum of the sounds included in the left and right input channels (c=l+r). Similarly, the sounds in the surround channel are encoded into the left and right input channels so that the sounds included in the surround channel “s” equal the difference between the sounds included in the left and right input channels (s=l−r). In another example, the Dolby Surround® system, which records sounds for reproduction in surround, adds a negative phase to the sounds intended for reproduction from behind the listener (the rear). This negative phase is generally undetected by stereo reproduction systems and is transparent to the listener. However, the negative phase is detected by a surround reproduction system that then reproduces the associated sounds in the rear input channels. 
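The two-channel encoding relationships described above (c = l + r and s = l − r, with the surround content carried in opposite phase in the two channels) can be sketched element-wise. The function names and the one-half encoding gains below are illustrative assumptions, not the coefficients of any particular commercial system.

```python
def encode_surround(front, center, surround):
    """Fold a center and a surround signal into a stereo-compatible
    left/right pair consistent with the relationships above, so that
    a decoder recovers c as l + r and s as l - r. The surround
    content is carried with opposite phase in the two channels.
    The 0.5 gains are illustrative assumptions."""
    left = front[0] + 0.5 * center + 0.5 * surround
    right = front[1] + 0.5 * center - 0.5 * surround
    return left, right

def decode_surround(left, right):
    """Recover the center and surround components: c = l + r, s = l - r."""
    return left + right, left - right

# Pure center content and pure rear content round-trip through the matrix:
l, r = encode_surround((0.0, 0.0), center=1.0, surround=0.0)
c, s = decode_surround(l, r)   # c == 1.0, s == 0.0
```

Because the surround component appears with opposite sign in the two channels, a stereo system reproduces it from both speakers while a surround decoder detects the phase difference and routes it to the rear, consistent with the behavior described above.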
Unfortunately, many sounds naturally have negative phase, even when recorded in stereo format, and are therefore incorrectly reproduced in the rear input channels by a surround reproduction system. This can be distracting and unnatural.