Among sound playback types proposed in the related art, the stereo (2 channel) type, the 5.1 channel surround type (ITU-R BS.775-1), and the like are widely used in consumer products. The 2 channel type, as schematically illustrated in FIG. 1, is a type in which different pieces of audio data are output from a left speaker 11L and a right speaker 11R. The 5.1 channel surround type, as schematically illustrated in FIG. 2, is a type in which different pieces of audio data are input into, and output from, a left front speaker 21L, a right front speaker 21R, a center speaker 22C that is arranged between the left front speaker 21L and the right front speaker 21R, a left rear speaker 23L, a right rear speaker 23R, and a subwoofer (not shown) dedicated to a low frequency band (generally 20 Hz to 100 Hz).
Furthermore, in addition to the 2 channel type and the 5.1 channel surround type, various sound playback types have been proposed, such as a 7.1 channel type, a 9.1 channel type, and a 22.2 channel type. In any of the types described above, speakers are arranged circularly or spherically around a listener, and ideally the listener listens to audio at a listening position, a so-called sweet spot, that is equally distant from the speakers. For example, it is desirable that in the 2 channel type the listener listens to audio at the sweet spot 12, and that in the 5.1 channel surround type the listener listens to audio at the sweet spot 24. When the listener listens to audio at the sweet spot, a synthetic sound image resulting from sound pressure balance is localized at the place intended by the manufacturer. Conversely, when the listener listens to audio at a place other than the sweet spot, the sound image and the sound quality generally deteriorate. These types are hereinafter collectively referred to as a multi-channel playback type.
On the other hand, aside from the multi-channel playback type, there is a sound source object-oriented playback type. In this type, every sound is regarded as being emitted by some sound source object, and each sound source object (hereinafter referred to as a “virtual sound source”) includes its own positional information and audio signal. In an example of music content, each virtual sound source includes the sound of one musical instrument and positional information on the position at which that musical instrument is arranged.
The sound source object-oriented playback type is a playback type (that is, a wavefront synthesis playback type) in which wavefronts of sound are synthesized by a group of speakers that are arranged side by side in a linear or planar manner. Among wavefront synthesis playback types, the wave field synthesis (WFS) type disclosed in NPL 1 has been actively studied in recent years as one realistic implementation method that uses a group of speakers arranged side by side in a linear manner (hereinafter referred to as a speaker array).
This wavefront synthesis playback type differs from the multi-channel playback type described above in that it provides both a good sound image and good sound quality to a listener who listens to audio at any position in front of a group 31 of speakers that are arranged side by side, as schematically illustrated in FIG. 3. That is, a sweet spot 32 in the wavefront synthesis playback type is wide, as illustrated.
Furthermore, a listener who faces the speaker array and listens to audio in a sound space provided by the WFS type feels as if the sound that is actually emitted from the speaker array were emitted from a sound source (a virtual sound source) that is virtually present behind the speaker array.
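The impression of a source located behind the speaker array can be illustrated with a simplified delay-and-gain model. The following is a minimal sketch and not the driving function of NPL 1: the function name, the coordinate convention (array on the line y = 0, virtual sound source at negative y), the speed of sound, and the 1/sqrt(r) amplitude decay are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def speaker_delays_and_gains(source_xy, speaker_xs, speaker_y=0.0):
    """Return one (delay_seconds, gain) pair per speaker of a linear array.

    Simplified illustration only: each speaker replays the source signal,
    delayed by the extra propagation time from the virtual sound source and
    attenuated with distance. The source must not coincide with a speaker.
    """
    sx, sy = source_xy
    distances = [math.hypot(x - sx, speaker_y - sy) for x in speaker_xs]
    nearest = min(distances)
    result = []
    for d in distances:
        # Delay relative to the speaker closest to the virtual sound source.
        delay = (d - nearest) / SPEED_OF_SOUND
        # Assumed 1/sqrt(r) decay, normalized so the nearest speaker has gain 1.
        gain = math.sqrt(nearest / d)
        result.append((delay, gain))
    return result
```

For a hypothetical source at (0.0, -1.0) and speakers at x = -1.0, 0.0, and 1.0, the center speaker receives zero delay and unit gain, and the two outer speakers receive equal, larger delays and smaller gains, which approximates a curved wavefront that appears to originate behind the array.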
In the wavefront synthesis playback type, an input signal representing each virtual sound source is necessary. Generally, one virtual sound source has to include an audio signal for one channel and positional information on the virtual sound source. In the example of music content described above, an audio signal that is recorded for each musical instrument and positional information on the musical instrument are included. However, the audio signal of each virtual sound source does not necessarily have to correspond to one musical instrument; what is needed is to express, using the concept of a virtual sound source, the arrival direction and volume of each sound as intended by the content manufacturer.
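The input described above, one single-channel audio signal paired with positional information, can be sketched as a small data structure. The class name, the field names, and the two-dimensional (x, y) coordinate in meters are hypothetical and are chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class VirtualSoundSource:
    """One virtual sound source: a one-channel signal plus its position."""
    samples: List[float]            # audio signal for one channel
    position: Tuple[float, float]   # hypothetical (x, y) position in meters


def make_sources(signals: Sequence[Sequence[float]],
                 positions: Sequence[Tuple[float, float]]) -> List[VirtualSoundSource]:
    """Pair each one-channel signal with its position."""
    if len(signals) != len(positions):
        raise ValueError("each signal needs exactly one position")
    return [VirtualSoundSource(list(s), (float(p[0]), float(p[1])))
            for s, p in zip(signals, positions)]
```

In the music example, one such object could hold the recorded signal of a piano together with the position at which the piano is arranged, but nothing in the structure requires a one-to-one mapping between objects and instruments.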
Here, because the most widely used of the multi-channel types described above is the stereo (2 channel) type, stereo-type music content is considered. The L (left) channel and R (right) channel audio signals in the stereo-type music content are played back through a speaker 41L installed to the left and a speaker 41R installed to the right, as illustrated in FIG. 4. When the playback is performed in this manner, only in a case where the listener listens to audio at a point that is equally distant from the speaker 41L and the speaker 41R, that is, at a sweet spot 43, are the vocal and the bass sound heard from a middle position 42b, the piano sound heard from a left-side position 42a, the drum sound heard from a right-side position 42c, and so forth. Thus, only at the sweet spot 43 is the sound image localized and heard as intended by the manufacturer.
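How the balance between the L and R channels places a synthetic sound image between the two speakers can be illustrated with the stereophonic tangent law, a commonly used approximation that is not taken from this document. The sketch below assumes a listener at the sweet spot, speakers at plus and minus 30 degrees, and positive angles toward the left speaker; the function name and parameters are hypothetical.

```python
import math


def image_angle_deg(gain_left, gain_right, half_aperture_deg=30.0):
    """Estimate the perceived image angle from the L/R gain balance.

    Uses the tangent law: tan(angle) / tan(half_aperture) = (gL - gR) / (gL + gR).
    Valid only as an approximation for a listener at the sweet spot.
    """
    total = gain_left + gain_right
    if total <= 0:
        raise ValueError("at least one channel gain must be positive")
    balance = (gain_left - gain_right) / total
    tan_angle = balance * math.tan(math.radians(half_aperture_deg))
    return math.degrees(math.atan(tan_angle))
```

With equal gains the image sits at 0 degrees, which corresponds to a middle position such as 42b, whereas a signal present mostly in the L channel is drawn toward the left speaker. Away from the sweet spot this relation no longer holds, which is the limitation described above.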
Consider playing back such content using the wavefront synthesis playback type so that the localization of the sound image as intended by the content manufacturer, which is a characteristic of the wavefront synthesis playback type, is provided to the listener at any position. To do so, the sound image that is heard within the sweet spot 43 in FIG. 4 has to be heard from any listening position, as at a sweet spot 53 that is illustrated in FIG. 5. That is, at the wide sweet spot 53, the vocal and the bass sound have to be heard from a middle position 52b, the piano sound from a left-side position 52a, the drum sound from a right-side position 52c, and so forth, through a group 51 of speakers that are arranged side by side in a linear or planar manner. Thus, the sound image has to be localized and heard as intended by the manufacturer.
To solve such a problem, for example, consider a case where the L channel sound and the R channel sound are arranged as virtual sound sources 62a and 62b, respectively, as illustrated in FIG. 6. In this case, each of the L and R channels does not by itself represent one sound source; rather, a synthetic sound image is generated by the two channels together. Therefore, even when these virtual sound sources are played back using the wavefront synthesis playback type, a sweet spot 63 is still generated, and the sound image is localized only at the sweet spot 63, as in FIG. 4. That is, in order to realize the sound image localization, it is necessary to separate the 2 channel stereo data, by some means, into audio for each sound image, and to generate virtual sound source data from each piece of audio.
To solve this problem, in a method disclosed in PTL 1, 2 channel stereo data is separated into a correlation signal and a non-correlation signal based on a correlation coefficient of signal power for each frequency band, a synthetic sound image direction is estimated for the correlation signal, and a virtual sound source is generated from the result of the estimation and is played back using the wavefront synthesis playback type or the like.
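The idea of a per-band correlation coefficient between the L and R channels can be sketched as follows. This is not the procedure of PTL 1: the normalized cross-correlation of the band spectra used here, the naive discrete Fourier transform, the function names, and the band layout given as half-open bin ranges are all assumptions chosen only to make the concept concrete.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of a real frame; returns bins 0 .. N/2 (illustration only)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def band_correlation(left, right, band_edges):
    """Return one normalized correlation coefficient per frequency band.

    band_edges is a list of (start_bin, end_bin) half-open ranges; this
    banding is hypothetical. Values near 1 suggest that the band is dominated
    by a component common to both channels; values near 0 suggest that it is not.
    """
    spec_l, spec_r = dft(left), dft(right)
    coeffs = []
    for start, end in band_edges:
        bins = range(start, end)
        cross = sum((spec_l[k] * spec_r[k].conjugate()).real for k in bins)
        power_l = sum(abs(spec_l[k]) ** 2 for k in bins)
        power_r = sum(abs(spec_r[k]) ** 2 for k in bins)
        denom = math.sqrt(power_l * power_r)
        # Bands with no energy in either channel are reported as uncorrelated.
        coeffs.append(cross / denom if denom > 0 else 0.0)
    return coeffs
```

When a band contains the same component in both channels at different levels, the coefficient for that band approaches 1, so such a band would be a candidate for the correlation signal whose image direction is then estimated from the level balance. A component that appears in the two channels in quadrature gives a coefficient near 0 under this particular definition.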