The present disclosure relates to a signal processing method and apparatus for efficiently reproducing an audio signal, and more particularly, to an audio signal processing method and apparatus capable of adjusting the location of an audio object of an audio signal to correspond to the location of a visual object included in a video signal.
With the development of video and sound technology, a large amount of multimedia content that gives users a sense of immersion has been produced. The sense of immersion is an important factor in next-generation content such as 360-degree content or VR content. Content with a strong sense of immersion may make a user feel as if the user is present in the virtual world of the content, and may provide the user with a near-real experience.
In order to give content a sense of immersion, various issues should be considered during production. First, the video and audio of the multimedia content should be in harmony with each other. That is, the moment when the video content changes and the moment when the audio content changes should coincide temporally, and audio content related to video content should be located where that video content is located. Next, a visual object or audio object provided to a user should change in correspondence to the user's gaze or head movement. These interactive features are particularly important in the next-generation content described above, and creators of next-generation content regard a method for effectively generating images and audio that immediately reflect a user's movement or manipulation as a major challenge.
If the video and the audio are not in harmony with each other, the user's sense of immersion in the corresponding multimedia content disappears instantly, and the user may be unable to concentrate on the multimedia content because of the mismatch between the video and the audio. That is, if the locations of the visual objects in the video and the locations of the audio objects do not match, the user feels a sense of incongruity due to the inconsistency between the visual stimulus and the auditory stimulus. Also, in the case of next-generation content such as VR content, the sense of immersion may deteriorate if the location of an audio object does not change according to the direction of the user's head.
Accordingly, a method for matching the locations of a visual object and an audio object with each other during the production of the content is essential. However, when visual objects and audio objects are produced or created, it is not easy to match the locations of the two objects if their reference directions or reference locations differ. In addition, when the audio content has no interactive characteristics, as in the case of a multi-channel stereo audio signal, there is currently no method available for changing the audio content in response to a change in a visual object. Also, research is needed on a method for using, during the production of next-generation content, an audio signal whose sound location cannot be adjusted according to the direction of the user's head, as in the case of the above-mentioned stereo audio signal.