As a security apparatus, an information processing apparatus is known which, upon detecting an abnormal sound generated within a predetermined area, outputs a warning sound to report the detection (for example, see PTL 1). The information processing apparatus of PTL 1 also determines the type of the detected abnormal sound.
An apparatus is also known which uses a microphone array and a camera, provided integrally or separately, for monitoring an image in real time or for later playing back and checking previously picked-up sound. When a certain position on the image from the camera is selected, the apparatus forms the directivity of the sound in the direction from the microphone array toward the actual position corresponding to the selected position, and plays back the sound for which the directivity is formed (for example, see PTL 2).
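As an illustration only, forming the directivity of a sound toward a selected direction can be sketched with delay-and-sum processing, a commonly used technique. PTL 2 is not stated here to use this technique. The linear array geometry, the far-field assumption, the speed of sound, and the names steering_delays and delay_and_sum are all assumptions introduced for this sketch.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed approximate value for air at room temperature


def steering_delays(mic_positions, azimuth_deg):
    """Delays (seconds) to apply to each microphone of a linear array.

    Assumes microphones lie on the x axis at the given positions (metres)
    and a far-field source at azimuth_deg measured from broadside.
    """
    theta = math.radians(azimuth_deg)
    # A microphone at larger x (toward the source) hears the wavefront
    # earlier, so it must be delayed more to line up with the others.
    raw = [x * math.sin(theta) / SPEED_OF_SOUND for x in mic_positions]
    base = min(raw)  # shift so that every delay is non-negative
    return [r - base for r in raw]


def delay_and_sum(signals, delays, sample_rate):
    """Shift each channel by its delay (rounded to whole samples) and average."""
    shifts = [round(d * sample_rate) for d in delays]
    length = min(len(s) for s in signals)
    output = []
    for n in range(length):
        acc = 0.0
        for channel, k in zip(signals, shifts):
            if n - k >= 0:  # samples before the start are treated as silence
                acc += channel[n - k]
        output.append(acc / len(signals))
    return output
```

In this sketch, sound arriving from the steered direction adds coherently after the shifts, while sound from other directions is attenuated by the averaging. Selecting a different position on the image would correspond to recomputing the delays for a different azimuth.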
However, when an abnormal sound is detected anywhere within the area, a warning sound is output uniformly to notify the user. Depending on the situation in the area, it may be preferable not to output the warning sound, so as not to reduce the convenience of the user.
For example, a loud sound output from a television or an audio device provided in the area may be erroneously detected as an abnormal sound, which may degrade the convenience of the user who is an observer. Such a sound is included in a program for the entertainment of the viewer or listener of the television or audio device; in other words, it is a sound that is not regarded as an abnormal sound.
Furthermore, for a case where the directivity of a sound is formed with respect to the image from a camera having a fixed angle of view and an operation such as zoom-in or zoom-out is then performed on the image, no method of forming the directivity of the sound has been considered. Similarly, where the camera is a pan-tilt-zoom (PTZ) camera which can be freely driven in a pan direction and a tilt direction to change its optical axis, the image displayed on the display is switched by driving the PTZ camera, but no method of forming the directivity of the sound has been considered for such a case either.
Therefore, when a zoom operation or the like is performed on the image captured by the camera, the captured image displayed on the display and the position where the directivity of the sound is formed (sound position) no longer match (that is, they are not in one-to-one correspondence). For example, when a plurality of people displayed on the display are in conversation, even if the user zooms in on a specific person so that the face of the person is enlarged on the screen, a voice other than that person's voice is output from a speaker, or the person's voice is output at a low volume, such that the user is likely to feel uncomfortable.
Consequently, each time the screen displayed on the display is switched, the user has to specify a desired position on the image displayed on the switched screen so that the directivity of the sound is formed again in the direction corresponding to that position, and the operation of the user may become complicated.
An object of the present disclosure is to associate the displayed image of a sound pickup area with the position where the directivity of a sound picked up in the sound pickup area is formed, and to switch the position where the directivity of the sound is formed so as to follow the switching of the displayed image. A further object of the present disclosure is to clearly distinguish a sound which is not regarded as an abnormal sound.