There are techniques for monitoring persons and the like using acoustic information and video information. One example is a method that detects a specific speech pattern from a speech signal, acquires an image of the surroundings where the speech signal was acquired, and either processes the image by enlargement, filtering, interpolation, or the like, or generates a stereoscopic image of those surroundings, thus facilitating identification of any abnormality (for example, see Patent Literature (PTL) 1). Another example is a method that records sounds generated in a monitoring area and images of key locations using acoustic sensors and image sensors, detects a specific event by analyzing the acoustic data, tracks a mobile object based on the detection result, acquires image data of the mobile object, and performs image analysis (for example, see PTL 2). In both PTL 1 and PTL 2, speech or sound serves as a trigger for separate image processing.
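The trigger pattern shared by PTL 1 and PTL 2 can be illustrated with a minimal sketch. It is not the implementation of either document. A simple root-mean-square (RMS) energy threshold is assumed in place of the acoustic detector, and the names `frame_rms`, `audio_triggered_processing`, `process_image`, and `rms_threshold` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")
Result = TypeVar("Result")


def frame_rms(frame: Sequence[float]) -> float:
    """Return the root-mean-square amplitude of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def audio_triggered_processing(
    audio_frames: Sequence[Sequence[float]],
    images: Sequence[Image],
    process_image: Callable[[Image], Result],
    rms_threshold: float = 0.5,  # assumed value, for illustration only
) -> List[Tuple[int, Result]]:
    """Run image processing only for time steps whose audio frame triggers.

    The energy threshold is an assumed stand-in for the speech-pattern or
    acoustic-event detector; audio frames and images are assumed to be
    time-aligned one to one.
    """
    results = []
    for index, (frame, image) in enumerate(zip(audio_frames, images)):
        if frame_rms(frame) >= rms_threshold:
            # The audio acts purely as a trigger; the image is processed
            # separately and does not feed back into the acoustic analysis.
            results.append((index, process_image(image)))
    return results
```

In this sketch the audio only decides when the image step runs; the two modalities are not analyzed jointly.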
However, the methods described in PTL 1 and PTL 2 are not intended to analyze the action of a crowd (hereafter referred to as “crowd action”). A crowd here is a collection of individuals subjected to action analysis. One method that is intended to analyze crowd action determines, by acoustic analysis and image analysis, whether an event involves a single person or a group of people and what the event is (a fight, a crime, etc.) (for example, see PTL 3).