Applications that recognize the condition and action of an object person from the expression of the face portion of the object person, and that make use of the recognition result, are known. For example, as a function of a car navigation system (hereinafter referred to as a CNS for short) mounted in a vehicle, there is a voice operation function with which the CNS is operated, for example to specify a destination, by speaking into a microphone, etc. provided in the vehicle. The contents of the voice input through the microphone are recognized by speech recognition. When a driver inputs a destination by voice (for example, the name of a place, the name of a facility, etc.), a word indicating the destination is recognized by speech recognition, and a route to the place indicated by the recognized word is retrieved, information is displayed, etc. However, when the voice operation function is used, there is the problem that unnecessary sound such as the conversation of a passenger other than the driver, the music from a car stereo, road noise, the sound of wind, the sound of the engine, etc. is also input to the microphone, thereby considerably reducing the accuracy of speech recognition. Technologies for solving this problem include the speech recognition device of JP11-352987A (hereinafter referred to as Patent Document 1) and the image recognition device of JP11-219421A (hereinafter referred to as Patent Document 2).
The speech recognition device of the Patent Document 1 captures an image of a speaker using a camera, processes the captured image by an image processing ECU, and classifies the presence/absence of speech from the state of the appearance of the speaker. For example, the presence/absence of speech is classified from external conditions such as the face direction, the movement of lips, the gaze direction, etc. In processing the captured image to detect the face direction, the movement of lips, and the gaze direction, a pattern matching method is used. When it is classified that the speaker is speaking, speech recognition is performed, thereby improving the recognition accuracy. The template matching method, which is one of the pattern matching methods, realizes face detection and detection of other portions by preparing, as a template, a representative image pattern or an average image pattern of the face or other portion to be detected, and searching the entire image for the image area closest to the template image.
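The template matching search described above can be illustrated with a minimal sketch. It is not the implementation of the Patent Document 1: the function name `match_template`, the use of a sum of squared differences as the measure of closeness, and the toy pixel values are all assumptions made for illustration.

```python
def match_template(image, template):
    """Exhaustively search `image` for the window closest to `template`.

    Closeness is measured by the sum of squared differences (SSD); this
    measure is an assumption, since the source does not name one.
    Returns ((row, col), score) for the best-matching window.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best_pos, best_score = None, float("inf")
    # Every possible top-left corner of the window is one search point.
    for r in range(img_h - tpl_h + 1):
        for c in range(img_w - tpl_w + 1):
            score = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(tpl_h)
                for j in range(tpl_w)
            )
            if score < best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score


# Hypothetical 4x5 grayscale image containing one 2x2 bright patch.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [[9, 8], [7, 9]]
position, score = match_template(image, template)
```

In this sketch every window position in the image is scored, so the number of search points grows with the size of the area that is searched.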
Additionally, the image recognition device of the Patent Document 2 includes: an image acquisition unit for acquiring a distance image stream for a target object; an oral cavity portion extraction unit for extracting an oral cavity portion from the distance image stream acquired by the image acquisition unit; and an image recognition unit for recognizing at least one of the shape of lips and the movement of the lips based on the distance image stream of the oral cavity portion extracted by the oral cavity portion extraction unit. In extracting the oral cavity portion, the template matching method, etc. is used as in the speech recognition device according to the Patent Document 1. Furthermore, in the image recognition unit, templates of the shape image of the oral cavity portion corresponding to pronunciations such as “a”, “i”, etc. are prepared, and the contents of the speech are recognized by performing a matching operation between the templates and an image of the extracted oral cavity portion.
Furthermore, as technologies for capturing an image of the face of an object person, processing the captured image, and detecting whether or not a driver is awake, there are a driving state detection device of JP8-175218A (hereinafter referred to as Patent Document 3), a sleeping state detection device of JP10-275212A (hereinafter referred to as Patent Document 4), and an anti-drowsy-driving device of JP2000-40148A (hereinafter referred to as Patent Document 5).
The driving state detection device of the Patent Document 3 performs a relative correlation operation using an object template on a captured image, detects the eye area of the driver, and classifies the driving state of the driver from the image of the detected eye area.
The sleeping state detection device of the Patent Document 4 detects the density of the pixels along the rows of pixels of a face image, determines each pixel having a locally high value of the density in the pixel string and defines it as an extraction point, couples the extraction points in adjacent pixel strings that are close to each other in the pixel string direction, and detects the position of the eyes from the resulting curve group extending in the horizontal direction of the face. It then detects the position of the eyes in a predetermined area including the eyes, classifies the arousal state of the eyes in the predetermined area, and detects the drowsy state from the change in the open/closed state.
The anti-drowsy-driving device of the Patent Document 5 sequentially acquires pictures including the eyes of a driver of a vehicle as moving pictures by a video camera, calculates the area of the region in which the brightness has changed between the latest picture and the previous picture stored in frame memory, and performs a correlation operation for obtaining coefficients of correlation between a standard blink waveform and a time-serial pattern of the difference between the area in which the brightness increases and the area in which the brightness decreases. When the correlation coefficient exceeds a reference value, the blink time point is extracted, and the arousal state of the driver is classified based on the extracted blinks.
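The blink extraction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method of the Patent Document 5: the function names, the `min_change` brightness step, the `reference` value of 0.9, the choice of the Pearson coefficient as the correlation measure, and the shape of the standard blink waveform are all hypothetical.

```python
import math


def signed_area_difference(prev_frame, curr_frame, min_change=10):
    """Pixels that brightened minus pixels that darkened between two frames."""
    brighter = darker = 0
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        for p, c in zip(prev_row, curr_row):
            if c - p >= min_change:
                brighter += 1
            elif p - c >= min_change:
                darker += 1
    return brighter - darker


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def find_blinks(series, blink_waveform, reference=0.9):
    """Start indices where the series correlates with the blink waveform."""
    width = len(blink_waveform)
    return [
        start
        for start in range(len(series) - width + 1)
        if pearson(series[start:start + width], blink_waveform) > reference
    ]


# Hypothetical frame pair: one pixel brightens, two pixels darken.
prev = [[100, 100], [100, 100]]
curr = [[120, 100], [80, 80]]
difference = signed_area_difference(prev, curr)

# Hypothetical standard blink waveform and a series containing one blink.
blink_waveform = [0, -4, -2, 2, 4, 0]
series = [0, 0, 0, 0, -4, -2, 2, 4, 0, 0, 0, 0]
blink_starts = find_blinks(series, blink_waveform)
```

In practice the series would be built by calling `signed_area_difference` on each consecutive pair of frames; the hand-written series here only keeps the example short.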
However, in the conventional technology according to the above-mentioned Patent Documents 1 and 2, a template matching method is used in detecting the lip portion from the image captured by a fixed camera. Therefore, for example, when the lip portion is to be detected from an image of the face in a side view or a diagonal side view caused by a change in the direction of the face during driving, the detection accuracy may be extremely lowered depending on the contents of the prepared template. Furthermore, since the lip portion is searched for in the image of the entire face, there are a large number of search points, thereby lowering the processing speed.
Additionally, in the image recognition device of the Patent Document 2, the size, etc. of the oral region when the mouth is open is classified using a threshold, and the speech section is detected. Therefore, for example, it is difficult to classify the behavior content from an obscure image, such as making a distinction between a yawn and speech.
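The limitation of a threshold on the size of the oral region can be seen in a small sketch. The function name `open_mouth_sections`, the threshold value, and the two traces of mouth-opening size are hypothetical and serve only to illustrate the point; they are not taken from the Patent Document 2.

```python
def open_mouth_sections(mouth_sizes, threshold):
    """Return (first_frame, last_frame) runs where the size exceeds threshold."""
    sections = []
    start = None
    for idx, size in enumerate(mouth_sizes):
        if size > threshold and start is None:
            start = idx  # a run of open-mouth frames begins
        elif size <= threshold and start is not None:
            sections.append((start, idx - 1))  # the run ended on the previous frame
            start = None
    if start is not None:
        sections.append((start, len(mouth_sizes) - 1))
    return sections


# Hypothetical per-frame mouth-opening sizes (arbitrary units).
speech_trace = [1, 6, 7, 6, 8, 1]
yawn_trace = [1, 6, 9, 9, 7, 1]

speech_sections = open_mouth_sections(speech_trace, threshold=5)
yawn_sections = open_mouth_sections(yawn_trace, threshold=5)
```

Both traces yield the same detected section, so a decision based only on whether the size exceeds the threshold cannot tell the yawn from the speech.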
Furthermore, in the conventional technologies of the Patent Documents 3 to 5, the frequency of blinks in a predetermined time period, the accumulation value of open and close times of the blinks in a predetermined time period, etc. are used in classifying the arousal state. With this configuration, the arousal state cannot be classified taking into account the information about the aperture, duration time, speed, etc. of each blink, which is considered to be effective in classifying the arousal state from a physiological viewpoint.
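As a rough illustration of why such aggregate values lose per-blink information, the following sketch computes a blink frequency and an accumulated closure time over a window. The function name, the representation of a blink as a `(start_ms, end_ms)` pair, and the sample values are assumptions made for this example only.

```python
def aggregate_blink_features(blinks_ms, window_ms):
    """Blink frequency and accumulated closure time over one time window.

    `blinks_ms` is a list of (start_ms, end_ms) pairs, one per blink.
    """
    count = len(blinks_ms)
    closed_ms = sum(end - start for start, end in blinks_ms)
    return {
        "blinks_per_minute": count * 60000 / window_ms,
        "closed_ms": closed_ms,
    }


# Hypothetical one-minute windows: three equal blinks versus one long
# blink followed by two short ones.
steady = [(1000, 1200), (20000, 20200), (40000, 40200)]
uneven = [(1000, 1400), (20000, 20100), (40000, 40100)]

steady_features = aggregate_blink_features(steady, window_ms=60000)
uneven_features = aggregate_blink_features(uneven, window_ms=60000)

# Per-blink durations, which the aggregate values above do not preserve.
steady_durations = [end - start for start, end in steady]
uneven_durations = [end - start for start, end in uneven]
```

The two windows produce identical aggregate values even though the individual blink durations differ, which is the kind of per-blink information the aggregate approach cannot use.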
Thus, the present invention has been developed to solve the above-mentioned problems not yet solved by the conventional technology, and the object of the present invention is to provide a behavior content classification device, a speech content classification device, a car navigation system, an alarm system, a behavior content classification program, and a behavior content classification method that are suitable for classifying the behavior contents of an object person from a captured image including the face of the object person.