1. Field of the Invention
The present invention relates to a method of detecting an object in an image, and more particularly to a method of accurately detecting an object under a cluttered background.
2. Description of the Background Art
Object detection is a well-known task in computer vision that addresses the problem of finding an object in an image. In recent methods, the information involved with an object is represented by a collection of local features. A local feature is a feature describing a small region of the image. The amount and quality of the local features obtained from an image vary greatly depending upon the algorithm used.
A common problem of methods of detecting an object is how to deal with local features extracted from a background. The background means a part of the image that does not belong to the object. The background may be the same for different objects. Therefore, if the local features extracted from the background are used without being distinguished from the local features extracted from the object, no reliable detection is possible. Accordingly, a system that does not use the local features extracted from the background is demanded.
To solve this problem, many approaches employ a supervised learning system. Specifically, the three different approaches described below have been proposed.
The first solution is to learn the local features of an object from images that do not contain any background at all. Such an image can be obtained by segmenting the object from the background, for example. It is also possible to take images of the object against a background of a single color, and to exclude this color when extracting information (chroma keying). These methods take much time and labor. Therefore, this solution can only be chosen if the number of objects is small.
The second solution is to learn the background by taking images that do not show the object. With this method, one would know which local features are extracted from the background and which are extracted from the object. However, this solution has the following problem. For example, in an image showing a person, a bookshelf could be regarded as background. If one also wants to detect the bookshelf at the same time, the bookshelf can no longer be treated as background. In other words, the definition of the background depends upon what the object to be detected is. Therefore, this solution would not be correct.
The third solution is to use many images showing the object in front of different backgrounds. Only the local features that can be extracted from all the images are assumed to belong to the object. The problem with this solution is that the number of images needed for this approach is high, and they must all be provided by the user.
Background clutter becomes a problem because all local features extracted from an image are treated equally, regardless of whether they are extracted from the background or from the object. One method for solving this problem is to manually distinguish the respective local features. However, this method is unrealistic in view of the number of local features involved. Another method is to use an image with no background so that the object can be extracted easily. There is also a method of using images that show a single object against various backgrounds.
Another problem arises from the Bag-of-Features strategy that is often used for object recognition and object detection. In this strategy, clustering is applied to the local features to obtain representative vectors (referred to as visual words), and one vector (feature) representing the object is generated from the obtained representative vectors. During the clustering, the local features are held only as a collection of individual pieces of information without any correlation among them. A stable feature that is effective for identification can be acquired by the Bag-of-Features strategy. On the other hand, much information, such as the position, size, and direction of each local feature, is lost.
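For reference, the clustering and quantization steps of the Bag-of-Features strategy described above can be sketched as follows. This is a minimal, illustrative example only, not an implementation from any document cited herein; the two-dimensional toy "descriptors", the simple k-means routine, and the function names are all hypothetical.

```python
import random
from collections import Counter

def kmeans(descriptors, k, iters=20, seed=0):
    """Illustrative k-means: cluster descriptors into k visual words."""
    rng = random.Random(seed)
    centers = rng.sample(descriptors, k)
    for _ in range(iters):
        # assign each descriptor to its nearest center
        groups = [[] for _ in range(k)]
        for d in descriptors:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(d, centers[i])))
            groups[j].append(d)
        # recompute each center as the mean of its assigned descriptors
        for j, g in enumerate(groups):
            if g:
                centers[j] = tuple(sum(c) / len(g) for c in zip(*g))
    return centers

def bag_of_features(descriptors, centers):
    """Quantize each descriptor to its nearest visual word; return a histogram.

    Note the loss described above: only counts per word survive, while the
    position, size, and direction of each local feature are discarded.
    """
    k = len(centers)
    votes = Counter(
        min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(d, centers[i])))
        for d in descriptors
    )
    return [votes[i] for i in range(k)]

# toy 2-D "descriptors" drawn from two well-separated clusters
feats = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9)]
words = kmeans(feats, k=2)
hist = bag_of_features(feats, words)  # two descriptors fall on each word
```

In practice the descriptors would be high-dimensional vectors (e.g., 128-dimensional SIFT descriptors) and the vocabulary far larger, but the principle, quantization to nearest visual word followed by counting, is the same.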
Two models address this problem: the constellation model and the implicit shape model. Both are methods for compensating for the information involved with the location of each local feature, which is lost in the clustering process. The two models differ in how they treat this information. Each model is described below.
<Constellation Model>
In the constellation model, the local features extracted from an object are stored as positions in a two-dimensional probability space. In short, the constellation model is effective when the locations of the main visual words do not differ much between a query image indicating an object to be detected and an image that should be detected using the query image. In the constellation model, only a few local features (normally around five) are used to create the model in order to keep the resultant graph (a graph whose nodes indicate the local features and whose edges indicate the locational configuration of the local features) computable.
The model can also be made scale invariant through normalization with respect to one local feature.
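The normalization mentioned above can be sketched as follows. This is an illustrative example only, not taken from any cited document; the function name and the toy coordinates are hypothetical. The idea is to express all feature locations relative to one reference local feature and divide by the distance to a second feature, so that the same configuration at any scale normalizes identically.

```python
import math

def normalize_configuration(points, ref=0):
    """Make a set of feature locations scale invariant by expressing them
    relative to one reference feature (removing translation) and dividing
    by the distance to a second feature (removing scale)."""
    rx, ry = points[ref]
    # use the first non-reference point to fix the scale
    ox, oy = next(p for i, p in enumerate(points) if i != ref)
    scale = math.hypot(ox - rx, oy - ry)
    return [((x - rx) / scale, (y - ry) / scale) for x, y in points]

# the same configuration at two different positions and scales
small = [(0, 0), (2, 0), (1, 1)]
large = [(10, 10), (14, 10), (12, 12)]  # shifted and doubled
# both normalize to the identical configuration
```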
As Fergus et al. have already pointed out, a weak point of this approach is its strong dependence on the feature detector (see, for example, Non-Patent Document 1). If the detector fails to find features defined over large regions of the image (e.g., the complete wheels of a bicycle), the results are no longer useful. Another weak point is that this model is not rotation invariant and cannot address viewpoint changes.
<Implicit Shape Model>
Leibe et al. have proposed the implicit shape model (see, for example, Non-Patent Document 2). In this model, a shape is not represented by the relative locations of the visual words to one another. Instead, for each visual word, its position relative to a predefined centroid point is used. During detection, the visual words extracted from the query image are compared to the visual words in an image database, and possible centroid positions are proposed. This can be regarded as voting for a possible object. These votes are agglomerated to find a possible object. Non-Patent Document 3 proposes a method of making this model scale invariant. The model is flexible enough to address the problem of high intra-class variation of the object. This is achieved by sharing the local features of the object learned from different images.
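The voting scheme described above can be sketched as follows. This is a simplified, illustrative example only, not an implementation of the cited model; the visual word names, the stored offsets, and the coarse grid binning used to agglomerate the votes are all hypothetical.

```python
from collections import Counter

# Hypothetical training data: for each visual word, the offsets from the
# feature position to the object centroid observed during training.
word_offsets = {
    "wheel": [(0, -30)],  # wheels were observed 30 px below the centroid
    "seat":  [(0, 20)],   # the seat was observed 20 px above the centroid
}

def vote_for_centroid(detections, word_offsets, bin_size=10):
    """Each detected visual word casts votes for possible centroid positions.

    detections: list of (visual_word, (x, y)) found in the query image.
    Votes are binned on a coarse grid so that nearby votes agglomerate;
    a bin with many votes suggests a possible object centroid there.
    """
    votes = Counter()
    for word, (x, y) in detections:
        for dx, dy in word_offsets.get(word, []):
            cx, cy = x + dx, y + dy
            votes[(cx // bin_size, cy // bin_size)] += 1
    return votes

# two "wheel" features and one "seat" feature in the query image
detections = [("wheel", (58, 130)), ("wheel", (62, 130)), ("seat", (60, 80))]
votes = vote_for_centroid(detections, word_offsets)
best_bin, count = votes.most_common(1)[0]  # grid cell with the most votes
```

Because each visual word stores only its own offset to the centroid rather than its relations to all other words, local features learned from different training images can be shared freely, which is what gives the model its tolerance to intra-class variation.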