1. Field of the Invention
The present invention relates to a method and an apparatus for generating an image description vector, as well as an image detection method and an image detection apparatus.
2. Description of the Related Art
Great progress has been made in technology for detecting specific objects or targets (such as people, faces, cars, etc.) in the last few decades. In order to describe the morphology of an image, discriminative features or patterns can be extracted from the image to form an image descriptor (image description vector). In some techniques, a training process which uses a large quantity of samples is necessary. For more general or training-free object detection, however, an effective and robust feature descriptor (description vector) is very important.
The traditional image regional description method uses a feature vector which represents the global distribution of a feature of the image without any spatial or structure related information; thus it has limited discriminative power and is liable to produce false positives.
Recently, image partition based methods have been introduced. Generally, the image partition based methods firstly partition the whole image into pre-defined regions; then generate a feature vector for each of the partitioned image regions independently; and finally combine all the generated feature vectors into a single descriptor for the whole image. Therefore, the single descriptor for the whole image integrates spatial or structure related information of the image and will be more powerful to represent the image.
In particular, two kinds of image partition based methods have been widely used. One image partition based method is called the grid based image partition method, as shown in FIG. 1. The grid based image partition method firstly divides the image into multiple regular grid regions, for example, square grid regions, then extracts a feature vector in each grid region, and finally assembles all the grid region feature vectors sequentially into a single feature vector to obtain the descriptor for the image.
The single descriptor for the whole image is commonly illustrated in the form of a histogram as shown in FIG. 1, in which the abscissa axis indicates the bins of the color components in a predetermined color space, and the vertical axis indicates the appearance frequency of the pixels corresponding to each color component, which usually can be represented by the number of the pixels.
However, the grid based image partition method is sensitive to in-plane rotation. For example, when the image is rotated by 90°, the image content falling into each grid region changes, and the histogram of the descriptor will be much different. That is, the descriptor generated by the grid based image partition method is not robust against in-plane rotation.
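The grid based procedure and its sensitivity to in-plane rotation can be illustrated by the following minimal sketch. It is not taken from any cited document; the function names (histogram, grid_descriptor, rotate90), the row-major order in which the grid regions are assembled, and the toy 4×4 image of already quantized color indices are all assumptions made for illustration.

```python
from typing import Iterable, List

Image = List[List[int]]  # 2-D array of already quantized color indices (assumption)


def histogram(pixels: Iterable[int], n_bins: int) -> List[int]:
    """Count how many pixels fall into each color bin."""
    hist = [0] * n_bins
    for p in pixels:
        hist[p] += 1
    return hist


def grid_descriptor(img: Image, rows: int, cols: int, n_bins: int) -> List[int]:
    """Concatenate the per-region histograms in row-major region order."""
    h, w = len(img), len(img[0])
    desc: List[int] = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            desc.extend(histogram(cell, n_bins))
    return desc


def rotate90(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


img = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
d0 = grid_descriptor(img, 2, 2, 4)
d90 = grid_descriptor(rotate90(img), 2, 2, 4)
print(d0)   # [4, 0, 0, 0, 0, 4, 0, 0, 0, 0, 4, 0, 0, 0, 0, 4]
print(d90)  # [0, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 4, 0, 4, 0, 0]
```

In this toy case the two descriptors contain the same per-region histograms, but in a different order, so a direct element-wise comparison of the descriptors treats the rotated image as a different image.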
Another image partition based method is called the circular based image partition method, as shown in FIG. 2. The circular based image partition method divides the image into multiple concentric ring regions, and the center of the multiple concentric ring regions is the image center. Thus the descriptor generated by the circular based image partition method is robust against in-plane rotation. However, the circular based image partition method does not consider the characteristics of the image, which limits its robustness. Furthermore, the method is sensitive to out-of-plane rotation and deformation.
As seen from the above description, although the image partition based method can generate an image descriptor which includes spatial or structure related information on the image and may represent the image somewhat powerfully, the image partition based method is sensitive to rotation of the image, such as in-plane or out-of-plane rotation, and thus the robustness of the descriptor generated by the image partition based method is limited.
Currently, pixel classification based methods have been proposed to overcome the drawbacks of the image partition based methods and to achieve a more robust descriptor. The pixel classification based methods firstly classify the pixels included in the image into several categories, then generate feature vectors for the categories, and finally combine all the feature vectors into a single descriptor.
Two kinds of pixel classification based color feature description methods have been introduced and widely used in content-based image retrieval.
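The circular based partition can be sketched as follows. This is only an illustrative sketch under stated assumptions: the rings are taken to have equal radial width, the image center is taken as the geometric center of the pixel array, and the names ring_descriptor and rotate90 are hypothetical.

```python
import math
from typing import List

Image = List[List[int]]  # 2-D array of already quantized color indices (assumption)


def ring_descriptor(img: Image, n_rings: int, n_bins: int) -> List[int]:
    """Concatenate one color histogram per concentric ring, innermost ring first."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_r = math.hypot(cy, cx) or 1.0  # distance from center to a corner pixel
    hists = [[0] * n_bins for _ in range(n_rings)]
    for y in range(h):
        for x in range(w):
            r = math.hypot(y - cy, x - cx)
            # Equal-width rings; the outermost radius is clamped into the last ring.
            ring = min(int(r / max_r * n_rings), n_rings - 1)
            hists[ring][img[y][x]] += 1
    return [count for hist in hists for count in hist]


def rotate90(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


img = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
ring_desc = ring_descriptor(img, 2, 4)
ring_desc_rot = ring_descriptor(rotate90(img), 2, 4)
print(ring_desc)      # [1, 1, 1, 1, 3, 3, 3, 3]
print(ring_desc_rot)  # [1, 1, 1, 1, 3, 3, 3, 3]
```

Because a rotation about the image center keeps every pixel at the same distance from the center, each pixel stays in the same ring and the descriptor is unchanged. The ring boundaries, however, are fixed in advance and do not depend on the image content.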
One pixel classification based method is called the BIC (Border/Interior Color) method as cited in Document 3. The BIC method first quantizes pixels in a specified color space with a predefined quantization schema, and then classifies the pixels as border or interior pixels. A pixel is classified as a border pixel when it is at the border of the image itself (e.g. pixels at the outer borders of the image shown in FIG. 3) or when at least one of its 4 neighbors has a different quantization color (e.g. pixels at the quantized borders); a pixel is classified as an interior pixel when all its 4 neighbors have the same quantization color. Finally, a descriptor constituted by two color histograms (shown in FIG. 3) is generated, the two color histograms representing the statistical color distribution for the interior pixels and the border pixels separately.
However, although the robustness of the descriptor generated by the BIC method is somewhat improved, the descriptor is sensitive to illumination variation and noise, since the image border is sensitive to illumination variation and noise. Furthermore, since the quantized border depends on the color quantization schema, the classification process is coupled with color features, and the descriptor is therefore also feature dependent.
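The border/interior classification described above can be sketched as follows. The sketch assumes that color quantization has already been performed, so that the input is a 2-D array of quantized color indices; the function name bic_descriptor and the toy 5×5 image are hypothetical and are not taken from Document 3.

```python
from typing import List, Tuple

Image = List[List[int]]  # 2-D array of already quantized color indices (assumption)


def bic_descriptor(q: Image, n_bins: int) -> Tuple[List[int], List[int]]:
    """Return (border histogram, interior histogram) of a quantized image."""
    h, w = len(q), len(q[0])
    border = [0] * n_bins
    interior = [0] * n_bins
    for y in range(h):
        for x in range(w):
            c = q[y][x]
            # Pixels on the outer border of the image are always border pixels.
            on_image_border = y in (0, h - 1) or x in (0, w - 1)
            neighbors = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            # The neighbor test is only evaluated for pixels not on the image border.
            if on_image_border or any(q[ny][nx] != c for ny, nx in neighbors):
                border[c] += 1
            else:
                interior[c] += 1
    return border, interior


# Toy image: three columns of color 0 followed by two columns of color 1.
q = [[0, 0, 0, 1, 1] for _ in range(5)]
bic_border, bic_interior = bic_descriptor(q, 2)
print(bic_border)    # [12, 10]
print(bic_interior)  # [3, 0]
```

In the toy image only the three pixels of color 0 in the second column (excluding the top and bottom rows) have four neighbors of the same quantized color, so they are the only interior pixels.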
Another pixel classification based method is called the CCV (Color Coherence Vector) method as cited in Document 4. The CCV classification is based on the size of connected components, for which a pre-defined empirical size threshold is introduced. When a pixel is part of a contiguous region whose size is bigger than the pre-defined threshold, the pixel is classified as a coherent pixel (e.g. red (dark grey) and green (light grey) pixels in FIG. 4); otherwise, the pixel is classified as an incoherent pixel (e.g. blue (medium grey) pixels in FIG. 4).
Based on the above, the CCV method is effective only when at least one color of the image is full of texture or constitutes small scattered patches; otherwise, it reduces to the simple global color histogram (GCH). Furthermore, the descriptor generated by the method has no spatial or topology information, and thus it is only a little more effective than GCH.
As seen from the above description, although the pixel classification based methods are robust with respect to in-plane or out-of-plane rotation, such methods may be sensitive to illumination variance and noise. Furthermore, the BIC method and the CCV method both perform quantization before pixel classification, and thus the results generated by the two methods depend on the color quantization schema, which couples the classification process of the two methods with color features. Therefore, the descriptor generated by a pixel classification based method is further limited to color features of the image, that is, it is feature dependent.
In view of the above, few prior techniques can obtain a descriptor for an image which is robust against illumination variance, view-point change, non-rigid deformation, etc. and is also feature independent.
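The coherent/incoherent classification can be sketched as follows. The sketch assumes an already quantized input image and, as a further assumption, uses 4-connectivity to define contiguous regions; the connectivity actually used in Document 4 may differ. The function name ccv_descriptor, the parameter name tau for the size threshold and the toy image are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Image = List[List[int]]  # 2-D array of already quantized color indices (assumption)


def ccv_descriptor(q: Image, n_bins: int, tau: int) -> Tuple[List[int], List[int]]:
    """Return (coherent histogram, incoherent histogram) of a quantized image."""
    h, w = len(q), len(q[0])
    seen = [[False] * w for _ in range(h)]
    coherent = [0] * n_bins
    incoherent = [0] * n_bins
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            # Breadth-first search over the same-color region containing (y, x).
            color, size = q[y][x], 0
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and q[ny][nx] == color:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Regions bigger than the threshold contribute coherent pixels.
            if size > tau:
                coherent[color] += size
            else:
                incoherent[color] += size
    return coherent, incoherent


q = [
    [0, 0, 0, 1],
    [0, 0, 0, 2],
    [0, 0, 0, 1],
    [1, 2, 1, 2],
]
ccv_coherent, ccv_incoherent = ccv_descriptor(q, 3, tau=3)
print(ccv_coherent)    # [9, 0, 0]
print(ccv_incoherent)  # [0, 4, 3]
```

If every region in an image exceeds the threshold, the incoherent histogram is all zeros and the coherent histogram is identical to the global color histogram.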
Recently, Local Binary Pattern (LBP) descriptors and Local Ternary Pattern (LTP) descriptors have been proposed as powerful grey-scale invariant local texture descriptors for describing microstructures of images (please see, for example, T. Ojala, M. Pietikainen and T. Maenpaa, “Multi-resolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 2002, and Xiaoyang Tan and Bill Triggs, “Enhanced Local Texture Feature Sets for Face Recognition Under Difficult Lighting Conditions”, IEEE Transactions on Image Processing, pp. 1635-1650, 19(6), 2010). These two patterns (image descriptors) are widely used in the field of face recognition and have achieved great success.
Now the LBP descriptor and the LTP descriptor will be described briefly with reference to FIGS. 5 and 6.
FIG. 5 is a schematic diagram showing the principle of the LBP descriptor.
As shown in FIG. 5, the LBP method encodes each pixel in an image into an 8-bit binary code. More specifically, for a 3×3 matrix of pixels, if a neighbouring pixel has a pixel value larger than or equal to that of the centre pixel, a bit representing this neighbouring pixel in the 8-bit binary code is set to “1”, and if a neighbouring pixel has a pixel value smaller than that of the centre pixel, a bit representing this neighbouring pixel in the 8-bit binary code is set to “0”. In this way, the 8-bit binary code for the centre pixel is formed by thresholding the eight neighbouring pixels with respect to the pixel value of the centre pixel. In FIG. 5, white dots indicate binary bits “1” and black dots indicate binary bits “0”. The LBP feature can describe texture structures around the encoded pixel (centre pixel).
However, the single threshold and the comparison between only two pixels make the LBP method very sensitive to noise, and the reliability will decrease significantly under intense illumination. In addition, the encoding schema limits the LBP feature to representing only a small set of texture structures, such as lighter or darker edges and dots. Furthermore, the structures represented by the LBP features merely capture the features surrounding the pixel, while the feature of that pixel itself is lost.
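The LBP encoding of a single 3×3 neighbourhood can be sketched as follows. The order in which the eight neighbouring pixels are mapped to bits (clockwise from the top-left neighbour, most significant bit first) is an assumption made for illustration, as are the names OFFSETS and lbp_code and the example pixel values.

```python
from typing import List

# Neighbour offsets (dy, dx), clockwise from the top-left neighbour.
# The bit order is an assumption; any fixed order yields a valid code.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(patch: List[List[int]]) -> int:
    """Encode the centre pixel of a 3x3 patch as an 8-bit LBP code."""
    centre = patch[1][1]
    code = 0
    for dy, dx in OFFSETS:
        # Bit is 1 when the neighbour is larger than or equal to the centre pixel.
        bit = 1 if patch[1 + dy][1 + dx] >= centre else 0
        code = (code << 1) | bit
    return code


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
lbp = lbp_code(patch)
print(format(lbp, "08b"))  # 10001111
```

Note that the pixel value of the centre pixel serves only as the threshold and does not itself appear in the resulting code.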
FIG. 6 is a schematic diagram showing the principle of the LTP descriptor.
As shown in FIG. 6, the LTP method encodes each pixel in an image into an 8-bit ternary code. More specifically, for a 3×3 matrix of pixels, if a neighbouring pixel has a pixel value larger than an upper threshold, a bit representing this neighbouring pixel in the 8-bit ternary code is set to “1”; if a neighbouring pixel has a pixel value not larger than the upper threshold and not smaller than a lower threshold, the bit representing this neighbouring pixel is set to “0”; and if a neighbouring pixel has a pixel value smaller than the lower threshold, the bit representing this neighbouring pixel is set to “−1”. The upper threshold can be set as (centre pixel value +T), and the lower threshold can be set as (centre pixel value −T), where T is a constant margin which can be set as appropriate. In this way, the 8-bit ternary code for the centre pixel is formed by double-thresholding the eight neighbouring pixels with respect to the pixel value of the centre pixel. In FIG. 6, white dots indicate ternary bits “1”, black dots indicate ternary bits “−1” and grey dots indicate ternary bits “0”.
By using double-thresholding, the LTP feature can describe texture structures around the encoded pixel (centre pixel) with improved robustness, and can preserve more detailed structure of the image than the LBP feature.
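The LTP encoding of a single 3×3 neighbourhood can be sketched in the same manner. As before, the order of the neighbouring pixels (clockwise from the top-left neighbour) is an assumption, and the names OFFSETS and ltp_code, the example pixel values and the margin T = 2 are chosen only for illustration.

```python
from typing import List

# Neighbour offsets (dy, dx), clockwise from the top-left neighbour (assumed order).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def ltp_code(patch: List[List[int]], t: int) -> List[int]:
    """Encode the centre pixel of a 3x3 patch as eight ternary values."""
    centre = patch[1][1]
    upper, lower = centre + t, centre - t
    code = []
    for dy, dx in OFFSETS:
        value = patch[1 + dy][1 + dx]
        if value > upper:
            code.append(1)    # larger than the upper threshold
        elif value < lower:
            code.append(-1)   # smaller than the lower threshold
        else:
            code.append(0)    # within the margin around the centre pixel value
    return code


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
ltp = ltp_code(patch, t=2)
print(ltp)  # [0, 0, -1, -1, 0, 0, 1, 0]
```

With T = 2, neighbouring pixel values from 4 to 8 are all encoded as “0”, so small fluctuations around the centre pixel value of 6 do not change the code, whereas in the LBP encoding any such fluctuation across the centre pixel value flips a bit.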