1. Field of the Invention
This invention relates to image processing, and more particularly, to video object tracking and storage and retrieval.
2. Background Art
An image indexing system can be useful in situations that require automatic visual monitoring. A vision system can take a snapshot of each object that occupies the field of view and collect an appropriate set of features for each object, in addition to storing the video on videotape or digitally on hard drives or rewritable DVDs. At the end of the monitoring period, an investigator can use image indexing algorithms to search for particular occurrences of an object. An image indexing system allows the investigator to obtain results without having to watch the entire videotape and without having to sequentially view each image in the database.
Image indexing is important because it allows people to quickly retrieve interesting images from large databases. Image databases have become common with the growth of the Internet, advances in video technology, and efficient digital video capture and storage. The technology allows people to automatically monitor scenes and store visual records in databases. In many cases it is too time consuming for a human to view each image in the database; therefore, some computer assistance is necessary to access the images in an efficient manner.
A quality image indexing system reduces the total number of images a user must view before the user sees images of interest. This reduces the amount of time a user must spend sorting through extraneous images. Image indexing algorithms must determine which objects in the database are similar. The algorithm must be robust to varying lighting conditions, varying image resolution, viewing the object from various viewpoints, and distractions present in the object's background. The image indexing technique must also be computationally simple to decrease the time it takes to return information to the user.
The Query By Image Content (QBIC) system allows users to query an image database for scenes, objects or a combination of scenes and objects. Some features used by the system include colors, textures, shapes, and edges. The system is capable of computing similarity using multiple features in combination. The similarity metrics used by QBIC are mostly Euclidean based. The QBIC system does not automatically generate features over the entire video sequence of a particular object. Instead, QBIC relies on features of an object in isolated images. QBIC is capable of two types of image queries: query by object and query by scene. A scene is defined to be a color image or single frame of video. An object is defined to be any part of a scene. Each scene has zero or more objects within it. Also, the objects contained in a scene are identified semi-automatically or manually. To segment an object from an image a user outlines the object using QBIC's user interface tools. Object features include average color, color histogram, texture, shape, and location within the scene. The features calculated for scenes include average color, color histogram, texture, positional edges, and positional color. A QBIC query can be for an object, a scene, or a combination of objects and scenes.
An object's center of mass in image coordinates is the object's location. Location is normalized by the width and height of the image to account for varying image sizes. To calculate shape features QBIC assumes that a binary mask sufficiently represents shapes, and that the shapes are non-occluded and planar. The shape is represented parametrically using heuristic shape features and moments, Hausdorff distance, parametric curves represented by the curves' spline control points, first and second derivatives of the parametric curves, and turning angles along the object's perimeter.
Texture features include contrast, coarseness, and directionality features. Contrast measures the range of lightness and darkness within a texture pattern. Coarseness measures the relative scale of a texture pattern. The directionality specifies the average direction of the pattern. To quantify color, QBIC calculates a k-element color histogram and uses a three-element vector of average Munsell color coordinates. The color histogram is usually quantized to 64 or 256 bins. The system also allows users to retrieve images based on features extracted from a rough sketch drawn by the user. The sketch is represented by a lower resolution edge map and stored in a 64×64×1 bit array.
Most QBIC features are calculated using weighted Euclidean distance metrics, where the weight of each Euclidean component is equal to the inverse variance of that component. To calculate similarity between two color histograms X and Y, QBIC uses a quadratic-form distance measure. The matrix S is a symmetric, positive definite color similarity matrix that transforms color differences such that distance is directly related to the perceptual difference between colors. This metric is
∥Z∥ = (Z^T S Z)^{1/2}, where Z = X − Y and S > 0.
The similarity measures for the parametric spline curves, and for the derivatives calculated from the spline control points, use Euclidean distance and quadratic forms with pre-computed terms. To calculate distance between turning angles, QBIC uses a dynamic programming algorithm.
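As a minimal illustration (not QBIC's actual implementation), the quadratic-form metric ∥Z∥ = (Z^T S Z)^{1/2} can be sketched in Python; the 3-bin histograms and the identity similarity matrix below are hypothetical stand-ins:

```python
import numpy as np

def quadratic_form_distance(x, y, s):
    """Distance ||Z|| = (Z^T S Z)^(1/2) between histograms x and y,
    where s is a symmetric, positive definite similarity matrix."""
    z = x - y
    return float(np.sqrt(z @ s @ z))

# Hypothetical 3-bin histograms; with the identity similarity matrix
# the metric reduces to plain Euclidean distance.
x = np.array([0.5, 0.3, 0.2])
y = np.array([0.2, 0.3, 0.5])
print(quadratic_form_distance(x, y, np.eye(3)))
```

A non-identity S would weight cross-bin terms, so perceptually close colors in different bins contribute less to the distance.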
To calculate the similarity between a user drawn sketch and a scene, QBIC has developed a matching algorithm that compares the drawn edges to the automatically extracted edges.
At the MIT Artificial Intelligence Laboratory, Stauffer developed an adaptive, visual tracking system to classify and monitor activities in a scene. As objects traverse the scene, Stauffer's system tracks these objects and records an object's location, speed, direction, and size. The system also stores an image of the object and a binary motion silhouette. The silhouette is obtained from difference imaging. Stauffer's method consists of three main parts: codebook generation, co-occurrence matrix definition, and hierarchical classification.
The first step develops a codebook of representations using Linear Vector Quantization (LVQ) on a large set of data collected by the tracker. A typical codebook size is 400. A codebook uses a set of prototypes to represent each input. After the codebook has been generated, each new input is mapped to the symbols defined in the codebook; the input is mapped to the symbols that are the shortest distance away from it. Large codebooks are needed to accurately represent complex inputs, and the technique fails if there are not enough symbols to represent measured differences. As the codebook size M increases, the number of data samples needed to generate working codebooks is on the order of M, and the data needed to accumulate co-occurrence statistics is on the order of M^2.
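The symbol assignment step described above can be sketched as nearest-prototype matching under Euclidean distance (an assumption here; the prototypes and inputs below are hypothetical):

```python
import numpy as np

def map_to_symbols(inputs, codebook):
    """Map each input vector to the index of its nearest codebook
    prototype (Euclidean distance), as in LVQ symbol assignment."""
    # Pairwise distances, shape (num_inputs, codebook_size).
    d = np.linalg.norm(inputs[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

# Hypothetical 2-D codebook with three prototypes.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
inputs = np.array([[0.1, 0.2], [0.9, 0.8]])
print(map_to_symbols(inputs, codebook))
```

A real codebook of size 400 would be learned from tracker data; the mapping step itself is unchanged.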
After the codebook has been generated, the system creates an M×M co-occurrence matrix. Assuming that there are N classes represented by the data, class c has some prior probability Π_c and some probability distribution p_c(). The distribution p_c() represents the probability that class c will produce each of the symbols of the prototype. The co-occurrence matrix, C, consists of elements C_{i,j}, such that C_{i,j} is equal to the probability that a pair of symbols {O_i, O_j} occurs in an equivalency set.
C_{i,j} = Σ_k Π_k (P_k(O_i) * P_k(O_j)),
where Π_k is the prior probability of class k, and P_k is the probability mass function (pmf) of class k.
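The co-occurrence computation above can be sketched directly; the two-class priors and pmfs below are hypothetical examples:

```python
import numpy as np

def cooccurrence_matrix(priors, pmfs):
    """C[i,j] = sum_k Pi_k * P_k(O_i) * P_k(O_j), for class priors
    Pi_k and per-class symbol pmfs P_k (the rows of `pmfs`)."""
    # Each class contributes the outer product of its pmf with itself,
    # weighted by its prior probability.
    return sum(pi * np.outer(p, p) for pi, p in zip(priors, pmfs))

priors = [0.6, 0.4]               # hypothetical two-class priors
pmfs = np.array([[0.7, 0.3],      # P_0 over two symbols
                 [0.2, 0.8]])     # P_1 over two symbols
C = cooccurrence_matrix(priors, pmfs)
print(C)
```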
Phase three of Stauffer's system is to separate the sample space into N distinct classes. The method successively splits the co-occurrence matrix into two new co-occurrence matrices, and the result is a full binary tree. The process uses the co-occurrence matrix, calculated in step two, to calculate two new probability mass functions that approximate the co-occurrence matrix. This process continues recursively down the binary tree. Given a co-occurrence matrix with elements C_{i,j}, the two new pmfs are iteratively estimated by minimizing the sum of squared error. Note that N=1 in the following equations (classes are indexed from zero), since the goal is to split the pmf into two distinct classes.
E = Σ_{u,v}^{N} (C_{u,v} − C^e_{u,v})^2,
where C^e_{i,j} = Σ_c^{N} Π_c (p_c(i) * p_c(j)),
Π_c = (1 − α_π) * Π_c + α_π * Σ_{i,j} Π_c (p_c(i) * p_c(j)),
p_c(i) = (1 − α_p) * p_c(i) + α_p * Σ_j Π_c (p_c(j)).
To calculate the co-occurrence matrices of the left and right children the following equations are used, respectively:
C^0_{i,j} = C_{i,j} * p_0(i) * p_0(j),
C^1_{i,j} = C_{i,j} * p_1(i) * p_1(j).
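The child co-occurrence equations above can be sketched as elementwise products; the matrix and pmfs below are hypothetical:

```python
import numpy as np

def child_cooccurrences(C, p0, p1):
    """Split a co-occurrence matrix into left/right children using the
    two fitted pmfs: C^0 = C * p0 p0^T and C^1 = C * p1 p1^T
    (elementwise multiplication)."""
    return C * np.outer(p0, p0), C * np.outer(p1, p1)

# Hypothetical 2x2 co-occurrence matrix and two fitted pmfs.
C = np.array([[0.31, 0.19],
              [0.19, 0.31]])
p0 = np.array([0.7, 0.3])
p1 = np.array([0.2, 0.8])
C0, C1 = child_cooccurrences(C, p0, p1)
print(C0)
print(C1)
```

Each child matrix is then split in turn, producing the full binary tree described above.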
Stauffer's method successfully measures similarity in terms of probability instead of Euclidean distances. This allows Stauffer to easily combine multiple features into a single metric. However, Stauffer's method requires a large amount of memory and computing time to create and split co-occurrence matrices and to generate codebook statistics.
A color histogram is a measure of how often each color occurs in an image. Given a discrete m-dimensional color space, the color histogram is obtained by discretizing the colors present in the image and counting the number of times each color occurs in the image. Often the color space has dimension 3, and the image colors mapped to a given discrete color are contained in a 3-dimensional bin centered at that color. A successful color-matching algorithm will overcome most or all of the following problems that often degrade image indexing systems:
        Distractions present in the object's background
        Occlusions
        Viewing the object from various viewpoints
        Varying image resolution
        Varying lighting conditions
Distractions in the object's background occur when parts of the background match the object. For example, the military often camouflages tanks and jeeps to match the surrounding terrain. This type of distraction can severely degrade similarity metrics. An object becomes occluded when an object in the foreground blocks all or part of it; the imaging sensor cannot detect occluded parts of an object. Also, since objects look different from various viewpoints, attempting to compare images of objects captured from different viewpoints can quickly degrade similarity metrics. In such cases multiple models of a particular object may need to be collected for accurate matching. For example, it may be difficult to match an image showing a person facing a camera with a picture of the same person facing away from the camera. Similarity metrics must also be able to identify objects that are rotated and translated within an image, which causes many template-matching algorithms to be computationally expensive.
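The 3-dimensional histogram construction described above can be sketched as follows, assuming uniform per-channel quantization of 8-bit RGB values (the bin count and test image are hypothetical):

```python
import numpy as np

def color_histogram(image, bins_per_channel=4):
    """3-D color histogram: quantize each RGB channel into equal-width
    bins and count how many pixels fall into each (r, g, b) bin."""
    # image: (H, W, 3) array of uint8 values in [0, 255]
    q = (image.astype(int) * bins_per_channel) // 256   # per-channel bin index
    hist = np.zeros((bins_per_channel,) * 3, dtype=int)
    for r, g, b in q.reshape(-1, 3):
        hist[r, g, b] += 1
    return hist

img = np.zeros((2, 2, 3), dtype=np.uint8)   # hypothetical all-black 2x2 image
h = color_histogram(img)
print(h[0, 0, 0])   # all four pixels land in the first bin
```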
Changing image resolution can also hinder similarity metrics. As an object's distance from a camera increases, information about that object decreases; it is difficult even for humans to recognize low-resolution images. Varying lighting conditions can cause an object's appearance to change: lights oscillating at some frequencies cause the specular reflectance of the object to vary with time, and outside brightness and shadow patterns change continuously. Color-constancy algorithms make it possible to perceive a constant color despite light variations. For most color histogram similarity metrics it is desirable to create histograms from color spaces that are uniform, compact, complete, and compatible with human perception of color. Common color spaces include:
        HSV
        OPP
        RGB
        YUV
        Munsell system
In a uniform color space, distance between colors is directly proportional to the psychological similarity between the colors. A color space is compact when there is a perceptual difference between each pair of colors. A complete color space contains all perceptible colors. Finally, color spaces that appeal to the human visual system represent colors in a perceptually meaningful way; representing colors by hue, saturation, and intensity often accomplishes this.
Quantization is an important issue to consider when choosing a color space for object similarity matching via histogram matching. Uniform quantization of individual color space components is the most obvious quantization scheme and the most reasonable method when no a priori knowledge of the color space exists. However, color distributions are generally not uniform, so uniform quantization can be inefficient and can degrade the performance of the similarity metric. Vector Quantization (VQ) can instead be used to quantize the color space in a manner that minimizes the mean-squared quantization error between pixels in the images and pixels in the quantized images. Minimization is based on quantizing the color space into a new set of N color points. Note that this technique becomes impractical for large image databases.
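Such a VQ color quantization can be sketched with a minimal k-means loop (a simplification of full VQ codebook design, under the assumption of squared-error minimization); the pixel data below are hypothetical:

```python
import numpy as np

def vq_palette(pixels, n_colors, iters=20, seed=0):
    """Minimal k-means sketch of VQ color quantization: find n_colors
    points that (locally) minimize mean-squared error to the pixels."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), n_colors, replace=False)].astype(float)
    labels = np.zeros(len(pixels), dtype=int)
    for _ in range(iters):
        # Assign each pixel to its nearest center.
        d = np.linalg.norm(pixels[:, None] - centers[None, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned pixels.
        for k in range(n_colors):
            if np.any(labels == k):
                centers[k] = pixels[labels == k].mean(axis=0)
    return centers, labels

# Two tight clusters of hypothetical RGB pixels: dark and bright.
pixels = np.array([[0, 0, 0], [10, 10, 10],
                   [250, 250, 250], [240, 240, 240]], dtype=float)
centers, labels = vq_palette(pixels, n_colors=2)
print(labels)
```

For a real database the pixels would come from the stored images, and the resulting centers would define the histogram bins.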
Mathias performed several experiments to determine what combination of color spaces, quantization schemes, and similarity metrics work best for image indexing systems. To evaluate and compare each set, Mathias developed a criterion function based on the number of false negatives, search efficiency, and computational complexity. A false negative occurs when the system has not located every image that contains visually similar colors. Visually similar images correspond to human perceptual similarity. Efficiency measures the average number of images that must be viewed to see a given number of correct images. Computational complexity is measured by the amount of time it takes to calculate similarity metrics and to index the image database. Mathias's results showed that histogram intersection and the city-block metric provided the most accurate results with the best response times. These results were obtained when using the HSV color space quantized to 216 bins. Twenty combinations of color spaces, quantization schemes, and similarity metrics were evaluated in the study.
Five similarity metrics were analyzed:
        City-block metric
        Euclidean metric
        Histogram Intersection
        Average color distance
        Quadratic distance measure
Three color spaces were analyzed:
        CIE L*u*v*
        HSV
        OPP
Each space was uniformly quantized. The CIE L*u*v* space was quantized into 4×8×8 partitions, which corresponds to a 256 bin histogram. The HSV space was quantized into 18×3×3 (162 bins) and also 24×3×3 (216 bins). The opponent space was partitioned into 256 bins, where the wb, rg, and by coordinates were divided into 4, 8, and 8 bins respectively.
Histogram Intersection is a similarity metric used to compare an image or object within an image to every model in a database. Given two histograms, P and M, each containing n bins, Histogram Intersection is defined as
d(P, M) = Σ_{i=1}^{n} min(p_i, m_i).
This is the number of pixels from the model, M, that have corresponding pixels of the same color in the image, P. When images contained within the database vary in size the metric is normalized.
d(P, M) = (Σ_{i=1}^{n} min(p_i, m_i)) / (Σ_{i=1}^{n} m_i).
Histogram Intersection is robust to the following problems that often degrade image indexing algorithms:
        Distractions present in the object's background
        Occlusions
        Viewing the object from various viewpoints
        Varying image resolution
The similarity is increased only when a given pixel has the same color as one of the colors in the model, or when the total number of pixels used to represent that color in the object is less than the number of pixels of that color in the model. The method is robust to scale changes, but not independent of such changes. Histogram Intersection is not robust to varying lighting conditions; to compensate, various histogram intersection algorithms employ color-constancy techniques.
Histogram Intersection can be related to the city-block similarity metric on an n-dimensional feature space. When the histograms are scaled to be the same size, Histogram Intersection is equivalent to the city-block distance metric.
1 − d(P, M) = (1 / 2T) Σ_{i=1}^{n} |p_i − m_i|, where T = Σ_{i=1}^{n} p_i = Σ_{i=1}^{n} m_i.
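The normalized intersection metric and its city-block equivalence can be checked numerically; the histograms below are hypothetical and share the same total T:

```python
import numpy as np

def histogram_intersection(p, m):
    """Normalized Histogram Intersection: the fraction of model pixels
    that have same-color counterparts in the image histogram."""
    return np.minimum(p, m).sum() / m.sum()

# Hypothetical histograms with equal totals (T = 10), so the identity
# 1 - d(P, M) = (1 / 2T) * sum |p_i - m_i| should hold.
p = np.array([4.0, 3.0, 3.0])
m = np.array([2.0, 5.0, 3.0])
T = p.sum()
d = histogram_intersection(p, m)
city_block = np.abs(p - m).sum() / (2 * T)
print(1 - d, city_block)
```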
However, the foregoing image sorting and matching methods fail to provide a simple object recognition method for an automatic vision system.