Image processing generally refers to taking either real-world images captured by devices such as cameras, infrared sensors and ultrasound scanners, or computer-generated images created by computer graphics, modeling or animation software packages, and manipulating those images to achieve a desired result. Video processing, on the other hand, generally involves taking video sequences captured by an analog or a digital camera, which can be viewed as collections of still images or frames that contain independently moving objects, and extracting useful information about an object of interest. Such information can be used for storage and retrieval of video sequences in a multimedia environment or as an input to compression algorithms, depending on the specific needs of the application.
Advances in modern multimedia technologies over the last few years have led to a dramatic growth of digitally stored data, such as archives of images, audio and video, and the exchange of such data over communication networks. Numerous applications in diverse fields such as medicine, remote sensing, education, video-on-demand, video conferencing, high definition television (HDTV), on-line information services and entertainment, require the manipulation, storage and retrieval of visual data.
An important task in multimedia applications, such as multimedia databases, is indexing and accessing images and videos and, significantly, being able to do so quickly. For example, news broadcasting television stations frequently store footage of news stories in video databases. To properly index the videos, the television station must have knowledge of all the subjects contained in the videos, the frames in which they are present, and the location of the objects of interest within each frame. Manually searching and indexing each piece of video footage and dividing it according to subject is a very tedious and time-consuming task, since each video sequence is composed of thousands of individual frames. Video processing algorithms have accordingly been developed which can dramatically reduce the amount of time required for such a task. With these algorithms, the user selects the location of an object of interest in the first frame and the algorithm tracks the object in all of the subsequent frames. With the location of the subject in each frame available and the frames in which the object appears identified, the indexing and later retrieval of each video sequence can be achieved far more easily.
Another use of video processing is in compression algorithms. Video sequences form large data files that require substantial transmission bandwidth as well as storage capacity. As a result, the development of efficient compression algorithms is a crucial task in video processing and has been an active field of research over the last ten years. Several standards have emerged for video compression, such as H.263 and the MPEG compression family. There are two types of redundancy in video sequences, spatial and temporal, and compression can be achieved by exploiting them. Temporal redundancies are usually removed by using motion estimation and compensation algorithms. Motion estimation techniques take the location of an object of interest in the current frame as an input and calculate the new position of the object in the next frame. The motion is described in terms of a motion vector, which is the signed difference between the current and the next position. Motion compensation attempts to predict subsequent frames at the decoder on the basis of already decoded frames and the estimate of the object's motion received from the coder. In the context of the currently emerging MPEG-4 compression standard, there is a great deal of interest in content-based manipulation and object-based video coding. After objects in a scene have been identified by their contours, the current frame is divided into regions and the motion of each region is calculated with respect to the previous or the next frame in the video sequence.
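The motion-vector and motion-compensation ideas described above can be sketched in a few lines. This is a minimal illustration assuming grayscale frames stored as NumPy arrays; the toy frame, block position and motion vector are hypothetical values chosen for the example, not taken from any real codec:

```python
import numpy as np

# Toy decoded frame k (8x8) with a 2x2 block of interest at (2, 2).
frame_k = np.arange(64, dtype=float).reshape(8, 8)
block_pos = np.array([2, 2])          # (row, col) of the block in frame k
block = frame_k[2:4, 2:4]

# Hypothetical motion vector: the signed difference between the block's
# position in frame k and its position in frame k+1.
motion_vector = np.array([1, 2])      # moved down 1 row, right 2 columns
new_pos = block_pos + motion_vector   # position of the block in frame k+1

# Motion compensation: the decoder predicts frame k+1 by copying the
# already-decoded block from frame k to the displaced position.
predicted = np.zeros_like(frame_k)
r, c = new_pos
predicted[r:r+2, c:c+2] = block
```

A real coder would also transmit the prediction error (residual) so that the decoder can correct the compensated block.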
While the human visual system can easily distinguish between moving objects, computer-based object tracking remains a challenge. Several approaches to motion estimation have been developed over the last few years, such as optical flow methods, Bayesian methods and block-based methods. Block-based motion estimation and compensation in particular are among the most popular approaches, due primarily to their simpler hardware implementation. As a result, block-based motion estimation has been adopted by the international standards for digital video compression, such as H.261, H.263 and the MPEG family.
The block motion model assumes that the image is composed of moving blocks. In block matching, the best motion vector estimate is found by a pixel-domain search procedure. According to this procedure, the displacement of a pixel at (n1, n2) in the current frame k is determined by first considering an N1×N2 block centered around (n1, n2). Next, a search is performed in a search frame k+1 for the location of the best-matching block of the same size. For computational reasons, the search is usually limited to a region of size (N1+2M1)×(N2+2M2), called the search window, where M1 and M2 are predefined integers that determine the window size. Block-matching algorithms differ in the matching criteria, the search strategy, and the determination of block size.
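The search procedure over the (N1+2M1)×(N2+2M2) window can be sketched as an exhaustive evaluation of every candidate displacement. This is a minimal sketch assuming grayscale frames as NumPy arrays, using the mean absolute difference as the matching criterion; for simplicity the block is addressed by its top-left corner rather than its centre, and the function names are illustrative:

```python
import numpy as np

def mad(a, b):
    # Mean absolute difference between two equal-sized blocks.
    return np.mean(np.abs(a - b))

def full_search(cur, ref, top, left, N1, N2, M1, M2):
    # Match the N1 x N2 block of `cur` whose top-left corner is at
    # (top, left) against every candidate position inside the
    # (N1 + 2*M1) x (N2 + 2*M2) search window of `ref`.
    block = cur[top:top+N1, left:left+N2]
    best_cost, best_mv = np.inf, (0, 0)
    for dr in range(-M1, M1 + 1):
        for dc in range(-M2, M2 + 1):
            r, c = top + dr, left + dc
            # Skip candidates that fall outside the search frame.
            if r < 0 or c < 0 or r + N1 > ref.shape[0] or c + N2 > ref.shape[1]:
                continue
            cost = mad(block, ref[r:r+N1, c:c+N2])
            if cost < best_cost:
                best_cost, best_mv = cost, (dr, dc)
    return best_mv  # motion vector (row offset, column offset)
```

The nested loops make the cost of this approach explicit: every one of the (2M1+1)(2M2+1) candidate displacements requires a full block comparison.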
The matching of the blocks can be quantified according to several criteria, such as maximum cross-correlation, minimum square error, mean absolute difference and maximum matching pel count. Finding the best-matching block requires optimizing the chosen matching criterion over all candidate displacement vectors at each pixel (n1, n2). One way of achieving this is the full-search method, which evaluates the matching criterion at every location within the search window. Although the full search is guaranteed to find the best-matching block, it requires a great amount of processing, is extremely time-consuming, and is therefore highly impractical for real-time systems. In most cases, faster search strategies are utilized, even though they often lead to sub-optimal solutions. One of these faster methods is the three-step search, a popular algorithm of the logarithmic search family. Instead of searching the entire window for the best match, the three-step search calculates the similarity measure at only nine evenly distributed positions per step. The best-matching position from one step becomes the starting point for the next step, with the spacing between the positions halved. One limitation of the three-step search is that it may not find the global minimum, which is the best-matching block in the entire search window, but may instead get trapped in a local minimum.
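Under the same illustrative assumptions (grayscale NumPy frames, mean absolute difference as the criterion, blocks addressed by their top-left corner), the three-step search can be sketched as follows; the block size and initial step size are example parameters, not mandated by any standard:

```python
import numpy as np

def three_step_search(cur, ref, top, left, N=8, step=4):
    # At each step, evaluate the matching criterion at the current
    # centre and the eight surrounding positions spaced `step` pixels
    # apart, move the centre to the best of the nine, then halve the
    # step size. With step = 4 this gives the classic three steps.
    block = cur[top:top+N, left:left+N]
    cr, cc = top, left                    # current search centre
    while step >= 1:
        best_cost, best_pos = np.inf, (cr, cc)
        for dr in (-step, 0, step):
            for dc in (-step, 0, step):
                r, c = cr + dr, cc + dc
                if r < 0 or c < 0 or r + N > ref.shape[0] or c + N > ref.shape[1]:
                    continue              # candidate outside the frame
                cost = np.mean(np.abs(block - ref[r:r+N, c:c+N]))
                if cost < best_cost:
                    best_cost, best_pos = cost, (r, c)
        cr, cc = best_pos                 # best of this step seeds the next
        step //= 2
    return cr - top, cc - left            # motion vector (dr, dc)
```

With steps of 4, 2 and 1, only about 25 distinct positions are visited, versus the 225 candidates of an exhaustive search over the same ±7 range; the price, as noted above, is that the search may settle in a local minimum.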
The selection of an appropriate block size is essential for any block-based motion estimation algorithm and is influenced by a number of factors, some of which impose conflicting requirements on the size of the search blocks. If the blocks are too small, a false match may be established between blocks containing similar gray-level patterns, which are unrelated in terms of motion. On the other hand, if the blocks are too big, then actual motion vectors may vary within the block, violating the basic assumption of a single motion vector per block.
To address the problem of selecting the optimal block size, hierarchical block-matching algorithms have been developed that use a multi-resolution representation of the frames, in the form of a Laplacian pyramid or wavelet transform. The basic idea of hierarchical block-matching is to perform motion estimation successively at each level of resolution, starting at the lowest resolution level, which yields a rough estimate of the displacement vector using relatively small blocks. The estimate at one level is then passed on to the next higher resolution level as an initial estimate, and the higher resolution levels serve to refine the initial displacement vector estimate. A drawback of hierarchical block-matching methods is that they require additional computations for calculating the sub-sampled representations of each frame, as well as additional memory. Thus, the better performance of the hierarchical block-matching algorithm can be outweighed by the increase in running time.
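The coarse-to-fine idea can be sketched under the same illustrative assumptions (grayscale NumPy frames, mean absolute difference, top-left block addressing). Here a simple 2×2 averaging pyramid stands in for the Laplacian pyramid or wavelet representation mentioned above, and all parameters are example values:

```python
import numpy as np

def downsample(img):
    # Halve the resolution by averaging each 2x2 neighbourhood.
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def search(cur, ref, top, left, N, M, init=(0, 0)):
    # Small full search of radius M around an initial displacement.
    block = cur[top:top+N, left:left+N]
    best_cost, best_mv = np.inf, init
    for dr in range(init[0] - M, init[0] + M + 1):
        for dc in range(init[1] - M, init[1] + M + 1):
            r, c = top + dr, left + dc
            if 0 <= r and 0 <= c and r + N <= ref.shape[0] and c + N <= ref.shape[1]:
                cost = np.mean(np.abs(block - ref[r:r+N, c:c+N]))
                if cost < best_cost:
                    best_cost, best_mv = cost, (dr, dc)
    return best_mv

def hierarchical_search(cur, ref, top, left, N=8, levels=3, M=2):
    # Build the multi-resolution pyramids for both frames
    # (the extra computation and memory noted in the text).
    pyr_cur, pyr_ref = [cur], [ref]
    for _ in range(levels - 1):
        pyr_cur.append(downsample(pyr_cur[-1]))
        pyr_ref.append(downsample(pyr_ref[-1]))
    mv = (0, 0)
    # Estimate at the coarsest level first, then refine at finer levels.
    for lvl in range(levels - 1, -1, -1):
        s = 2 ** lvl
        mv = search(pyr_cur[lvl], pyr_ref[lvl],
                    top // s, left // s, max(N // s, 2), M, init=mv)
        if lvl > 0:
            mv = (mv[0] * 2, mv[1] * 2)  # scale the estimate up one level
    return mv
```

With a search radius of only ±2 at each of three levels, the effective range at full resolution is roughly ±(2·4 + 2·2 + 2) = ±14 pixels, at a small fraction of the cost of a full search over that range.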