1. Field of the Invention
The present invention relates to a method of and an apparatus for segmenting a pixellated image into at least one foreground region and at least one background region. Such techniques may be used in the field of video compression in order to reduce the data rate and/or improve compression quality of foreground regions. Such techniques may also be used to compose new image sequences by replacing a segmented background with another background image or another sequence of background scenes. Further possible applications include video communication, video conferencing, television broadcasting, Internet multimedia applications, MPEG-4 applications, face detection applications and real time video tracking systems such as observer tracking autostereoscopic 3D displays. A specific application of such techniques is in digital video cameras and other digital image capture and recording devices for multimedia applications. An example of such a device is the Sharp® Internet ViewCam.
2. Description of the Related Art
Many known image processing and analysis applications involve image sequences which contain foreground objects, which are normally temporally active, and a background region, which is relatively static. Parts of the background scene may be covered and/or uncovered as the foreground objects move and/or change shape. It is very useful for these applications to have the capability to segment the images into foreground and background regions.
The Sharp® Corporation Internet ViewCam VN-EZ1 is an MPEG-4 digital recorder made for multimedia applications. This recorder enables computer users to incorporate moving pictures into their multimedia applications, such as home pages, Internet broadcasts, and e-mail communications. This recorder uses the MPEG-4 digital moving picture compression standard and Microsoft® Advanced Streaming Format to produce moving picture files that are small in size and thus more practical for Internet distribution. The video data are recorded onto SmartMedia™ memory cards, offering approximately one hour of recording time.
A successful segmentation, for example, would enable different compression techniques to be applied to the foreground and background regions. A higher compression ratio may then be achieved, enabling a longer recording time with an improved quality in the foreground regions. In addition, the background regions may be replaced with other scenes to produce a special effect to enhance attractiveness to consumers.
Earlier systems performed segmentation by using a carefully controlled background such as a uniformly coloured screen or a brightly illuminated backing behind the foreground objects. For example, U.S. Pat. No. 5,808,682 discloses a data compressing system which segments the foreground objects from a special background, which is illuminated uniformly by a known colour. Any colour may be used but blue has been the most popular. Therefore this type of coloured backing is often referred to as blue backing. The foreground objects can then be segmented using well known chroma key technology.
On large coloured backing, it is not a simple matter to achieve uniform illumination. U.S. Pat. No. 5,424,781 discloses a linear image compositing system which corrects for non-uniform luminance and/or colour of the coloured backing without incurring the penalties of edge glow, edge darkening, loss of edge detail and other anomalies.
For black-and-white images, it is known to use a controlled background so as to try to separate the foreground objects and the background scene into two different ranges of the grey scale. Typically the segmentation may be achieved by finding a deep valley in the histogram of the grey levels. Nobuyuki Otsu, "A threshold selection method from grey-level histograms", IEEE Trans. on Systems, Man and Cybernetics, Vol. SMC-9, No. 1, January 1979, pp. 62-66, discloses such a method to find an optimal threshold to segment the foreground objects from their background. FIG. 1 of the accompanying drawings illustrates a histogram of this type in which h(t) represents the number of pixels and t represents the amplitude of the pixel values. The controlled background is such that the majority of the background pixels have relatively low levels whereas the foreground pixels have levels which tend to occupy a higher range. Otsu attempts to define a threshold T in the valley between the two ranges.
There are several problems with this technique. For example, although FIG. 1 indicates that a well-defined valley exists between the background and foreground grey level ranges, this is only the case for very carefully controlled backgrounds and possibly some but certainly not all foregrounds.
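By way of illustration only, histogram-based threshold selection of the kind attributed to Otsu above may be sketched as follows. The sketch uses the usual formulation of that method, in which the threshold is chosen to maximise the between-class variance of the two grey level ranges. The function name and the convention that levels less than or equal to the threshold form the lower class are assumptions made for this example.

```python
def otsu_threshold(hist):
    """Return the level t that maximises the between-class variance.

    hist[t] is the number of pixels h(t) with grey level t. Pixels with
    level <= t form the lower class, the rest form the upper class.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # sum of grey levels in the lower class
    for t, count in enumerate(hist):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no valid split here
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance up to the constant factor 1 / total**2.
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t
```

For a histogram with two well separated groups of levels the returned threshold falls in the valley between them; for overlapping groups any single threshold misclassifies some pixels.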
If this technique is not restricted to very carefully controlled conditions, then the problems become more severe. In particular, for many if not all images to be segmented, significant numbers of foreground pixels will have levels extending below the threshold whereas significant numbers of background pixels will have levels extending above the threshold. Thus, any threshold T which is chosen will lead to incorrect segmentation.
Another technique for segmenting an image is disclosed in T Fugimoto et al, "A method for removing background regions from moving images", SPIE vol. 1606 Visual communications and image processing 1991, imaging processing, pp. 599-606. This technique makes use of both the level and polarity of the pixel values in order to be resistant to lighting intensity fluctuations.
FIG. 2 of the accompanying drawings is a histogram with the same axes as FIG. 1 but illustrating the effect of lighting intensity fluctuations. In the absence of such fluctuations, the distribution illustrated in the histogram has a narrow peak centred on the vertical axis with symmetrically sloping sides. When a lighting intensity fluctuation occurs, this peak becomes offset horizontally. The technique of Fugimoto et al is to derive asymmetrical positive and negative thresholds T1 and T2 by matching a Gaussian distribution to the actual position of the peak and simulating the remainder of the curve, which is assumed to represent foreground pixel levels, with a constant function. The intersection between the Gaussian distribution and the constant function gives the threshold values T1 and T2 for the image being processed. It is then assumed that all pixel values between the thresholds represent noise.
This technique suffers from the same problems as Otsu. Although it may be resistant to lighting intensity fluctuations, the selection of the thresholds cannot be made in such a way that every image which is likely to be encountered will be correctly segmented.
U.S. Pat. No. 5,878,163 discloses an imaging target tracker and a method of determining thresholds that are used to optimally distinguish a target from its background. The target is assumed to occupy a grey level region which is identified from two histograms corresponding to the inner and outer regions of the target, respectively. Both histograms are recursively smoothed and a lookup table of actually observed pixel values is then computed. Two optimal thresholds are selected and are set at respective ends of histogram segments. The likelihood maps adapt over time to the signature of the target. The grey-level distribution of the target is used to select thresholds that pass a band of grey levels whose likelihood of belonging to the target is high. An accurate segmentation is not necessary for this type of application.
While these methods may achieve reasonable results of segmentation for the desired applications and are usually computationally efficient, the requirement of having a carefully controlled background that can be distinguished from the target in either intensity or colour severely limits the range of the applications available.
A more challenging task is therefore how to segment the foreground objects from the background of a general scene. Methods which address this task often require the calculation of a difference image which characterises the difference between the current frame and a predetermined frame. The predetermined frame could be either a pre-recorded image of the background, or the previous frame, or an image generated from a number of the previous frames. U.S. Pat. No. 5,914,748 discloses an electronic compositing system for inserting a subject into a different background. The method subtracts from each image of the sequence a pre-recorded image of the background to generate a difference image. A mask image is then generated by thresholding this difference image. The mask image is used to segment the foreground objects from their background. The method is simple to implement but may require manual correction by users to remove large artefacts in both the segmented foreground regions and the background regions.
In terms of computer implementation, the segmentation of the foreground and background regions may be performed at either a pixel-based level or a block-wise level. Block-wise segmentation divides an image into blocks, each comprising a number of pixels which are all classified as either foreground or background pixels. Pixel-based and block-wise methods have their own advantages and disadvantages. For example, pixel-based segmentation can follow the boundaries of foreground objects more closely but may not have good connectivity and is more prone to noise. On the other hand, block-wise methods have fewer artefacts in the segmented foreground and background regions, but may have a poorer performance around the boundaries. Sometimes it is possible to combine these two approaches, with different combinations yielding different results depending on applications.
In data compression systems, block-wise coding methods such as the discrete cosine transform and its variants normally operate on square blocks of data, making a segmentation of the image into temporally active/inactive regions composed of square sub-segments desirable. Sauer and Jones, "Bayesian block-wise segmentation of interframe differences in video sequences", CVGIP: Graphics and Image Processing, Vol. 55, No. 2, March 1993, pp. 129-139, disclose a Bayesian algorithm for segmenting images of a video sequence into blocks chosen as static background and dynamic foreground for the sake of differential coding of temporally dynamic and static regions. In this application, regions that are temporally active are defined as "foreground" and otherwise as "background", so that parts of or the whole of a foreground object may become background regions where there are no changes over these regions. This method models the data as random fields at two levels of resolution. The interframe difference at each pixel is first thresholded, yielding a binary image. The natural spatial correlation of image data is captured by a Markov random field model on this field of binary-valued pixels in the form of the classical Ising model. At the second level of resolution, the field consisting of blocks which exhibit correlation among neighbours is also described by a Markov model.
U.S. Pat. No. 5,915,044 discloses a video-encoding system that corrects for the gain associated with video cameras that perform automatic gain control. The gain-corrected images are analysed to identify blocks that correspond to foreground objects and those that correspond to the background scene. This foreground/background segmentation may be used to determine how to encode the image and may also be used during the gain-control correction of the subsequent video images. The segmentation analysis is carried out both at pixel-level and at block-level. At the pixel level, pixel differences between the current frame and a reference frame are thresholded to yield a pixel mask indicating changed pixels. The reference frame is then generated from the averaged values of a number of the previous frames. The block-level takes the pixel-level results and classifies blocks of pixels as foreground or background, which is natural for a block-based compression scheme. The basis for classification is the assumption that significantly changed pixels should occur only in the foreground objects. A threshold is generated by considering a maximum likelihood estimate of changed regions, based on zero-mean Gaussian-distributed random variable modelling. A morphological filter is applied to decrease false foreground detection before block level processing is applied to classify each block as belonging to the foreground or the background. This application does not require very accurate detection of the foreground objects. The main purpose is to separate temporally changing regions from static regions so that they can be encoded differently.
In general, these methods tend to be computationally expensive and may not be suitable for real-time applications such as the Sharp® Corporation Internet ViewCam, which has limited computing power and memory storage. The robustness of these methods may be limited, often requiring manual user correction. Whereas pixel-based methods tend to leave artefacts in both the segmented foreground and background, block-wise methods tend to produce ragged boundaries.
According to a first aspect of the invention, there is provided a method of segmenting a pixellated image, comprising the steps of:
(a) selecting at least one first region from a first reference image;
(b) deriving from values of pixels of the at least one first region a first threshold such that a first predetermined proportion of the pixels have values on a first side of the first threshold;
(c) forming a difference image as a difference between each pixel of the image and a corresponding pixel of an image of a non-occluded background; and
(d) allocating each difference image pixel to at least one first type of region if the value of the difference image pixel is on the first side of the first threshold and the values of more than a first predetermined number of neighbouring difference image pixels are on the first side of the first threshold.
The first predetermined proportion may be between 0.5 and 1. The first predetermined proportion may be substantially equal to 0.75.
The first predetermined number may be substantially equal to half the number of neighbouring difference image pixels.
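A minimal sketch of the steps (b) and (d) for the background case is given below, assuming that the first side is "less than or equal to the threshold", that the proportion is 0.75 and that the neighbourhood is a square window centred on the pixel. The function names, the nested-list image representation and the clipping of the window at the image border are assumptions made for illustration.

```python
import math

def proportion_threshold(values, proportion=0.75):
    """Smallest T such that at least `proportion` of `values` are <= T."""
    ordered = sorted(values)
    k = max(0, math.ceil(proportion * len(ordered)) - 1)
    return ordered[k]

def allocate_background(diff, threshold, radius=1):
    """Return a mask that is True where a difference image pixel is
    allocated to the background: its own value is <= threshold and more
    than half of its neighbours in the centred window are <= threshold.
    """
    h, w = len(diff), len(diff[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if diff[y][x] > threshold:
                continue
            neighbours = below = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    # Skip the pixel itself and positions outside the image.
                    if (dy or dx) and 0 <= ny < h and 0 <= nx < w:
                        neighbours += 1
                        if diff[ny][nx] <= threshold:
                            below += 1
            mask[y][x] = below > neighbours / 2
    return mask
```

The neighbour test is what removes isolated low-valued pixels lying inside a foreground object, since such a pixel passes the threshold but its neighbours do not.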
Each of the at least one first region and the at least one first type of region may comprise at least one background region and the first side of the first threshold may be below the first threshold. The first reference image may comprise the difference between two images of the non-occluded background and the at least one first region may comprise substantially the whole of the first reference image.
The at least one first region may be automatically selected. The at least one first region may comprise at least one side portion of the first reference image.
The at least one first region may be manually selected.
The neighbouring pixels in the step (d) may be disposed in an array with the difference image pixel location substantially at the centre of the array.
The method may comprise repeating the steps (a) to (d) for a sequence of images having a common background. The first reference image may be the preceding difference image. The at least one first region may comprise the at least one first type of region of the preceding step (d). Each step (d) may comprise forming a first initial histogram of values of the difference image pixels allocated to the at least one first type of region and the step (b) may derive the first threshold from a first resulting histogram which comprises the sum of the first initial histogram formed in the preceding step (d) and a first predetermined fraction less than 1 of the first resulting histogram of the preceding step (b). The first predetermined fraction may be a half.
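The histogram accumulation described above may be sketched as follows: the resulting histogram for the current image is the initial histogram from the preceding allocation plus a fraction of the previous resulting histogram, and the threshold is then read from the accumulated histogram. The function names are hypothetical, and the "less than or equal" convention for the proportion is an assumption.

```python
def update_histogram(previous_resulting, current_initial, fraction=0.5):
    """Resulting histogram = current initial histogram
    + fraction * previous resulting histogram (fraction < 1)."""
    return [c + fraction * p
            for c, p in zip(current_initial, previous_resulting)]

def threshold_from_histogram(hist, proportion=0.75):
    """Smallest level T such that at least `proportion` of the histogram
    mass lies at levels <= T."""
    total = sum(hist)
    running = 0
    for level, count in enumerate(hist):
        running += count
        if running >= proportion * total:
            return level
    return len(hist) - 1
```

With a fraction of one half, the contribution of an image to the threshold halves with each subsequent image, so a single image with an unusually small background region shifts the threshold less than it would if used alone.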
The method may comprise the steps of:
(e) selecting at least one second region from a second reference image;
(f) deriving from the values of pixels of the at least one second region a second threshold such that a second predetermined proportion of the pixels have values on a second side opposite the first side of the second threshold; and
(g) allocating each difference image pixel, which is not allocated to the at least one first type of region, to at least one second type of region if the value of the difference image pixel is on the second side of the second threshold and the values of more than a second predetermined number of neighbouring difference image pixels are on the second side of the second threshold.
The second predetermined proportion may be between 0.5 and 1. The second predetermined proportion may be substantially equal to 0.75.
The second predetermined number may be substantially equal to half the number of neighbouring difference image pixels.
The at least one second region may be automatically selected. The at least one second region may comprise a middle portion of the second reference image. The at least one second region may be manually selected.
The second reference image may comprise the first reference image.
The neighbouring pixels in the step (g) may be disposed in an array with the difference image pixel location substantially at the centre of the array.
The method may comprise repeating the steps (e) to (g) for a sequence of images having a common background. The second reference image may be the preceding difference image. The at least one second region may comprise the at least one second type of region of the preceding step (g). Each step (g) may comprise forming a second initial histogram of values of the difference image pixels allocated to the at least one second type of region and the step (f) may derive the second threshold from a second resulting histogram which comprises the sum of the second initial histogram formed in the preceding step (g) and a second predetermined fraction less than 1 of the second resulting histogram of the preceding step (f). The second predetermined fraction may be a half.
The method may comprise allocating each difference image pixel, which is not allocated to the at least one first type of region and which is not allocated to the at least one second type of region, as a candidate first type of pixel if a value of the difference image pixel is less than a third threshold.
The third threshold may be between the first and second thresholds. The third threshold may be the arithmetic mean of the first and second thresholds.
The method may comprise allocating each difference image pixel, which is not allocated to the at least one first type of region and which is not allocated to the at least one second type of region, to the at least one first type of region if more than a third predetermined number of the neighbouring pixels are allocated to the at least one first type of region or as candidate first type of pixels.
The neighbouring pixels may comprise an array of pixels with the difference image pixel location substantially at the centre of the array.
The third predetermined number may be half the number of neighbouring difference image pixels.
The method may comprise allocating each difference image pixel, which is not allocated to the at least one first type of region and which is not allocated to the at least one second type of region, to the at least one second type of region.
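The handling of pixels left unallocated by the first two thresholds may be sketched as below, assuming that the first type of region is background, that the third threshold is the arithmetic mean of the first two and that the third predetermined number is half the number of neighbours. The label values, the function name and the border handling are hypothetical. The text does not state whether the pixel being allocated must itself be a candidate, and this sketch does not require it.

```python
BACKGROUND, FOREGROUND = 0, 1  # hypothetical labels; None = not yet allocated

def resolve_remaining(diff, labels, t1, t2, radius=1):
    """Allocate every pixel whose label is None.

    A pixel becomes background if more than half of its neighbours are
    either already background or candidate background pixels (unallocated
    pixels with a value below the third threshold). Otherwise it becomes
    foreground.
    """
    t3 = (t1 + t2) / 2  # third threshold: mean of the first and second
    h, w = len(diff), len(diff[0])
    candidate = [[labels[y][x] is None and diff[y][x] < t3
                  for x in range(w)] for y in range(h)]
    result = [row[:] for row in labels]
    for y in range(h):
        for x in range(w):
            if labels[y][x] is not None:
                continue
            neighbours = votes = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if (dy or dx) and 0 <= ny < h and 0 <= nx < w:
                        neighbours += 1
                        if labels[ny][nx] == BACKGROUND or candidate[ny][nx]:
                            votes += 1
            result[y][x] = BACKGROUND if votes > neighbours / 2 else FOREGROUND
    return result
```

Votes are counted against the labels as they stood before this pass, so the outcome does not depend on the order in which pixels are visited.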
The or each image and the background image may be grey level images and the step (c) may form the difference between each image pixel and the corresponding background pixel as the difference between the grey level of each image pixel and the grey level of the corresponding background pixel.
The step (c) may comprise performing a moving window averaging step on the or each difference image.
The image to be segmented may be a colour component image and the moving window averaging step may be performed on each of the colour components.
The or each image and the background image may be colour images and the step (c) may form the difference between each image pixel and the corresponding background pixel as a colour distance between the colour of each image pixel and the colour of the corresponding background pixel. The colour distance may be formed as:
$$\sum_{i=1}^{n} \alpha_i \left| I_i - B_i \right|$$
where $n$ is the number of colour components of each pixel, $I_i$ is the ith colour component of an image pixel, $B_i$ is the ith colour component of a background pixel and $\alpha_i$ is a weighting factor. Each $\alpha_i$ may be equal to 1. $n$ may be equal to 3, $I_1$ and $B_1$ may be red colour components, $I_2$ and $B_2$ may be green colour components and $I_3$ and $B_3$ may be blue colour components.
The step (c) may form colour component difference images $I_i - B_i$ and may perform a moving window averaging step on each of the colour component difference images.
The window may have a size of 3×3 pixels.
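The colour distance and the moving window averaging may be sketched as follows. The sketch assumes that each signed colour component difference image is averaged first and that the weighted absolute values are summed afterwards, which is one reading of the text. The function names, the nested-list representation of images as rows of colour tuples and the clipping of the window at the image border are assumptions.

```python
def colour_distance(pixel, background, weights=(1, 1, 1)):
    """Weighted sum of absolute colour component differences."""
    return sum(a * abs(i - b) for a, i, b in zip(weights, pixel, background))

def window_average(image, size=3):
    """Mean over a size x size window centred on each pixel.
    Windows are clipped at the image border (an assumption)."""
    r = size // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = count = 0
            for ny in range(max(0, y - r), min(h, y + r + 1)):
                for nx in range(max(0, x - r), min(w, x + r + 1)):
                    total += image[ny][nx]
                    count += 1
            out[y][x] = total / count
    return out

def smoothed_colour_difference(image, background, weights=(1, 1, 1), size=3):
    """Average each signed component difference image, then combine the
    absolute averaged values into a single difference image."""
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    for i, weight in enumerate(weights):
        component = [[image[y][x][i] - background[y][x][i]
                      for x in range(w)] for y in range(h)]
        averaged = window_average(component, size)
        for y in range(h):
            for x in range(w):
                total[y][x] += weight * abs(averaged[y][x])
    return total
```

Averaging the signed differences before taking absolute values lets zero-mean noise of opposite signs cancel within the window, whereas averaging absolute values would not.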
The method may comprise forming a binary mask whose elements correspond to difference image pixels, each element having a first value if the corresponding difference image pixel is allocated to the at least one first type of region and a second value different from the first value if the corresponding difference image pixel is allocated to the at least one second type of region.
The method may comprise replacing the value of each pixel of the or each image corresponding to a difference image pixel allocated to the at least one background region with the value of the corresponding background image pixel.
The method may comprise replacing the value of each pixel of the or each image corresponding to a difference image pixel allocated to the at least one background region with the value of a corresponding pixel of a different background.
The method may comprise replacing the value of each pixel of the or each image corresponding to a difference image pixel allocated to a boundary of at least one foreground region with a linear combination of the value of the image pixel and the value of the corresponding different background pixel. The linear combination may comprise the arithmetic mean of the or each pair of corresponding pixel component values.
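Background replacement with blending at the foreground boundary may be sketched as below. The mask values (0 for background, 1 for foreground), the function names and the definition of a boundary pixel as a foreground pixel with at least one horizontally or vertically adjacent background pixel are assumptions made for this example.

```python
def is_boundary(mask, y, x):
    """True if foreground pixel (y, x) touches a background pixel."""
    h, w = len(mask), len(mask[0])
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] == 0:
            return True
    return False

def replace_background(image, mask, new_background):
    """Copy the new background into background pixels and blend boundary
    foreground pixels with it using the arithmetic mean per component."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 0:
                out[y][x] = new_background[y][x]
            elif is_boundary(mask, y, x):
                out[y][x] = tuple((a + b) / 2 for a, b in
                                  zip(image[y][x], new_background[y][x]))
    return out
```

Interior foreground pixels are left unchanged, so only a one pixel wide band around each foreground region is mixed with the new background.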
The method may comprise, for each colour component, forming a distribution of the differences between the colour component values of the pixels allocated to the at least one background region and the corresponding pixels of the non-occluded background image, determining a shift in the location of a peak in the distribution from a predetermined location, and correcting the colour component values of the pixels allocated to the at least one background region in accordance with the shift.
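For a single colour component, the shift correction may be sketched as follows. The sketch takes the peak of the distribution to be its most frequent value, takes the predetermined location to be zero and corrects by subtracting the shift. All three choices, and the function names, are assumptions.

```python
from collections import Counter

def peak_shift(current_values, reference_values, expected_peak=0):
    """Shift of the most frequent difference between background pixels of
    the current image and the non-occluded background image."""
    differences = Counter(c - r for c, r in
                          zip(current_values, reference_values))
    peak, _ = differences.most_common(1)[0]
    return peak - expected_peak

def correct_component(values, shift):
    """Remove the estimated shift from the component values."""
    return [v - shift for v in values]
```

The same two functions would be applied separately to each colour component of the pixels allocated to the background.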
According to a second aspect of the invention, there is provided an apparatus for segmenting a pixellated image, comprising means for selecting at least one first region from a first reference image, means for deriving from values of pixels of the at least one first region a first threshold such that a predetermined proportion of the pixels have values on a first side of the first threshold, means for forming a difference image as a difference between each pixel of the image and a corresponding pixel of an image of a non-occluded background, and means for allocating each difference image pixel to the at least one first type of region if the value of the difference image pixel is on the first side of the first threshold and the values of more than a first predetermined number of neighbouring difference image pixels are on the first side of the first threshold.
According to a third aspect of the invention, there is provided an apparatus for segmenting a pixellated image, comprising a programmable data processor and a storage medium containing a program for controlling the data processor to perform a method according to the first aspect of the invention.
According to a fourth aspect of the invention, there is provided a storage medium containing a program for controlling a data processor to perform a method according to the first aspect of the invention.
According to a fifth aspect of the invention, there is provided a program for controlling a data processor to perform a method according to the first aspect of the invention.
According to a sixth aspect of the invention, there is provided an image capture device including an apparatus according to the second or third aspect of the invention.
It is thus possible to provide a robust technique for segmenting foreground and background regions of an image or a sequence of images. This may be partially achieved by combining the advantages of pixel-based and block-wise methods to produce good boundaries around the segmented foreground region or regions and few artefacts in both the foreground and background regions.
The robustness is also achieved by the use of a step-by-step approach which first identifies pixels that may be classified more reliably and easily than others. As more pixels are allocated, a better determination of the remaining pixels may be achieved.
By selecting the regions which are used for determining the thresholds, the or each threshold can be determined more accurately so as to improve the segmentation. For example, where the first threshold is used to determine background pixels, the threshold itself can be determined largely or wholly from background regions and so is not affected at all or substantially by the values of pixels in foreground regions. The second threshold when used may likewise be determined on the basis of pixel values in foreground regions so that improved segmentation of an image may be achieved. Each image may be processed recursively so that the improved segmentation leads to improved threshold selection and the improved threshold selection leads to improved segmentation. Such a recursive approach is possible in real time if sufficient computing power is available. Otherwise, such a recursive approach is limited to off-line or non-real time applications.
When processing sequences of images, the or each threshold may be determined by contributions from several or all preceding images so as to improve the threshold selection and hence the image segmentation. For example, when forming histograms for determining the or each threshold, each histogram may comprise the present histogram and a fraction, such as half, of the previous histogram so that the influence of each recursion is reduced with time but the effect on threshold selection is not excessively dominated by an unsatisfactory image, for example having a relatively small background or foreground region which might otherwise distort the threshold selection. Thus, the robustness may be self-improved as the segmentation results improve the estimation of the statistical property of the noise in the background and the signal strength of the foreground. The improved estimation in turn improves the segmentation of the next image, thus forming a loop of continuous improvement. A controlled background is not required and it is possible to deal with any background of a general scene which may include gradual changes with respect to the dynamic changes of foreground objects.
The determination of the thresholds may be related directly to the filtering process after each thresholding operation. No complicated statistical models are required so that the technique is easy to implement.
This technique can be implemented in a computationally efficient way in terms of computing power and memory requirement and involves only simple arithmetic operations, which may be implemented exclusively using integers. This makes it very suitable for real-time applications, such as in the Sharp® Corporation MPEG-4 Internet ViewCam, which has limited computing power and relatively small memory storage, or in other image capture and recording devices for multimedia applications.
This technique may be used in video tracking and face detection applications, for example as disclosed in EP0877274, GB2324428, EP0932114 and GB233590. For example, segmented foreground regions may be used to limit the searching area for locating faces in an image. This may be used in connection with a real time video tracking system, for example as disclosed in European Patent Application No. 99306962.4 and British Patent Application No. 9819323.8.