The present invention relates to apparatus and methods for registration, motion detection, tracking and background replacement.
The state of the art is believed to be represented by the following publications:
H. S. Sawhney and R. Kumar. True Multi Image Alignment and its Application to Mosaicing and Lens Distortion. In Computer Vision and Pattern Recognition, pages 450-456, 1997;
P. Anandan. A Computational Framework and an Algorithm for the Measurement of Visual Motion. Int. J. of Computer Vision 2, pages 283-310, 1989;
M. Irani, B. Rousso and S. Peleg. Computing Occluding and Transparent Motions. Int. J. of Computer Vision, 12 No. 2, pages 5-16, January 1994;
E. Shilat, M. Werman and Y. Gdalyahu. Ridge's Corner Detection and Correspondence. In Computer Vision and Pattern Recognition, pages 976-981, 1997;
H. Wang and M. Brady. Real-Time Corner Detection Algorithm for Motion Estimation. Image and Vision Computing 13 No. 9, pages 695-703, 1995;
Y. Rosenberg and M. Werman. Representing Local Motion as a Probability Matrix and Object Tracking. In DARPA Image Understanding Workshop, pages 153-158, 1997;
M. Ben-Ezra, S. Peleg and M. Werman. Efficient Computation of the Most Probable Motion from Fuzzy Correspondences. Workshop on Application of Computer Vision, 1998;
M. Irani and P. Anandan, "Robust multi-sensor image alignment", Proceedings of International Conference on Computer Vision, January 1998.
"Blue-screen" background replacement is known.
U.S. Pat. No. 5,764,306 to Steffano describes a real time method of digitally altering a video data stream to remove portions of the original image and substitute elements to create a new image. Steffano describes real time replacement of the designated background portion of an incoming video signal with an alternate background. The actual background image is utilized for reference as the basis for determining the background and foreground elements within the image with the end result being comparable to traditional bluescreen processes, but requiring only a personal computer, video camera and software. The reference background image can be any reasonably static scene with a sufficient and stable light source captured by the camera. The video data stream is modified in real time by comparisons against the reference background image and is then passed on to its original destination. Multiple signal-noise processing algorithms are applied in real time against the signal to achieve a visually acceptable matte.
The disclosures of all publications mentioned in the specification and of the publications cited therein are hereby incorporated by reference.
The present invention seeks to provide a fast and robust method for image registration and motion detection based on discrete representation of the local motion. This allows the implementation of a real-time system on a PC which can register images and detect and track a moving object in video images, even when the camera is moving.
There is thus provided, in accordance with a preferred embodiment of the present invention, a method for registration between first and second images, the method including defining, for each individual location from among a plurality of locations sparsely distributed over the first image, a local probability matrix in which each element represents the probability of a possible displacement between the individual location in the first image and its corresponding location within the second image, defining a combined probability matrix by combining corresponding elements over the plurality of probability matrices, and computing an alignment of the first and second images in accordance with a combination of at least one of the largest of the elements of the combined probability matrix.
The above method is particularly suitable for applications in which the images can be assumed to be translated only relative to one another.
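By way of non-limiting illustration, the translation-only case may be sketched as follows. The sketch assumes that the local probability matrices have already been computed by some local matcher, that "combining" means element-wise summation, and that only the single largest element is used; the function names `combine` and `estimate_translation` are hypothetical and the example matrices are invented for illustration.

```python
def combine(local_matrices):
    """Element-wise sum of equally sized local probability matrices.

    Summation is an assumption; the text only says elements are "combined".
    """
    rows, cols = len(local_matrices[0]), len(local_matrices[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for matrix in local_matrices:
        for i in range(rows):
            for j in range(cols):
                combined[i][j] += matrix[i][j]
    return combined


def estimate_translation(combined):
    """Return (dy, dx) of the largest element, measured from the matrix center."""
    rows, cols = len(combined), len(combined[0])
    best_i, best_j = max(
        ((i, j) for i in range(rows) for j in range(cols)),
        key=lambda ij: combined[ij[0]][ij[1]],
    )
    return best_i - rows // 2, best_j - cols // 2


# Hypothetical 3x3 local matrices: element (i, j) is the probability of a
# displacement of (i - 1, j - 1) pixels. Two locations favor "one pixel right".
right = [[0.0, 0.0, 0.0], [0.1, 0.2, 0.7], [0.0, 0.0, 0.0]]
still = [[0.0, 0.0, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 0.0]]
shift = estimate_translation(combine([right, right, still]))
print(shift)
```

Because the decision is taken on the combined matrix, a single location whose local matrix is ambiguous or misleading does not by itself determine the alignment.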
Further in accordance with a preferred embodiment of the present invention, the corresponding elements which are combined in the combined probability matrix defining step include elements within the local probability matrices which are similarly positioned if each individual local probability matrix is shifted to represent the effect on the individual location corresponding to the individual matrix, of a particular non-translational transformation between the first and second images.
The above method is particularly suitable for applications in which the images may be translated relative to one another and might be additionally rotated to a certain typically estimable extent.
Still further in accordance with a preferred embodiment of the present invention, the method also includes repeating the combined probability matrix defining step for each of a plurality of possible non-translational transformations between the first and second images, and selecting at least one most likely non-translational transformation from among the plurality of possible non-translational transformations, and the step of computing an alignment includes computing a relative non-translational transformation of the first and second images by computing a combination of the at least one most likely non-translational transformation, and computing a relative translation of the first and second images by computing a combination of at least one of the largest of the elements of the at least one combined probability matrices of the at least one most likely non-translational transformations.
Further in accordance with a preferred embodiment of the present invention, the step of selecting at least one most likely non-translational transformation from among the plurality of possible non-translational transformations includes comparing a set of at least one of the largest of the elements in each of the combined probability matrices of each of the plurality of possible non-translational transformations, selecting at least one set from among the compared sets whose members are largest, and selecting, as most likely non-translational transformations, the at least one non-translational transformation corresponding to the at least one set whose members are largest.
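The search over non-translational transformations may be sketched, purely as an illustration, for the case of candidate rotations. The sketch assumes rotation about the coordinate origin, rounds the rotation-induced displacement to whole pixels, combines by summation and compares only the single largest element of each combined matrix; the sign convention of the rotation and all names (`rotation_offset`, `combine_shifted`, `most_likely`, `one_hot`) are assumptions of the example and the data are synthetic.

```python
import math


def rotation_offset(point, angle_deg):
    """Rounded displacement (dy, dx) that a rotation about the origin causes at (y, x)."""
    y, x = point
    a = math.radians(angle_deg)
    rotated_y = x * math.sin(a) + y * math.cos(a)
    rotated_x = x * math.cos(a) - y * math.sin(a)
    return round(rotated_y - y), round(rotated_x - x)


def combine_shifted(locals_by_point, angle_deg):
    """Sum the local matrices after shifting each by its rotation-induced offset."""
    size = len(next(iter(locals_by_point.values())))
    combined = [[0.0] * size for _ in range(size)]
    for point, matrix in locals_by_point.items():
        oy, ox = rotation_offset(point, angle_deg)
        for i in range(size):
            for j in range(size):
                si, sj = i + oy, j + ox
                if 0 <= si < size and 0 <= sj < size:
                    combined[i][j] += matrix[si][sj]
    return combined


def most_likely(locals_by_point, candidate_angles):
    """Return (angle, (dy, dx)) for the candidate whose combined matrix peaks highest."""
    best = None
    for angle in candidate_angles:
        combined = combine_shifted(locals_by_point, angle)
        size = len(combined)
        peak, i, j = max(
            (value, i, j)
            for i, row in enumerate(combined)
            for j, value in enumerate(row)
        )
        if best is None or peak > best[0]:
            best = (peak, angle, (i - size // 2, j - size // 2))
    return best[1], best[2]


def one_hot(size, dy, dx):
    """Local matrix that is certain of a displacement of (dy, dx)."""
    matrix = [[0.0] * size for _ in range(size)]
    matrix[size // 2 + dy][size // 2 + dx] = 1.0
    return matrix


# Synthetic data: a 90-degree rotation about the origin plus a shift of (0, 1).
locals_by_point = {
    (0, 1): one_hot(5, 1, 0),
    (1, 0): one_hot(5, -1, 0),
    (0, -1): one_hot(5, -1, 2),
    (-1, 0): one_hot(5, 1, 2),
}
angle, translation = most_likely(locals_by_point, [0, 90])
```

For the correct candidate the shifted local matrices reinforce one another at a single element, so the combined peak is high; for a wrong candidate the evidence is spread over several elements and the peak stays low.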
Still further in accordance with a preferred embodiment of the present invention, the probability matrix is characterized in that each i,j element therewithin represents the probability that the individual point corresponds to an individual point in the second image, which is displaced correspondingly to the displacement of the i,j element from the center of the probability matrix.
Further in accordance with a preferred embodiment of the present invention, the corresponding elements which are combined include similarly positioned elements within the local probability matrices.
Still further in accordance with a preferred embodiment of the present invention, the method also includes executing the alignment.
Additionally in accordance with a preferred embodiment of the present invention, the method also includes executing the alignment by effecting the relative non-translational transformation and the relative translation of the first and second images.
Still further in accordance with a preferred embodiment of the present invention, the plurality of possible non-translational transformations between the first and second images includes at least one relative rotation between the first and second images.
Additionally in accordance with a preferred embodiment of the present invention, the plurality of possible non-translational transformations between the first and second images includes at least one relative zoom between the first and second images.
Further in accordance with a preferred embodiment of the present invention, the plurality of possible non-translational transformations between the first and second images includes at least one transformation which includes a combination of zoom and rotation between the first and second images.
Still further in accordance with a preferred embodiment of the present invention, the plurality of possible non-translational transformations between the first and second images includes at least one affine transformation between the first and second images.
Also provided, in accordance with another preferred embodiment of the present invention, is a method for detecting motion within a scene by comparing first and second time-separated images of the scene, the method including defining, for each individual location from among a plurality of locations distributed over the first image, a local probability matrix in which each element represents the probability of a possible displacement between the individual location in the first image, representing an individual portion of the scene in the first image, and its corresponding location within the second image, and ranking the local probability matrices into a plurality of ranks of matrices, differing in the probability that the individual location corresponding to a matrix belonging to a rank was displaced between the first and second images, relative to what is known regarding camera motion between the first and second images.
The above method is particularly suited for applications in which the camera is assumed stationary.
Further in accordance with a preferred embodiment of the present invention, the ranking step includes comparing the center region of each local probability matrix to the peripheral regions thereof.
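For a stationary camera, the comparison of the center region to the peripheral regions may be sketched as below. This is only one possible reading: the score is taken to be the fraction of probability mass outside a central square, and the three ranks and their cut-off values (0.3 and 0.7) are hypothetical choices made for the example, as are the names `motion_score` and `rank`.

```python
def motion_score(matrix, radius=0):
    """Fraction of the probability mass lying outside the central region."""
    center = len(matrix) // 2
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return 0.0
    central = sum(
        matrix[i][j]
        for i in range(center - radius, center + radius + 1)
        for j in range(center - radius, center + radius + 1)
    )
    return (total - central) / total


def rank(score, thresholds=(0.3, 0.7)):
    """0 = probably static, 1 = uncertain, 2 = probably displaced (assumed cut-offs)."""
    return sum(score > t for t in thresholds)


still = [[0.0, 0.0, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 0.0]]
moved = [[0.0, 0.0, 0.0], [0.1, 0.2, 0.7], [0.0, 0.0, 0.0]]
ranks = [rank(motion_score(m)) for m in (still, moved)]
```

A matrix concentrated at its center indicates zero displacement, which for a stationary camera means the location belongs to the static scene.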
Still further in accordance with a preferred embodiment of the present invention, the ranking step includes constructing a combined probability matrix in which each element represents the probability of a possible camera motion-caused displacement between the first image and the second image, and ranking the local probability matrices in accordance with the degree to which they respectively resemble the combined probability matrix.
The above method is particularly suited to applications in which the camera cannot be assumed to be stationary.
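When the camera may be moving, the ranking by resemblance to the combined probability matrix may be sketched as follows. The resemblance measure used here, the overlap (sum of element-wise minima) of the two normalized matrices, is an assumption of the example, since no particular measure is prescribed above; the names `normalize` and `resemblance` and the example matrices are likewise hypothetical.

```python
def normalize(matrix):
    """Scale a matrix so that its elements sum to one."""
    total = sum(sum(row) for row in matrix)
    return [[value / total for value in row] for row in matrix]


def resemblance(local, combined):
    """Overlap of two normalized matrices: 1.0 if identical, 0.0 if disjoint."""
    a, b = normalize(local), normalize(combined)
    return sum(
        min(x, y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


# Three locations follow a camera pan of one pixel; the fourth does not.
pan = [[0.0, 0.0, 0.0], [0.1, 0.2, 0.7], [0.0, 0.0, 0.0]]
odd = [[0.0, 0.0, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 0.0]]
local_matrices = [pan, pan, pan, odd]

# Combined matrix by element-wise summation (an assumed combination rule).
combined = [
    [sum(m[i][j] for m in local_matrices) for j in range(3)]
    for i in range(3)
]
scores = [resemblance(m, combined) for m in local_matrices]

# Least-resembling locations first: these are the likeliest independent movers.
order = sorted(range(len(scores)), key=scores.__getitem__)
```

Since most locations are assumed to belong to the background, the combined matrix approximates the camera-induced displacement, and a local matrix that disagrees with it marks a candidate moving object.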
Further in accordance with a preferred embodiment of the present invention, the method also includes deriving the second image from a third image of the scene, separated in time from the first image of the scene, including selecting a transformation which, when applied to the third image, results in an image aligned generally to the first image, in the sense that the two images, if not totally aligned, can be brought into alignment by applying a translation, and applying the transformation to the third image, thereby to derive the second image.
Still further in accordance with a preferred embodiment of the present invention, the transformation has a non-translational component. Additionally in accordance with a preferred embodiment of the present invention, the transformation having a non-translational component is a non-translational transformation.
Also provided, in accordance with another preferred embodiment of the present invention, is a system for registration between first and second images, the system including a local probability matrix generator operative to define, for each individual location from among a plurality of locations sparsely distributed over the first image, a local probability matrix in which each element represents the probability of a possible displacement between said individual location in the first image and its corresponding location within the second image, a combined probability matrix generator defining a combined probability matrix by combining corresponding elements over the plurality of probability matrices, and an image aligner computing an alignment of the first and second images in accordance with a combination of at least one of the largest of the elements of the combined probability matrix.
Also provided, in accordance with another preferred embodiment of the present invention, is a system for detecting motion within a scene including at least one moving object by comparing first and second time-separated images of the scene, the system including a local probability matrix generator defining, for each individual location from among a plurality of locations distributed over the first image, a local probability matrix in which each element represents the probability of a possible displacement between said individual location in the first image, representing an individual portion of the scene in the first image, and its corresponding location within the second image, and a location displacement evaluation unit operative to rank the local probability matrices into a plurality of ranks of matrices, differing in the probability that the individual location corresponding to a matrix belonging to a rank was displaced between the first and second images, relative to what is known regarding camera motion between the first and second images.
The present invention also seeks to provide preferred methods and systems for replacing background in a movie of a subject in front of an arbitrary background, in applications including but not limited to videoconferencing, Internet chats, videophone and teaching.
Many applications, from videoconferencing to videophone, transmit a video of a speaker in front of a background. These applications can benefit from the ability to change the background and transmit a different one in order to hide the original background, to create a desired illusion or just for variety (mainly for Internet chats).
Background replacement is used in virtual studios for the television and movie industries, where special screens of a typical color serve as the original background (to be replaced). This document discusses real-time replacement of a given uncontrolled background with a new one. Each of the subscribers taking part in a conversation can choose to change her/his background.
A background replacement system, according to a preferred embodiment of the present invention, receives a movie of a subject in front of a background, and creates, in real time, a new sequence in which the subject is unchanged and a new background that was pre-chosen by the user replaces the original one. The new background can be static or a dynamic one.
Background replacement typically includes the following steps:
I. Creating an alternative background as desired. This new, static or dynamic, background is transmitted only once.
II. (For each frame in the movie) Separating the subject from the original background.
III. (For each frame in the movie) Blending the subject with the new background.
Separating the subject from the original background is done for each frame. An insufficient separation might cause artifacts such as leaving parts of the original background around the subject or replacing parts of the subject by the new background. The new background is created and transmitted only once. For each frame, only the subject and the camera motion are transmitted. Inserting the subject into a background frame typically takes into account the camera motion: the camera motion causes background motion that is typically applied to the virtual background. For example, the system of the present invention may be operative to accommodate a camera's rotation around any rotation axis.
Described herein are possible methods for alternative background creation and real-time background replacement, including alternative background creation, the process of separating the subject from the background and the blending of the subject with the new, virtual background. The system operation preferably includes the following steps: creating the virtual background and transmitting it, marking the subject in the first frame, either manually or by using some automatic criteria, and performing the following operations for each frame: detecting the background motion and transmitting it, tracking the subject in the current frame and transmitting the subject, and blending the subject into the new background, the blending typically being performed by the receiver.
There are several approaches for the alternative background creation. One possible method is to use a given image or movie. Another alternative is to create a virtual 3-D environment on which the image (movie) is projected as a texture. The software typically provides the tools for the virtual 3-D world construction and for the texture mapping (static or dynamic texture). This possibility requires more user work, but typically enables complex virtual backgrounds and free camera motion.
The above possibilities require images and/or movies. One can use ready-made images (movies) or obtain synthetic ones using a texture generator. A texture generator learns the statistics and/or structure of a given texture from an image or from a sequence of images and generates new images of the same type. The same approach can be applied to movies, as described e.g. in L.-Y. Wei, M. Levoy, "Fast texture synthesis using tree-structured vector quantization", SIGGRAPH 30-08-2000. Using the generator yields a variety of images of a specific type, e.g. waterfall or forest images. Since the generator learns mainly statistics, it is especially useful for natural phenomena.
For broadcasting, the virtual background is typically transmitted only once. The location of the subject in the background and the location of the first image in the virtual background are typically set in advance. This location changes with the motion of the physical camera.
The system preferably has the ability to cut the subject from the original movie, and to reliably paste it into, and blend it with, the new background. Cutting the subject relies on the ability to identify and track it along the image sequence.
An initial denoting of the subject can be done manually (e.g. using the computer's mouse) or automatically, using some criteria. We use motion detection and manual marking, but other criteria are possible. Using motion detection assumes that the subject moves before the transmission starts, so that the system can identify it. Separating a moving subject from the original background is done using a motion tracker that identifies the location and the shape of the subject in each frame. Using motion for subject identification is not equivalent to assuming that the subject moves in all frames. After the subject moves once, the system can detect its location in each frame, even if it does not move any longer.
For a moving camera, the motion of the background is accurately computed. Any inaccuracies result in a "floating" subject in front of the new background. There are many methods to compute the background motion, e.g. the method described in Rosenberg, Y. and Werman, M. "Real-time object tracking from a moving video camera: a software approach on a PC", IEEE Workshop on Applications of Computer Vision, Princeton, Oct. 1998, pp. 238-239. The Rosenberg-Werman method is particularly suited for rotating camera applications. The Rosenberg-Werman reference refers to real-time motion detection, tracking and background motion computation using a standard PC, although analysis of the background motion is limited to scenes that contain enough information (the system would not recognize a rotation of the camera for a subject in front of a smooth and empty wall). The methods described assume that most of each frame is the background.
Videoconference applications typically do not transmit the entire original frame: only the subject, its location in the image and the camera motion (background motion) parameters are typically transmitted. The location of the subject in the virtual frame can be set in a pre-defined location (e.g. in the center) or according to its location in the original frame.
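The per-frame data that is transmitted, and the two placement options, may be illustrated by the following sketch. The record layout `FramePacket`, its field names and the helper `paste` are hypothetical; no wire format is defined above, and images are represented as plain nested lists only to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class FramePacket:
    """What a sender might transmit per frame (hypothetical layout)."""
    subject: list          # rows of subject pixels (bounding-box crop)
    mask: list             # same shape; 1 where the pixel belongs to the subject
    location: tuple        # (top, left) of the crop in the original frame
    camera_motion: tuple   # estimated background motion (dy, dx)


def paste(packet, background, centered=False):
    """Copy the masked subject pixels onto a copy of the background frame."""
    out = [row[:] for row in background]
    height, width = len(packet.subject), len(packet.subject[0])
    if centered:
        top = (len(out) - height) // 2
        left = (len(out[0]) - width) // 2
    else:
        top, left = packet.location
    for y in range(height):
        for x in range(width):
            if packet.mask[y][x]:
                out[top + y][left + x] = packet.subject[y][x]
    return out


background = [[0] * 4 for _ in range(3)]
packet = FramePacket(subject=[[9, 9]], mask=[[1, 0]],
                     location=(0, 2), camera_motion=(0, 0))
placed = paste(packet, background)
centered = paste(packet, background, centered=True)
```

In this sketch the `camera_motion` field is carried in the packet but not used by `paste`; a receiver would use it to move the virtual background before pasting the subject.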
Inserting the subject into the new movie involves placing the subject relative to the background and naturally blending the subject with the background.
Real camera motion implies a background motion that typically leads to the same motion of the virtual background. The accuracy of the motion estimation is crucial to the reliability of the result. A wrong motion of the virtual background causes the subject "to float" in front of the background. We use the method described in the above-referenced Rosenberg-Werman publication to accurately compute the background motion, but other methods are possible.
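Applying the estimated background motion to the virtual background may be sketched, for the translation-only case, as moving a viewing window over a larger virtual background image. The sign convention (image content that shifts by (dy, dx) corresponds to the window moving by (-dy, -dx)), the clamping at the background border and the names `advance_window` and `crop` are assumptions of the example; the motion estimate itself is taken as given.

```python
def advance_window(top, left, motion, bg_size, win_size):
    """Move the viewing window opposite to the observed background motion."""
    dy, dx = motion
    max_top = bg_size[0] - win_size[0]
    max_left = bg_size[1] - win_size[1]
    # Clamp so the window never leaves the virtual background.
    return (min(max(top - dy, 0), max_top),
            min(max(left - dx, 0), max_left))


def crop(background, top, left, win_size):
    """Cut the currently visible frame out of the virtual background."""
    height, width = win_size
    return [row[left:left + width] for row in background[top:top + height]]


# Virtual background whose pixel value equals its column index.
background = [list(range(6)) for _ in range(4)]

# The real background was observed to shift one pixel to the right.
top, left = advance_window(1, 2, motion=(0, 1), bg_size=(4, 6), win_size=(2, 3))
frame = crop(background, top, left, (2, 3))
```

In the example the window initially showed columns 2 to 4; after the update it shows columns 1 to 3, so every virtual background pixel appears one position further to the right, matching the observed motion.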
The system places the subject in each frame relative to its location in the original frame. It is possible, for a static camera, to place the subject in a fixed location (e.g. always in the center of the frame).
Blending the subject with the background should look natural and should preferably overcome possible errors in cutting the subject from the original frame, as described e.g. in Burt and Adelson, "A multiresolution spline with application to image mosaics", ACM Transactions on Graphics, 2(4), pp. 217-236, October 1983.
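As a simplified illustration of blending, the sketch below feathers the subject mask and mixes subject and background in proportion to the softened mask. This is a single-resolution alpha blend, not the multiresolution spline of the Burt and Adelson reference; it is shown only because it is short, and the names `feather` and `blend`, the box-blur radius and the grayscale nested-list image format are assumptions of the example.

```python
def feather(mask, radius=1):
    """Box-blur a binary mask so that its edge becomes a gradual 0..1 ramp."""
    height, width = len(mask), len(mask[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                mask[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out


def blend(subject, background, alpha):
    """Per-pixel mix: alpha = 1 keeps the subject, alpha = 0 keeps the background."""
    return [
        [a * s + (1 - a) * b for s, b, a in zip(row_s, row_b, row_a)]
        for row_s, row_b, row_a in zip(subject, background, alpha)
    ]


# One scanline: bright subject on the right, dark new background on the left.
mask = [[0, 0, 1, 1, 1]]
subject = [[100, 100, 100, 100, 100]]
background = [[0, 0, 0, 0, 0]]
result = blend(subject, background, feather(mask))
```

The softened edge spreads the transition over a few pixels, which tends to hide small errors in the cut-out contour at the cost of a slight halo.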
There is thus provided, in accordance with a preferred embodiment of the present invention, a background replacement method and system for processing an image sequence representing a scenario having a first portion to be replaced and a second moving portion, at least a portion of whose motion is to be retained, the method including providing a first image including a first portion to be replaced and a second moving portion, providing a distinction between the first and second portions, and providing a new image in which at least a portion of the motion of the second portion is retained and the first portion is replaced with new image content.
Further in accordance with a preferred embodiment of the present invention, the method and system also include providing a second image in the image sequence and repeating the distinction providing and new image providing steps for the second image.
Still further in accordance with a preferred embodiment of the present invention, the distinction between the first and second portions of the second image is provided by tracking the second portion from at least the first image to the second image, e.g. by comparing several images in the vicinity of the first image in order to track from the first image to the second image.
Additionally in accordance with a preferred embodiment of the present invention, the first portion in the second image is defined as all portions of the second image which are not included in the second portion of the second image, tracked from the second portion of the first image.
Still further in accordance with a preferred embodiment of the present invention, the second moving portion includes a portion of the image having at least one subportion which moves in at least one portion of the scenario.
Further in accordance with a preferred embodiment of the present invention, the new image providing step includes transmitting the new image to a remote location and/or displaying the new image.
Further in accordance with a preferred embodiment of the present invention, the step of providing a distinction includes receiving a user's indication of the location of the second portion in the first image.
Still further in accordance with a preferred embodiment of the present invention, the method and system also include automatically improving the user's indication of the location of the second portion.
Further in accordance with a preferred embodiment of the present invention, the step of automatically improving includes automatically searching for the second portion, adjacent the user's indication of the location of the second portion, in accordance with a distinguishing criterion distinguishing between the first and second portions.
Still further in accordance with a preferred embodiment of the present invention, the step of providing a distinction includes automatically distinguishing the second portion from the first portion in accordance with a distinguishing criterion distinguishing between the first and second portions.
Still further in accordance with a preferred embodiment of the present invention, the distinguishing criterion includes a color criterion.
Additionally in accordance with a preferred embodiment of the present invention, the distinguishing criterion includes a motion criterion.
Still further in accordance with a preferred embodiment of the present invention, the distinguishing criterion includes a textural criterion.
Further in accordance with a preferred embodiment of the present invention, the automatic searching step includes detecting edges adjacent the location of the second portion as indicated by the user.
Still further in accordance with a preferred embodiment of the present invention, the automatic searching step includes detecting a contour adjacent the location of the second portion as indicated by the user.
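Improving a user's indication by searching for a nearby edge may be sketched on a single scanline as follows. The edge measure (absolute intensity difference between horizontally adjacent pixels), the search radius and the name `snap_to_edge` are hypothetical choices for the example; a practical system would more likely operate on a two-dimensional contour.

```python
def snap_to_edge(row, marked_x, search=2):
    """Return the column near marked_x with the largest jump to the next pixel."""
    lo = max(0, marked_x - search)
    hi = min(len(row) - 2, marked_x + search)
    return max(range(lo, hi + 1), key=lambda x: abs(row[x + 1] - row[x]))


# Intensities along one row: background on the left, subject on the right.
row = [10, 10, 10, 12, 80, 82, 81]

# The user marked the boundary at column 2; the strongest edge is at column 3.
boundary = snap_to_edge(row, marked_x=2)
```

Restricting the search to a small neighborhood of the user's mark keeps the correction local, so a strong edge elsewhere in the image cannot pull the boundary away from the indicated subject.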
Additionally in accordance with a preferred embodiment of the present invention, the steps of providing the first and second images respectively include employing a moving, e.g. pivoting, camera to generate the first and second images.
Still further in accordance with a preferred embodiment of the present invention, the step of providing a new image includes the steps of estimating motion parameters of the moving camera quantifying motion of the moving camera between the first and second images, providing first new image content for the first image, and generating second new image content for the second image by applying the motion parameters to the first new image content.
Also provided, in accordance with another preferred embodiment of the present invention, is a background replacement system for processing an image sequence representing a scenario having a first portion to be replaced and a second moving portion, at least a portion of whose motion is to be retained, the system including an image source providing a first image including a first portion to be replaced and a second moving portion, an image analyzer providing a distinction between the first and second portions, and an image replacer providing a new image in which at least a portion of the motion of the second portion is retained and the first portion is replaced with new image content.
Replacing the second portion can be performed by tracking the motion of the face and limbs and inducing their motion to the replacement of the second portion. The replacement content is preferably prepared in advance and has motion parameters corresponding to the model of the original object. Having corresponding parameters need not require having the same number of limbs. Instead, for example, logical rules can be formulated to establish a correspondence between motion of one specific limb in the image content being replaced, and between motion of several limbs in the replacing image content.
A suitable method for tracking and animating of facial features is described in F. I. Parke, K. Waters, Computer facial animation, A. K. Peters Ltd., 1996. Methods for tracking of humans and animating of human and human-like creatures are described in chapters 9 and 10 and elsewhere in N. Magnenat-Thalmann and D. Thalmann, Computer animation theory and practice, Springer-Verlag, Tokyo, 1985.