1. Field of the Invention
This invention relates to the tracking of objects in the frames of a video sequence for interactive multimedia applications, interactive television, and games, for example.
2. Description of the Related Art
As computers have improved in processing power, display capability, and storage capacity, computer applications incorporating video sequences have become commonplace. One important feature for the success of these "video applications" is the ability of a user to interact with objects in the video sequence.
For example, an educational program about marine life can incorporate a video sequence of marine life to more vividly display the various creatures of the ocean. Ideally, the program is interactive so that, when a child using a pointer on the screen selects a particular creature, the computer either states the scientific name and a short description of the creature or displays information about the creature on the screen. The computer must correlate the location of the pointer on the screen with the location of the creatures in the image to determine which creature was selected. If the video sequence is generated by the computer, then the computer should know the location of each creature. However, for realism, an actual video sequence of real ocean creatures should be digitized and displayed by the computer. In this case, the computer is only given a series of images and will not know the location of each creature within the image. Consequently, a separate data structure must be created containing the location of each creature in each frame of the video sequence in order for the computer to correlate the location of the pointer to the location of the creatures to determine which creature was selected. In other programs, the location of other objects must be tracked much as the location of the ocean creatures is tracked. Typically, the interactive video application will combine the original video sequence and the location information into an interactive video sequence on a computer readable medium, such as CD-ROMs, magnetic disks, or magnetic tapes.
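By way of illustration only, the separate data structure and the pointer correlation described above can be sketched as follows. The sketch assumes that each object location is stored as an axis-aligned rectangle in pixel coordinates and that frames are indexed by integers; the names `Box`, `LocationTable`, and `select_object` are hypothetical and are not taken from any particular system.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Box:
    """Assumed location format: an axis-aligned rectangle in pixels."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        # True when the pointer lies inside or on the edge of the rectangle.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


# Hypothetical layout: frame index -> {object label -> rectangle}.
LocationTable = Dict[int, Dict[str, Box]]


def select_object(table: LocationTable, frame: int, x: int, y: int) -> Optional[str]:
    """Return the label of the first object whose rectangle contains the
    pointer in the given frame, or None if no tracked object is hit."""
    for label, box in table.get(frame, {}).items():
        if box.contains(x, y):
            return label
    return None


# Example table for a single frame of a marine life sequence.
table: LocationTable = {
    0: {"dolphin": Box(10, 10, 50, 40), "crab": Box(100, 120, 130, 140)},
}
selected = select_object(table, frame=0, x=20, y=20)
```

A table of this kind requires one entry per object per frame, which is why the manner in which the entries are generated determines the cost of producing an interactive video sequence.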
The advances in computing have been carried over into the video industry in the form of microprocessor controlled "set-top boxes" for applications such as interactive television with video-on-demand systems. In an interactive television system, it is desirable to allow the user to select objects on the television screen and receive information regarding the selected object. For example, a customer may request a video about Pro Football's greatest games which will contain various video sequences of significant games. A customer will derive greater enjoyment from the video if he is able to select players using a pointer on the screen to receive additional information on the screen about the selected player. For example, the customer may be interested in the player's game statistics or career statistics. In order to determine which player has been selected, the microprocessor must be able to correlate the location of the pointer to the location of the players. Therefore, each player image must be tracked as a separate object throughout the video sequence. Furthermore, the microprocessor must receive the object data as well as the desired information about the players at the same time the video is sent to the set-top box. The interactive video sequence combining the original video and the location information can be sent to the set-top box using various transmission lines, such as coaxial cable, phone lines, or fiber-optic cabling. Alternatively, the interactive video sequence can be broadcast to the set-top boxes over the airwaves.
Unless the processor knows the location of all objects in the video sequence, the processor will be unable to discern which object the user wishes to select. Consequently, in order to provide interactive television or interactive video applications, the location of each object in each frame of the original video sequence must be tracked for the processor.
Typically, object locations are generated by having an operator manually mark objects on a computer display for each frame of the video sequence. For the marine life video, the operator would first identify a particular creature to be tracked. The operator would then use a mouse or other user interface hardware to draw a rectangle around the creature on the current frame. The computer then stores the information on that creature for that frame. Once the operator finishes marking the frame, he must proceed to the next frame. This continues for each frame of the video sequence and for each creature in the video sequence. Alternatively, the operator can mark and label every creature on each frame before proceeding to the next frame.
Full-motion video currently uses thirty frames per second, so even one minute of video comprises 1800 frames. Tracking the objects manually is therefore extremely inefficient and tedious. Since the objects will probably not move far between frames, some conventional systems attempt to estimate the motion of tracked objects using interpolation. For example, an operator may manually track an object once for every second of the video sequence. The system then estimates the location of the object in the frames between the manually tracked frames by linear interpolation using the position of the object in the manually tracked frames. This method can produce reliable results for objects which exhibit linear motion; however, for objects that move nonlinearly this method is very inaccurate.
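The conventional interpolation approach described above can be sketched as follows, by way of illustration only. The sketch assumes that an object position is reduced to a single (x, y) point and that the manually tracked frames are supplied as a mapping from frame index to position; the name `interpolate_positions` is hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def interpolate_positions(keyframes: Dict[int, Point]) -> Dict[int, Point]:
    """Estimate a position for every frame between manually tracked
    frames by linear interpolation between consecutive tracked frames."""
    frames = sorted(keyframes)
    result: Dict[int, Point] = {}
    for start, end in zip(frames, frames[1:]):
        (x0, y0), (x1, y1) = keyframes[start], keyframes[end]
        span = end - start
        for f in range(start, end):
            t = (f - start) / span  # fraction of the way from start to end
            result[f] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    if frames:
        # The last tracked frame is copied through unchanged.
        result[frames[-1]] = keyframes[frames[-1]]
    return result


# One manually tracked frame per second at thirty frames per second.
estimated = interpolate_positions({0: (0.0, 0.0), 30: (30.0, 60.0)})
midpoint = estimated[15]
```

Every estimated position lies on the straight line joining two tracked positions, so an object that follows a curved path or changes speed between tracked frames is misplaced in the intermediate frames.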
Hence there is a need for a method or system to automatically track an object through a video sequence more rapidly, more accurately, and more efficiently than conventional methods.