The present invention relates to the fields of data compression, transmission, decompression, storage and display for graphic images such as film, video (television) and other moving picture image sequences. In particular, the present invention relates to systems for compressing and decompressing moving picture image sequences by an asynchronous, non-frame-based technique. The system for compressing is reductive, i.e. "lossy". The "lossiness" is adjustable and can be tailored to suit factors such as available bandwidth, available storage capacity or the complexity of the image.
There has been slow progress in uniting the world of video and film with the power of the computer so that motion picture images, especially live video, can be quickly transmitted to users within a computer network. The advent of the computer network has brought forth tremendous communications capability. Where computers were once seen only as whirring number crunchers and processing machines, they are now also seen as potential vehicles for entertainment, advertising, information access and communication. The potential of video technology holds tantalizing opportunities for businesses, entrepreneurs and the public at large. In the workplace, the ordinary PC, a fixture on most office desks, could better maximize business resources with video conferencing and other interactive communications that link one worker or working group to another. Intraoffice computer networks could provide training, demonstrations, reports and news through broadcasts using one centralized computer to send live or taped video to workstations within the office or to linked office and customer sites. Previously, live visual communication links were not thought feasible without specialized video or television equipment.
The establishment of the Internet and its World Wide Web has also created demand for increased use of motion pictures in computer applications. Businesses see the Internet's vast network potential as a boon for interactive communications with the public at large. Entrepreneurs have envisioned and have even attempted live, on-line broadcasts of news, concerts and other events; those attempts have been frustrated by the current limitations of real-time computer video technology. Further, as more people communicate via the World Wide Web, there is a natural incentive to create polished information access sites. Internet users come steeped in the heritage of television, movies and other forms of highly produced motion picture entertainment. These users imagine communicating with that same clarity, expediency and visual power and have come to expect such standards.
The potential for such real-time video communications exists, but until this point there has been great difficulty in transmitting motion picture image sequences, live video (television) and previously recorded film and video through the computer. Computer speed, memory and disk storage capacities have expanded enough to make the storage of digitized film and video clips possible. However, the inordinate amount of data that must be transmitted to display a digitized moving picture sequence on the computer has been one factor preventing the widespread use of video and film in real-time applications, especially those in which speed is imperative, like video conferencing, live news feeds and live entertainment broadcasts.
The data problem pertains to the nature of the digital computer and network hardware, the method by which a computer generates images and the processing that is needed to handle the many, many images that make up a motion picture sequence. Since its invention, motion picture technology has followed a process of presenting a rapid sequence of still images to give the impression of motion to the eye. A film is essentially a "flip book" of still camera photographs (i.e. frames) stored on a long strip used for playback through a projector. Current video technology follows the same frame-based concept as film, with some variation. A video camera rapidly collects a sequence of light images by scanning in horizontal movements across a light sensitive device and outputting a stream of "broadcast line" data which describes the image. Typically, a camera scans every other available line on the light sensitive device and alternates between line sets (odd and even) to create two, one-half frame "fields" which, when interlaced, form a full-frame image. Video has typically been recorded by video camera in analog format, but cameras which can record video in digital format are available. To transmit analog video via a computer, each frame or field input to the computer must be converted into a digital format or "digitized" for use. A computer screen is made up of thousands of pixels: programmable light units which can be instantly set and reset to emit light in one of the multitude of colors supported by the computer system. Typical monitors (ranging from 12-21 inches on the diagonal) contain matrices having resolutions of e.g. 640×512, 1,024×820, 1,280×1,024 and 1,600×1,280 pixels organized into rows of pixels stacked upon rows of pixels.
Each pixel in the screen display requires a color assignment from the computer to construct an image. Computer display controllers contain a large memory space, called a bitmap memory, which allocates an amount of memory for each pixel unit on the screen, e.g. 640×512, 1,024×820, 1,280×1,024, etc. (Other screens of the same size, which are processed and worked on in the background, can also be defined in the bitmap memory.) The computer drives the monitor and creates images via the bitmap memory, writing pixel color assignments to its memory locations and outputting signals to the monitor based on those assignments. The digitization process creates a set of digital pixel assignments for each frame or field of video input.
During video capture a computer executes an analog-to-digital "A/D" conversion process, reading the provided film or video data (using specialized "frame grabber" hardware) and transforming the analog data into a stream of digital color codes, i.e. a bitmap data set for each frame or field of the motion picture. The data size of a digital video stream depends upon the resolution at which the video was digitized. Resolution depends upon factors such as: i) frame resolution or frame size; ii) color depth; and iii) frame rate.
Frame resolution, or frame size, is the size in pixels of each digitized frame bitmap. Frame size does not need to be directly related to the monitor resolution in any computer configuration. Thus, while a monitor may have a resolution of 640×512 or 1,024×820, for example, a video can be digitized with a different resolution, such as 320×240. Video following the National Television System Committee (NTSC) standard for analog resolution digitizes to frames of 640×480, 320×240, 160×120 or other resolutions. Such video could well be displayed on a computer having a monitor resolution of 1,280×1,024 or other resolution.
Color depth specifies the number of bits used by the digitizer to describe the color setting for each pixel of a digitized frame bitmap. Computer pixel units typically output color following one of several color-generating systems. RGB (Red, Green, Blue) is one system which permits all the colors of an available palette to be expressed as combinations of different amounts of red, green and blue. Red, green and blue light elements or "color channels" are considered primary and can be blended according to color theory principles to form other colors. Electron guns fire beams to activate each of the light elements to different degrees and form colors that make up an image. The pixel assignments written to the bitmap memory control the settings used in the monitor to output colors using the pixels.
Computers vary greatly in the range of colors they can support, the number often depending on the size of the bitmap memory (an expensive item) and the size of the memory space dedicated to each pixel in the bitmap. Color systems that support a palette of 256 (or 2^8) different colors allocate 8 binary bits (or one byte) to each pixel in the bitmap memory and make pixel color assignments by writing 8-bit numbers to those locations. Such systems are said to provide "8-bit" color. More advanced systems support palettes of 65,536 (or 2^16) or 16,777,216 (or 2^24) colors and hence allocate either 16 or 24 bits (two or three bytes) per pixel in the bitmap memory. These systems are said to provide "16-bit" or "24-bit" color. A 24-bit color system is said to display in "true color," or in as many colors as the human eye can discern. Video can be digitized to follow an 8-bit, 16-bit, 24-bit or other format. In the digitizing process, it is not necessary that the digitized video use the color format of the displaying computer. For example, it is possible using analog-to-digital conversion software to digitize a video in 16-bit color and display the video on a computer configured for 24-bit color. Most computers supporting color video have software available to make such translations.
Finally, frame rate is the speed at which the camera captures the video frames. Motion picture sequences give the impression of movement when images are displayed at rates of more than 12-15 frames per second. Video cameras following the NTSC standard used in the United States output at 30 frames per second or 60 fields per second. Many frame grabbers can capture and digitize analog video at real-time motion speeds of 30 frames a second. However, many frame grabbers digitize at lower speeds, such as at 15 frames per second. If the computer system depends on a frame grabber with a low frame processing speed, then frame rate would also be tied to the frame grabber's processing rate.
Using the variables of frame size, color depth and frame rate it is possible to make calculations showing the speed at which digitized video in a bitmap form flows into the memory of the processing computer. Video digitized at a relatively small 320×240 picture size, with a 24-bit (3 byte) color depth and a frame rate of 15 frames/second (sampling every other video frame) requires approximately 207 megabytes (Mb) of storage per minute. A video sequence digitized at a 640×480 frame size, a 24-bit (3 byte) color depth and a 30 frames/second rate would require approximately 1.54 gigabytes (Gb) of storage per minute of video. Both requirements clearly choke the disk storage capacity available on most commercially available hard drives, which provide on the order of 1 Gb of space in total. Further, even if the processor available on the computers could feed the data for transmission directly to a remote terminal, the transmission capacity (i.e. the "bandwidth") of most communications systems used today is not capable of handling such a data flow in real time.
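The arithmetic behind these two figures can be sketched as follows. This is only an illustrative calculation; the function name `bitmap_bytes_per_minute` is a hypothetical helper and not part of the described system. It also shows that the 207 Mb figure matches a megabyte of 10^6 bytes, while the 1.54 Gb figure matches a gigabyte of 2^30 bytes (about 1.66 × 10^9 bytes).

```python
def bitmap_bytes_per_minute(width, height, bytes_per_pixel, frames_per_second):
    """Volume of uncompressed bitmap video, in bytes per minute."""
    return width * height * bytes_per_pixel * frames_per_second * 60


# 320x240 frames, 3 bytes per pixel, 15 frames per second
small = bitmap_bytes_per_minute(320, 240, 3, 15)
# 640x480 frames, 3 bytes per pixel, 30 frames per second
large = bitmap_bytes_per_minute(640, 480, 3, 30)

print(small)                     # 207360000 bytes per minute
print(round(small / 10**6, 1))   # 207.4 (megabytes of 10^6 bytes)
print(large)                     # 1658880000 bytes per minute
print(round(large / 2**30, 2))   # 1.54 (gigabytes of 2^30 bytes)
```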
Commercially available modems can transfer data at rates of e.g., 28,000 baud, which translates roughly to 28,000 bits (3500 bytes) per second or approximately 0.2 Mb per minute, clearly not sufficient capacity to handle the 207 Mb per minute or the 1.54 Gb per minute requirements outlined above. An Integrated Services Digital Network (ISDN) connection provides greater transmission capability than most commercially available modems but still does not provide the capacity necessary for transmitting streams of video in bitmap data form. A typical ISDN Internet connection transfers data at rates approaching 128 kilobits per second (approximately 1 Mb per minute). Local area networks (LANs) have data rates that vary depending on the size of the LAN, the number of users, the configuration of the LAN system and other factors. Although LAN transmission rates vary widely, a typical Ethernet system transfers information at a rate of 10 megabits per second. Faster Ethernet systems can transfer information at a rate of 100 megabits per second.
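To make the mismatch concrete, the following sketch estimates how long each link would need to carry one minute of the smaller (320×240, 15 frames/second) bitmap stream. It rests on assumptions that should be read as such: the link rates are taken in bits per second (28,000 for the modem, 128,000 for ISDN, 10 and 100 million for Ethernet), protocol overhead is ignored, and `minutes_to_send` is a hypothetical helper.

```python
# One minute of 320x240, 3-byte, 15 frames/second bitmap video, in bytes.
VIDEO_BYTES_PER_MINUTE = 320 * 240 * 3 * 15 * 60

# Assumed nominal link rates in bits per second (no protocol overhead).
LINK_BITS_PER_SECOND = {
    "modem": 28_000,
    "isdn": 128_000,
    "ethernet": 10_000_000,
    "fast_ethernet": 100_000_000,
}


def minutes_to_send(payload_bytes, bits_per_second):
    """Minutes of link time needed to move payload_bytes at the given rate."""
    return payload_bytes * 8 / bits_per_second / 60


for name, rate in LINK_BITS_PER_SECOND.items():
    minutes = minutes_to_send(VIDEO_BYTES_PER_MINUTE, rate)
    print(f"{name}: {minutes:.1f} minutes of link time per minute of video")
```

Any value above 1.0 means the link cannot keep up with the uncompressed stream in real time.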
The large amount of space required by digitized video data in bitmap form makes it largely impossible to make real-time transmissions of such data given the current bandwidth of most network systems. Thus, researchers have searched for ways to "compress" bitmap data, that is, to encode the data differently so that it will take up less space but still yield the same images. Compression algorithms reduce the amount of data used to store and transmit graphic images, while keeping enough data to generate a good quality representation of the image.
Data compression techniques are either "lossless" or "lossy." A lossless compression system encodes the bitmap data file to remove redundancies but loses none of the original data after compression. A bitmap file which is compressed by a lossless compression algorithm and thereafter decompressed will output exactly as it had before it was compressed. Run-length encoding (RLE) and LZW (Lempel-Ziv-Welch) encoding are examples of lossless encoding algorithms.
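As a small illustration of the lossless property, the following is a minimal run-length encoding sketch for one row of pixel values. It is a generic textbook form, not any particular file format's RLE variant; `rle_encode` and `rle_decode` are hypothetical names.

```python
def rle_encode(pixels):
    """Collapse consecutive equal values into (run_length, value) pairs."""
    runs = []
    for value in pixels:
        if runs and runs[-1][1] == value:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, value])   # start a new run
    return [(count, value) for count, value in runs]


def rle_decode(runs):
    """Expand (run_length, value) pairs back into the original sequence."""
    pixels = []
    for count, value in runs:
        pixels.extend([value] * count)
    return pixels


row = [7, 7, 7, 7, 2, 2, 9]
encoded = rle_encode(row)
print(encoded)                     # [(4, 7), (2, 2), (1, 9)]
print(rle_decode(encoded) == row)  # True: nothing was lost
```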
Lossless data compression techniques are useful and achieve compression ratios in ranges typically from 2:1 to 3:1 on average and sometimes greater. To achieve higher compression ratios such as 30:1, 40:1 or 200:1 (for video) and higher it may be necessary to use a "lossy" data compression algorithm. Lossy schemes discard some data details to realize better compression. Although a lossy data compression algorithm does lose pixel data within an image, good lossy compression systems do not seriously impair the image's quality. Small changes to pixel settings can be invisible to the viewer, especially in bitmaps with high picture frame resolutions (large frame sizes) or extensive color depths.
Frame-based image data, such as film or video, is an excellent candidate for compression by lossy techniques. Within each image it is possible to remove data redundancies and generalize information, because typically the image is filled with large pixel regions having the same color. For example, if a given pixel in a digitized image frame was set to the color red, it is likely that many other pixels in the immediate region also will be set to red or a slight variation of it. Compression algorithms take advantage of this image property by re-encoding the bitmap pixel data to generalize the color values within regions and remove data code redundancies. Such compression is called "spatial" or "intraframe" compression.
A second type of compression, "temporal" or "interframe" compression, relies on the strong data correlations that exist between frames in a motion picture sequence. From frame to frame the images are nearly identical with only small changes existing between frame images. Where one frame is already described, it is possible to describe the next frame by encoding only the changes that occur from the past frame. A frame compressed by temporal or interframe compression techniques contains only the differences between it and the previous frame; such compression can achieve substantial memory savings.
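The idea of encoding only the changes from the previous frame can be sketched as below. This is a deliberately simplified, lossless difference frame over flat lists of pixel values of equal length; real interframe coders are considerably more elaborate, and the names `encode_delta` and `apply_delta` are hypothetical.

```python
def encode_delta(previous, current):
    """List (index, new_value) for every pixel that changed since previous."""
    return [
        (index, new)
        for index, (old, new) in enumerate(zip(previous, current))
        if old != new
    ]


def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame and its changes."""
    frame = list(previous)
    for index, value in delta:
        frame[index] = value
    return frame


frame_1 = [10, 10, 10, 200, 200, 10, 10, 10]
frame_2 = [10, 10, 10, 10, 200, 200, 10, 10]   # the bright patch moved right

delta = encode_delta(frame_1, frame_2)
print(delta)                                   # [(3, 10), (5, 200)]
print(apply_delta(frame_1, delta) == frame_2)  # True
```

Only two entries describe the second frame here, instead of eight pixel values.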
Reduction of bitmap data using either intraframe (spatial) or interframe (temporal) compression techniques facilitates the efficient storage and transmission of the otherwise massive bitmap data that makes up a digitized video transmission sequence. Currently, there are several commercially available algorithms (available as software and hardware tools) for compression and decompression of video.
The standard promulgated by the Moving Picture Experts Group and known as "MPEG" (with its variants MPEG-1 and MPEG-2) is one lossy technique widely used for film and video compression. MPEG-1 was originally developed to store sound and motion picture data on compact discs and digital audio tapes. MPEG standard compression uses both intraframe and interframe compression. An MPEG compression algorithm compresses a stream of digitized video data into three types of coded frames: I-frames, P-frames and B-frames. I-frames are single, stand-alone frames which have been compressed by intraframe (spatial) reduction only. An I-frame can be decompressed and displayed without reference to any other frame and provides the backbone structure for the interframe compression. According to the Encyclopedia of Graphic File Formats (second edition) at p. 608, an MPEG data stream always begins with an I-frame. In typical operation, MPEG creates other I-frames every twelve or so frames within a video sequence.
P-frames and B-frames are frames which have been compressed using interframe (temporal) compression techniques. MPEG supports the elimination of temporal redundancies in a bi-directional fashion: an MPEG standard system will encode a difference frame based on comparison of that frame to the previous frame of video data and/or the next frame of video data. A P-frame contains data showing the differences occurring between it and the closest preceding P- or I-frame. A B-frame encodes change values found between that frame and the two closest I- or P-frames (in either direction, forward or backward) to that frame.
For all the advancement that MPEG brings to the field, it has not been widely implemented for video conferencing and other live video transmissions. While MPEG decompresses in real time, its compression algorithm is time-consuming even when implemented in hardware. Moreover, most implementations require a user to select a skeletal sequence of I-frames, a time-consuming process which all but limits most MPEG compression applications to non-real-time settings. An MPEG-2 standard has been more recently developed for use in the television industry. MPEG-2, for example, handles interlaced video formats and provides other features specific to the television industry.
ClearVideo compression by Iterated Systems is another lossy compression system currently available which provides both spatial and temporal compression of video. Like MPEG-1 and MPEG-2, ClearVideo compression also compresses on a frame-by-frame basis and compresses using a selection of "key frames" (similar to I-frames) and "difference frames" (similar to P- and B-frames). ClearVideo uses fractal compression (a mathematical process of encoding bitmaps as a set of mathematical equations that describe the image in terms of fractal properties) for its encoding of still images. Iterated Systems states that, as a result, the system requires fewer key frames than its competitors, which results in smaller, more efficient files and requires less bandwidth to transmit.
Again, for all the promise and advancement ClearVideo compression offers, the system is not well suited for real-time transmission of video images. While a ClearVideo system may compress well and allow for decompression in real time, it has limited utility for video conferencing and other live applications in its current implementation because its compression technique is slow, taking up to 30 seconds per frame, even when the compressing processor is a high-end Pentium™-type processor. Such a compression time is unacceptable for real-time applications.
Thus, there is a need for an advanced system for compression, transmission and decompression of video images, one that operates in real time and within the constraints of computers that are used by the public and in the workplace. Such a system would provide rapid, real-time processing of incoming video images and compress those images into a data stream that is easily and quickly transferable across available networked communications systems. It would also be necessary that the compressed data be easily decompressed by a receiving computer and used to generate a high quality image. Such an advance would pave the way for real-time communications like those desired by business and private users alike. Such an advancement, an easy format in which to store data more compactly than MPEG, ClearVideo or other available video compression techniques, would also lead to better ways to store and access video data.
The present invention provides a meshing-based system and method for motion picture compression, decompression, transfer, storage and display which is capable of real-time processing. The invention is particularly suited for applications such as video conferencing and other applications where real-time capture and storage or transmission of video data is needed. The system of the present invention is lossy, in that a geometric mesh structure which achieves good compression replaces the multitude of pixel values or other picture-making elements that make up a digitized image. However, the lossiness of the meshing system is easily adjustable and can be varied to suit factors such as available bandwidth, available storage capacity or the complexity of the image. With the system and method of the present invention, compression ratios on the order of 100:1 or higher are possible for real-time applications using available computer hardware.
To gain such compression, the present invention provides a technique for representing a motion picture sequence that is removed from the frame-based approach traditionally used to capture and process motion picture information. As described above, video technology is synchronous and frame-based, meaning that most video devices supply and store a frame of image data for each video frame in the motion picture sequence. Thus, for typical compression systems currently available, there is a one-for-one synchronous approach taken in accordance with the frame-based nature of motion pictures.
The present invention breaks with that tradition and uses an asynchronous, non-frame-based meshing technique to compress video data more swiftly and more compactly than the frame-based systems currently available. The system of the current invention constructs a model from the picture-making elements available to the computer. In the exemplary embodiment, the system of the current invention constructs the model using selected pixel points from the digitized video frames. However, it is understood that in addition to pixel point values, the system of the present invention could use other picture data in the model such as wavelets, Fourier components or IFS maps. The system builds the model by inserting the picture elements into a model structure and updates the model by changing picture elements (adding new picture elements or deleting old elements) so that the model reflects the current image of the motion picture sequence at any given instant. Using the mesh modeling system, the present invention does not need to represent video as a sequence of image frames, but can instead represent the video by a single model which is continuously updated by point addition or removal. A sequence of simple commands to add or remove image elements adjusts the model so that it reproduces the motion picture sequence.
In an exemplary embodiment, the present invention uses a triangulated polygonal mesh as the model structure. Traditionally, triangulated mesh constructions have been used to create computer models of objects and surfaces, typically in 3D. In those applications, a 3D object modeling system uses a set of 3D spatial (X, Y, Z) coordinates to create a "wireframe" mesh structure made up of interconnected, irregular triangles that describe the surface planes of the object. A 3D object modeling system builds the object model by connecting lines between the selected data points to form the triangles. Each triangle in the model represents a plane on the surface of the object.
The Co-Pending Application (which has been expressly incorporated by reference herein) shows that it is possible to incorporate color data and spatial data into a single triangulated mesh construction. For the creation of 3D object models, the Co-Pending Application describes a system that can merge spatial X, Y, Z values with corresponding color values (such as RGB pixel settings) and use those combined 6D (X, Y, Z, R, G, B) values to construct a mesh model which reflects both the spatial forms of the object and its surface details. In one embodiment of that system the computer adds points incrementally to a basic, initial mesh construction and increases detail of the model by adding additional points. The computer adds points based on the significance of the point in terms of contributing either spatial or color detail.
In the Co-Pending Application, it is also noted that the technique of creating mesh constructions for 3D objects using both spatial and color values can also be used to create mesh constructions for 2D images. In applying the 3D technique directly to the problem of modeling 2D images, it can be seen that the bitmap data, i.e., the x, y and RGB pixel data, from a 2D image is very much analogous to the 3D image data that would be available from a flat, planar object marked with many surface details. The set of "5D" x, y, R, G, B pixel values which make up a bitmap image would largely correspond to the 3D values for the planar object. Thus, just as a surface of a 3D object could be represented in a set of colored triangles, 2D images can also be represented as a series of colored triangles. The triangle mesh provides the structure for that image in a way that dramatically reduces the amount of data needed to create a high quality representation of the image.
The present invention expands upon the teaching of the Co-Pending Application by applying the meshing technique to motion picture sequences. A computer compressing by the system and method of the present invention creates an image model using the pixel point data from the initial digitized field of video data, selecting pixel points which are most significant in describing the image and inserting them into the mesh. The compressing system then updates that model by adding and removing points from the mesh. For a video transmission, such as video conferencing, a sending and receiving computer both maintain image models. The sending computer processes the data to compress it as described above and then transmits to the receiving computer a sequence of encoded ADD and REMOVE commands. The commands provide information so that the receiving computer can maintain a triangulated mesh that is an exact copy of the mesh at the sending computer. Based on this model, the receiving computer outputs a display of the motion picture image.
As the sending computer captures and digitizes video (such as a live video feed), an add function scans the bitmap data input by the frame grabber and determines which points from that frame should be added (following a process to locate bitmap data points which would add significant detail to the mesh). The add function then inserts the points into the model and outputs an ADD command to the receiving computer so that it can update its mesh accordingly (as described below). To locate points of significance the add function orders all the points of the new frame in terms of their significance in adding new detail to the existing model through a process which evaluates the color of each new data point in relation to the color of the same point currently in the model. Through this ordering process, the points which affect the image most are discovered and added to the model immediately.
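The ordering step can be pictured with the following sketch. It is not the add function of the exemplary embodiment; it assumes, purely for illustration, that the image currently reproduced by the model is available as a grid of RGB tuples (`model_image`), that significance is the sum of absolute per-channel color differences, and that at most `limit` points are taken per pass. All of these names and choices are hypothetical.

```python
def most_significant_points(frame, model_image, limit):
    """Return up to `limit` (x, y, color) points whose new color differs
    most from what the model currently shows at the same location."""
    scored = []
    for y, (frame_row, model_row) in enumerate(zip(frame, model_image)):
        for x, (new, old) in enumerate(zip(frame_row, model_row)):
            # Assumed error measure: sum of absolute channel differences.
            error = sum(abs(a - b) for a, b in zip(new, old))
            if error > 0:
                scored.append((error, x, y, new))
    # Largest color error first; ties keep scan order (sort is stable).
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(x, y, color) for _, x, y, color in scored[:limit]]


model_image = [[(10, 10, 10), (10, 10, 10)],
               [(10, 10, 10), (10, 10, 10)]]
frame = [[(10, 10, 10), (200, 0, 0)],
         [(10, 10, 10), (10, 10, 40)]]

print(most_significant_points(frame, model_image, limit=1))
# [(1, 0, (200, 0, 0))]
```

Lowering `limit` is one way such a scheme could trade image quality against bandwidth.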
The second process is the remove function which, like the add function, scans data input from each new digitized video field. However, unlike the add function, the remove function determines which points must be removed from the current model by establishing that they no longer apply to the current image. In the 1/30 of a second that exists between the input of data from each field, the present invention, configured with the add and remove functions, can make on the order of 1000 point insertions per interval (on currently available hardware) and any number of point deletions per interval. However, the number of point insertions and deletions made can be tailored to suit the desired image quality or the available bandwidth of the transmission system.
The addition and removal of points to and from the mesh creates corresponding changes to its structure. Adding a point also adds additional triangles. Deleting a point removes triangles. The addition and removal procedures will also cause related changes to the structure and configuration of the mesh in the areas around where the point addition or removal occurs. In mesh building, it is an aspect of the present invention that it follows a procedure to optimize the construction of the structure throughout each point addition or deletion. Although the computer can be configured to optimize the mesh structure by many different procedures, in the exemplary embodiment the present invention optimizes by the principles of Delaunay optimization. When the triangulation follows Delaunay principles, a circumcircle defined by the vertices of a triangle will not contain another data point of the mesh. When the triangle in question does include another point within its circumcircle, that configuration must be reconfigured by "flipping" the common edge that exists between the two adjacent triangles. The Delaunay triangulation optimality principle helps to ensure that the mesh of irregular triangles maintains a construction of relatively evenly sized and angled triangles. It is currently recognized as one sound process for optimizing triangulated mesh constructions. The modeling process uses the add and remove functions with Delaunay principles as explained in further detail below.
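The circumcircle condition can be checked with the standard determinant-based in-circle test, sketched below. This shows the geometric predicate only, not the mesh maintenance procedure of the invention; the function names are hypothetical, and points are assumed to be (x, y) tuples.

```python
def orientation(a, b, c):
    """Positive if a, b, c turn counterclockwise, negative if clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, d):
    """True if point d lies strictly inside the circle through a, b and c."""
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    # The sign of the determinant assumes counterclockwise a, b, c.
    if orientation(a, b, c) < 0:
        det = -det
    return det > 0


def needs_flip(a, b, c, d):
    """Triangles (a, b, c) and (a, b, d) share edge a-b; the edge violates
    the Delaunay condition when d falls inside the circumcircle of a, b, c."""
    return in_circumcircle(a, b, c, d)


print(needs_flip((0, 0), (4, 0), (0, 4), (3, 3)))  # True: flip the shared edge
print(needs_flip((0, 0), (4, 0), (0, 4), (5, 5)))  # False: already Delaunay
```

With floating-point coordinates, points very near the circle may need a tolerance or exact arithmetic; the integer examples above avoid that issue.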
The remove function works to update the mesh model at the sending computer and outputs REMOVE commands to the receiving computer. It is an aspect of this invention that the computer at the sending location specially encodes each ADD and REMOVE command so that each is in a very compact form before being sent to the receiving computer. Each ADD or REMOVE command contains information about the intended operation, e.g., "ADD x, y R G B". However, before each function transmits a command, it first encodes the command (in the process described below) so that it takes up less space.
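The way a command stream keeps two models identical can be sketched as follows. This sketch makes several simplifying assumptions: commands are plain tuples rather than the compact encoding referred to in the text, the model is reduced to a dictionary of points without any triangulation, and a REMOVE command is assumed to carry only the x, y location. The name `apply_command` is hypothetical.

```python
def apply_command(points, command):
    """Apply one ADD or REMOVE command to a model held as {(x, y): (r, g, b)}."""
    operation = command[0]
    if operation == "ADD":
        _, x, y, r, g, b = command
        points[(x, y)] = (r, g, b)
    elif operation == "REMOVE":
        _, x, y = command
        points.pop((x, y), None)
    else:
        raise ValueError(f"unknown command: {operation!r}")


# The sender applies each command locally and transmits it; the receiver
# applies the same commands in the same order.
stream = [
    ("ADD", 12, 40, 255, 0, 0),
    ("ADD", 13, 41, 250, 5, 5),
    ("REMOVE", 12, 40),
]

sender_model, receiver_model = {}, {}
for command in stream:
    apply_command(sender_model, command)
    apply_command(receiver_model, command)

print(receiver_model)                  # {(13, 41): (250, 5, 5)}
print(sender_model == receiver_model)  # True
```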
The receiving computer accepts each encoded ADD and REMOVE command and then outputs a display. The receiving computer also uses the model information to output the motion picture display. It is an aspect of the invention that it does not generate an entire new frame each time the images need to be updated. Instead, the present invention draws locally. Using the mesh model the computer draws (and redraws) triangles only as necessary to update the image. When a point is inserted or deleted the adding or deleting procedure will require an adjustment of the triangles that exist in that region of the mesh. To maintain the display after each addition or deletion, the present invention redraws the triangles which have been affected by the point addition or deletion. Since many triangles in the mesh are not affected, they do not need to be redrawn.
Using functions like Gouraud shading, the present invention can quickly render an image based on these triangle settings. The image shifts as the computer updates the triangles, thus making a motion picture display.
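Gouraud shading fills a triangle by blending the colors at its three vertices. A minimal sketch of that interpolation for a single pixel, using barycentric weights, is given below; it is not the rendering routine of the exemplary embodiment, the function names are hypothetical, and the triangle is assumed to be non-degenerate.

```python
def barycentric(p, a, b, c):
    """Weights of vertices a, b, c at point p (they sum to 1)."""
    denom = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w_a = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / denom
    w_b = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / denom
    return w_a, w_b, 1 - w_a - w_b


def gouraud_color(p, vertices, colors):
    """Interpolate the RGB colors of the three vertices at pixel p."""
    weights = barycentric(p, *vertices)
    return tuple(
        round(sum(w * value for w, value in zip(weights, channel)))
        for channel in zip(*colors)   # iterate over the R, G and B channels
    )


vertices = [(0, 0), (4, 0), (0, 4)]
colors = [(200, 0, 0), (0, 100, 0), (0, 0, 40)]

print(gouraud_color((0, 0), vertices, colors))  # (200, 0, 0): exactly vertex a
print(gouraud_color((1, 1), vertices, colors))  # (100, 25, 10): a blend
```

Because each triangle is shaded from its own vertices only, redrawing just the triangles touched by a point insertion or deletion is enough to update the display.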
The system presented employs computer equipment, cameras, a communications system and displays in the exemplary embodiment, as well as computer programmed elements to control processing. The elements of the system are detailed in the description below.