There has been slow progress in uniting the world of video and film with the power of the computer so that motion picture images--especially live video--can be quickly transmitted to users within a computer network. The advent of the computer network has brought forth tremendous communications capability. Where computers were once seen only as whirring number crunchers and processing machines, they are now also seen as potential vehicles for entertainment, advertising, information access and communication. Video technology holds tantalizing opportunities for businesses, entrepreneurs and the public at large. In the workplace, the ordinary PC, a fixture on most office desks, could make better use of business resources through video conferencing and other interactive communications that link one worker or working group to another. Intraoffice computer networks could provide training, demonstrations, reports and news through broadcasts that use one centralized computer to send live or taped video to workstations within the office or to linked office and customer sites. Previously, live visual communication links were not thought feasible without specialized video or television equipment.
The establishment of the Internet and its World Wide Web has also created demand for increased use of motion pictures in computer applications. Businesses see the Internet's vast network potential as a boon for interactive communications with the public at large. Entrepreneurs have envisioned, and have even attempted, live on-line broadcasts of news, concerts and other events, but these attempts have been frustrated by the current limitations of real-time computer video technology. Further, as more people communicate via the World Wide Web, there is a natural incentive to create polished information access sites. Internet users come steeped in the heritage of television, movies and other forms of highly produced motion picture entertainment. These users imagine communicating with that same clarity, expediency and visual power and have come to expect such standards.
The potential for such real-time video communications exists, but until this point there has been great difficulty in transmitting motion picture image sequences, live video (television) and previously recorded film and video through the computer. Computer speed, memory and disk storage have expanded enough to make the storage of digitized film and video clips possible. However, the inordinate amount of data that must be transmitted to display a digitized moving picture sequence on the computer has been one factor preventing the widespread use of video and film in real time applications--especially those in which speed is imperative, like video conferencing, live news feeds and live entertainment broadcasts. The data problem pertains to the nature of the digital computer and network hardware, the method by which a computer generates images and the processing that is needed to handle the many images that make up a motion picture sequence. Since its invention, motion picture technology has followed a process of presenting a rapid sequence of still images to give the impression of motion to the eye. A film is essentially a "flip book" of still camera photographs (i.e. frames) stored on a long strip used for playback through a projector. Current video technology follows the same frame-based concept as film, with some variation. A video camera rapidly collects a sequence of light images by scanning in horizontal movements across a light sensitive device and outputting a stream of "broadcast line" data which describes the image. Typically, a camera scans every other available line on the light sensitive device and alternates between line sets (odd and even) to create two half-frame "fields" which, when interlaced, form a full-frame image. Video has typically been recorded by video camera in analog format, but cameras which can record video in digital format are available.
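To make the field structure concrete, the Python sketch below interlaces ("weaves") two half-frame fields into one full frame. It assumes each field is a list of scan lines, with the odd field holding frame lines 1, 3, 5, ... and the even field holding lines 2, 4, 6, ...; actual video hardware may order or number the fields differently. The function name `weave_fields` is hypothetical.

```python
from typing import List, Sequence

ScanLine = Sequence[int]  # one horizontal line of pixel samples


def weave_fields(odd_field: Sequence[ScanLine],
                 even_field: Sequence[ScanLine]) -> List[ScanLine]:
    """Interlace two half-frame fields into one full frame.

    Assumes the odd field carries frame lines 1, 3, 5, ... and the even
    field carries lines 2, 4, 6, ... (numbered from 1, top to bottom).
    """
    if len(odd_field) != len(even_field):
        raise ValueError("fields must contain the same number of lines")
    frame: List[ScanLine] = []
    for odd_line, even_line in zip(odd_field, even_field):
        frame.append(odd_line)   # frame line 1, 3, 5, ...
        frame.append(even_line)  # frame line 2, 4, 6, ...
    return frame


# Two 2-line fields produce one 4-line frame.
odd = [[1, 1], [3, 3]]
even = [[2, 2], [4, 4]]
print(weave_fields(odd, even))  # [[1, 1], [2, 2], [3, 3], [4, 4]]
```

Each field carries only half the lines, which is why a field can be captured in half the time of a full frame.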
To transmit analog video via a computer, each frame or field input to the computer must be converted into a digital format or "digitized" for use. A computer screen is made up of hundreds of thousands, or even millions, of pixels--programmable light units which can be instantly set and reset to emit light in one of the multitude of colors supported by the computer system. Typical monitors (ranging from 12 to 21 inches on the diagonal) contain pixel matrices having resolutions of e.g. 640×512, 1,024×820, 1,280×1,024 and 1,600×1,280 pixels, organized as rows of pixels stacked upon rows of pixels. Each pixel in the screen display requires a color assignment from the computer to construct an image. Computer display controllers contain a large memory space, called a bitmap memory, which allocates an amount of memory for each pixel unit on the screen, e.g. 640×512, 1,024×820, 1,280×1,024, etc. (Additional screens of the same size, which are processed and worked on in the background, can also be defined in the bitmap memory.) The computer drives the monitor and creates images via the bitmap memory, writing pixel color assignments to its memory locations and outputting signals to the monitor based on those assignments. The digitization process creates a set of digital pixel assignments for each frame or field of video input.
During video capture, a computer executes an analog-to-digital ("A/D") conversion process--reading the provided film or video data (using specialized "frame grabber" hardware) and transforming the analog data into a stream of digital color codes, i.e. a bitmap data set for each frame or field of the motion picture. The data size of a digital video stream depends upon the resolution at which the video was digitized, which in turn is determined by three factors: i) frame resolution or frame size; ii) color depth; and iii) frame rate.
Frame resolution, or frame size, is the size in pixels of each digitized frame bitmap. Frame size need not be directly related to the monitor resolution in any computer configuration. Thus, while a monitor may have a resolution of 640×512 or 1,024×820, for example, a video can be digitized at a different resolution, such as 320×240. Video following the analog standard of the National Television System Committee (NTSC) is commonly digitized to frames of 640×480, 320×240, 160×120 or other resolutions. Such video could well be displayed on a computer having a monitor resolution of 1,280×1,024 or other resolution.
Color depth specifies the number of bits used by the digitizer to describe the color setting for each pixel of a digitized frame bitmap. Computer pixel units typically output color following one of several color-generating systems. RGB (Red, Green, Blue) is one system which permits all the colors of an available palette to be expressed as combinations of different amounts of red, green and blue. Red, green and blue light elements or "color channels" are considered primary and can be blended according to color theory principles to form other colors. Electron guns fire beams to activate each of the light elements to different degrees and form colors that make up an image. The pixel assignments written to the bitmap memory control the settings used in the monitor to output colors using the pixels.
Computers vary greatly in the range of colors they can support, the number often depending on the size of the bitmap memory (an expensive item) and the size of the memory space dedicated to each pixel in the bitmap. Color systems that support a palette of 256 (or 2^8) different colors allocate 8 binary bits (or one byte) to each pixel in the bitmap memory and make pixel color assignments by writing 8-bit numbers to those locations. Such systems are said to provide "8-bit" color. More advanced systems support palettes of 65,536 (or 2^16) or 16,777,216 (or 2^24) colors and hence allocate either 16 or 24 bits (two or three bytes) per pixel in the bitmap memory. These systems are said to provide "16-bit" or "24-bit" color. A 24-bit color system is said to display in "true color," or in as many colors as the human eye can discern. Video can be digitized to follow an 8-bit, 16-bit, 24-bit or other format. In the digitizing process, it is not necessary that the digitized video use the color format of the displaying computer. For example, it is possible using analog-to-digital conversion software to digitize a video in 16-bit color and display the video on a computer configured for 24-bit color. Most computers supporting color video have software available to make such translations.
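A translation between color depths can be sketched as simple bit packing. The Python example below assumes the common "RGB565" layout for 16-bit color (5 bits of red, 6 of green, 5 of blue); other 16-bit layouts exist, and the function names are hypothetical rather than part of any particular conversion software.

```python
from typing import Tuple


def rgb24_to_rgb565(r: int, g: int, b: int) -> int:
    """Pack 8-bit-per-channel (24-bit) color into a 16-bit RGB565 value (lossy)."""
    for channel in (r, g, b):
        if not 0 <= channel <= 255:
            raise ValueError("channel values must be in 0..255")
    # Keep only the most significant 5, 6 and 5 bits of red, green and blue.
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def rgb565_to_rgb24(value: int) -> Tuple[int, int, int]:
    """Expand a 16-bit RGB565 value back to 8 bits per channel."""
    r5 = (value >> 11) & 0x1F
    g6 = (value >> 5) & 0x3F
    b5 = value & 0x1F
    # Replicate high bits into the low bits so full intensity maps to 255.
    return ((r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2))


# Palette sizes for the color depths discussed above.
for bits in (8, 16, 24):
    print(bits, "bits ->", 2 ** bits, "colors")

packed = rgb24_to_rgb565(255, 128, 0)
print(hex(packed), rgb565_to_rgb24(packed))  # 0xfc00 (255, 130, 0)
```

The round trip returns green as 130 rather than 128: dropping low-order bits is why moving from 24-bit to 16-bit color discards some information, while expanding 16-bit video for a 24-bit display does not.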
Finally, frame rate is the speed at which the camera captures the video frames. Motion picture sequences give the impression of movement when images are displayed at rates of more than 12-15 frames per second. Video cameras following the NTSC standard used in the United States output 30 frames per second, or 60 fields per second. Many frame grabbers can capture and digitize analog video at real-time motion speeds of 30 frames per second. However, many frame grabbers digitize at lower speeds, such as 15 frames per second. If the computer system depends on a frame grabber with a low frame processing speed, the frame rate is also limited by the frame grabber's processing rate.
Using the variables of frame size, color depth and frame rate, it is possible to calculate the rate at which digitized video in bitmap form flows into the memory of the processing computer. Video digitized at a relatively small 320×240 frame size, with a 24-bit (3-byte) color depth and a frame rate of 15 frames/second (sampling every other video frame), requires approximately 207 megabytes (MB) of storage per minute. A video sequence digitized at a 640×480 frame size, a 24-bit (3-byte) color depth and a 30 frames/second rate would require approximately 1.66 gigabytes (GB) of storage per minute of video (about 1.54 GB if a gigabyte is counted as 2^30 bytes). Both requirements quickly exhaust the storage capacity of most commercially available hard drives, which provide on the order of 1 GB of space in total. Further, even if the processor available on the computer could feed the data for transmission directly to a remote terminal, the transmission capacity (i.e. the "bandwidth") of most communications systems used today is not sufficient to handle such a data flow in real time.
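The storage figures follow from multiplying frame width, frame height, bytes per pixel, frames per second and 60 seconds. The short Python sketch below reproduces them; the function name is hypothetical, and the output shows both the decimal gigabyte (10^9 bytes) and the binary one (2^30 bytes), which accounts for the difference between roughly 1.66 and 1.54 for the larger format.

```python
def storage_per_minute(width: int, height: int, bytes_per_pixel: int, fps: int) -> int:
    """Bytes of uncompressed bitmap data produced per minute of video."""
    bytes_per_frame = width * height * bytes_per_pixel
    return bytes_per_frame * fps * 60


small = storage_per_minute(320, 240, 3, 15)  # 24-bit color, every other NTSC frame
large = storage_per_minute(640, 480, 3, 30)  # 24-bit color, full NTSC frame rate

print(f"320x240, 24-bit, 15 fps: {small / 1e6:.1f} MB/min")  # 207.4 MB/min
print(f"640x480, 24-bit, 30 fps: {large / 1e9:.2f} GB/min "
      f"({large / 2**30:.2f} GB/min with 2^30-byte gigabytes)")  # 1.66 (1.54)
```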
Commercially available modems can transfer data at rates of e.g., 28,000 baud, which translates roughly to 28,000 bits (3,500 bytes) per second, or approximately 1.7 megabits (about 0.2 megabytes) per minute--clearly not sufficient capacity to handle the requirements of roughly 207 megabytes or more than 1.5 gigabytes per minute outlined above. An Integrated Services Digital Network (ISDN) connection provides greater transmission capability than most commercially available modems but still does not provide the capacity necessary for transmitting streams of video in bitmap data form. A typical ISDN Internet connection transfers data at rates approaching 128 kilobits per second (approximately 7.7 megabits, or under 1 megabyte, per minute). Local area networks (LANs) have data rates that vary depending on the size of the LAN, the number of users, the configuration of the LAN system and other factors. Although LAN transmission rates vary widely, a typical Ethernet system transfers information at a rate of 10 megabits per second (Mb/s), or about 1.25 megabytes per second. Faster Ethernet systems can transfer information at a rate of 100 Mb/s.
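One way to read these link rates against the video rates is as the compression ratio a stream would need in order to fit each link. The Python sketch below assumes the full nominal link rate is dedicated to a single video stream, with no protocol overhead, audio or competing traffic, so the ratios are optimistic lower bounds; the link table and function names are hypothetical.

```python
# Nominal link rates in bits per second, taken from the figures above (assumed dedicated).
LINKS = {
    "28,000 bps modem": 28_000,
    "128 kbps ISDN": 128_000,
    "10 Mbps Ethernet": 10_000_000,
    "100 Mbps Ethernet": 100_000_000,
}


def video_bits_per_second(width: int, height: int, bytes_per_pixel: int, fps: int) -> int:
    """Uncompressed bitmap video rate in bits per second."""
    return width * height * bytes_per_pixel * 8 * fps


def required_ratio(video_bps: int, link_bps: int) -> float:
    """Compression ratio needed for the stream to fit the link (1.0 means it fits as-is)."""
    return max(1.0, video_bps / link_bps)


video = video_bits_per_second(320, 240, 3, 15)  # 27,648,000 bits/s
for name, rate in LINKS.items():
    print(f"{name}: needs about {required_ratio(video, rate):.1f}:1 compression")
```

Even the smaller 320×240 format would need on the order of 1,000:1 compression over a modem and 216:1 over ISDN, far beyond the 2:1 to 3:1 typical of lossless methods.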
The large amount of space required by digitized video data in bitmap form makes real-time transmission of such data largely impossible given the current bandwidth of most network systems. Thus, researchers have searched for ways to "compress" bitmap data--to encode the data differently so that it takes up less space but still yields the same images, or images close enough to the originals. Compression algorithms reduce the amount of data used to store and transmit graphic images, while keeping enough data to generate a good quality representation of the image.
Data compression techniques are either "lossless" or "lossy." A lossless compression system encodes the bitmap data file to remove redundancies but loses none of the original data after compression. A bitmap file which is compressed by a lossless compression algorithm and thereafter decompressed will be identical to the original file. Run-length encoding (RLE) and LZW (Lempel-Ziv-Welch) encoding are examples of lossless encoding algorithms.
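Run-length encoding can be illustrated with a minimal Python sketch that stores one row of pixel values as (count, value) pairs. This shows only the idea; actual RLE file formats (for example, the RLE modes of BMP or PCX files) define their own byte layouts, which are not reproduced here, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def rle_encode(pixels: Sequence[int]) -> List[Tuple[int, int]]:
    """Encode a row of pixel values as (count, value) runs."""
    runs: List[Tuple[int, int]] = []
    for value in pixels:
        if runs and runs[-1][1] == value:
            count, _ = runs[-1]
            runs[-1] = (count + 1, value)  # extend the current run
        else:
            runs.append((1, value))        # start a new run
    return runs


def rle_decode(runs: Sequence[Tuple[int, int]]) -> List[int]:
    """Expand (count, value) runs back into the original pixel row."""
    pixels: List[int] = []
    for count, value in runs:
        pixels.extend([value] * count)
    return pixels


row = [7, 7, 7, 7, 2, 2, 9, 7, 7, 7]
encoded = rle_encode(row)
print(encoded)                     # [(4, 7), (2, 2), (1, 9), (3, 7)]
assert rle_decode(encoded) == row  # lossless: the row is recovered exactly
```

Decoding reproduces the input exactly, which is the defining property of a lossless scheme; the savings depend on how long the runs of identical values are.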
Lossless data compression techniques are useful and typically achieve compression ratios of 2:1 to 3:1 on average, sometimes greater. To achieve higher compression ratios, such as 30:1, 40:1, or 200:1 and higher for video, it may be necessary to use a "lossy" data compression algorithm. Lossy schemes discard some data details to realize better compression. Although a lossy data compression algorithm does lose pixel data within an image, good lossy compression systems do not seriously impair the image's quality. Small changes to pixel settings can be invisible to the viewer, especially in bitmaps with high picture frame resolutions (large frame sizes) or extensive color depths.
Frame-based image data, such as film or video, is an excellent candidate for compression by lossy techniques. Within each image it is possible to remove data redundancies and generalize information, because typically the image is filled with large pixel regions having the same color. For example, if a given pixel in a digitized image frame was set to the color red, it is likely that many other pixels in the immediate region also will be set to red or a slight variation of it. Compression algorithms take advantage of this image property by re-encoding the bitmap pixel data to generalize the color values within regions and remove data code redundancies. Such compression is called "spatial" or "intraframe" compression.
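As an illustration of how generalizing color values within a region exposes redundancy, the Python sketch below applies simple uniform quantization to a row of samples and then counts runs of identical values. Uniform quantization is a stand-in chosen for clarity, not the method of any particular codec, and the step size of 16 is an arbitrary assumption.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def quantize(pixels: Sequence[int], step: int) -> List[int]:
    """Lossy step: snap each 8-bit value to the centre of its quantization bucket."""
    return [min(255, (p // step) * step + step // 2) for p in pixels]


def runs(pixels: Sequence[int]) -> List[Tuple[int, int]]:
    """Lossless step: collapse repeated neighbouring values into (count, value) runs."""
    return [(len(list(group)), value) for value, group in groupby(pixels)]


# A row of red-channel samples with slight variations, as within a "red" region.
row = [200, 201, 199, 202, 200, 198, 60, 61, 59, 60]

print(len(runs(row)))                # 10 runs: no two neighbours match exactly
print(runs(quantize(row, step=16)))  # [(6, 200), (4, 56)]
```

The slight variations make the raw row incompressible by run-length coding, but after quantization the two regions collapse into two runs at the cost of small, often invisible, changes to individual pixels.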
A second type of compression, "temporal" or "interframe" compression, relies on the strong data correlations that exist between frames in a motion picture sequence. From frame to frame the images are nearly identical with only small changes existing between frame images. Where one frame is already described, it is possible to describe the next frame by encoding only the changes that occur from the past frame. A frame compressed by temporal or interframe compression techniques contains only the differences between it and the previous frame; such compression can achieve substantial memory savings.
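A minimal form of interframe coding can be sketched by recording only the pixels that changed since the previous frame. The Python example below works on flattened frames and stores changes as an index-to-value map; this simple per-pixel scheme is an assumption for illustration, whereas standards such as MPEG use motion-compensated prediction on blocks of pixels. The function names are hypothetical.

```python
from typing import Dict, List, Sequence


def encode_difference(previous: Sequence[int], current: Sequence[int]) -> Dict[int, int]:
    """Return {pixel_index: new_value} for pixels that changed since the previous frame."""
    if len(previous) != len(current):
        raise ValueError("frames must be the same size")
    return {i: new for i, (old, new) in enumerate(zip(previous, current)) if old != new}


def apply_difference(previous: Sequence[int], changes: Dict[int, int]) -> List[int]:
    """Rebuild the current frame from the previous frame plus the recorded changes."""
    frame = list(previous)
    for index, value in changes.items():
        frame[index] = value
    return frame


# Two flattened 4x4 frames in which only two pixels change (e.g. a small moving object).
frame1 = [0] * 16
frame2 = list(frame1)
frame2[5] = 255
frame2[6] = 255

diff = encode_difference(frame1, frame2)
print(diff)  # {5: 255, 6: 255}
assert apply_difference(frame1, diff) == frame2
```

Only 2 of the 16 pixel values need to be sent for the second frame; the receiver must already hold the previous frame, which is why a stand-alone frame is needed periodically as a starting point.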
Reduction of bitmap data using either intraframe (spatial) or interframe (temporal) compression techniques facilitates the efficient storage and transmission of the otherwise massive bitmap data that makes up a digitized video transmission sequence. Currently, there are several commercially available algorithms (available as software and hardware tools) for compression and decompression of video.
The standard promulgated by the Moving Picture Experts Group and known as "MPEG" (with its variants MPEG-1 and MPEG-2) is one lossy technique widely used for film and video compression. MPEG-1 was originally developed to store sound and motion picture data on compact discs and digital audio tapes. MPEG standard compression uses both intraframe and interframe compression. An MPEG compression algorithm compresses a stream of digitized video data into three types of coded frames: I-frames, P-frames and B-frames. I-frames are single, stand-alone frames which have been compressed by intraframe (spatial) reduction only. An I-frame can be decompressed and displayed without reference to any other frame and provides the backbone structure for the interframe compression. According to the Encyclopedia of Graphic File Formats (second edition) at p. 608, an MPEG data stream always begins with an I-frame. In typical operation, MPEG creates other I-frames every twelve or so frames within a video sequence.
P-frames and B-frames are frames which have been compressed using interframe (temporal) compression techniques. MPEG supports the elimination of temporal redundancies in a bi-directional fashion--an MPEG standard system will encode a difference frame based on comparison of that frame to the previous frame of video data and/or the next frame of video data. A P-frame contains data showing the differences occurring between it and the closest preceding P- or I-frame. A B-frame encodes change values found between that frame and the closest I- or P-frames on either side of it: one preceding it (forward prediction) and one following it (backward prediction).
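The reference structure of I-, P- and B-frames can be sketched by mapping each frame in display order to the frames it is predicted from. The Python example below assumes a hypothetical pattern string such as "IBBPBBPBBPBBI" (an I-frame roughly every twelve frames); real MPEG streams signal frame types in their picture headers, and the function name is hypothetical.

```python
from typing import Dict, List, Optional


def frame_references(pattern: str) -> Dict[int, List[int]]:
    """Map each frame index (display order) to the indices of the frames it is predicted from.

    I-frames reference nothing; a P-frame references the closest preceding I- or P-frame;
    a B-frame references the closest preceding and the closest following I- or P-frame.
    """
    anchors = [i for i, kind in enumerate(pattern) if kind in "IP"]
    refs: Dict[int, List[int]] = {}
    for i, kind in enumerate(pattern):
        previous: Optional[int] = max((a for a in anchors if a < i), default=None)
        following: Optional[int] = min((a for a in anchors if a > i), default=None)
        if kind == "I":
            refs[i] = []
        elif kind == "P":
            if previous is None:
                raise ValueError(f"P-frame {i} has no preceding I- or P-frame")
            refs[i] = [previous]
        elif kind == "B":
            if previous is None or following is None:
                raise ValueError(f"B-frame {i} needs anchor frames on both sides")
            refs[i] = [previous, following]
        else:
            raise ValueError(f"unknown frame type {kind!r}")
    return refs


gop = "IBBPBBPBBPBBI"
for index, sources in frame_references(gop).items():
    print(index, gop[index], sources)  # e.g. "1 B [0, 3]", "3 P [0]"
```

Because a B-frame depends on a later anchor frame, an encoder must wait for that anchor (and transmit it first) before the B-frame can be coded, which adds delay in live settings.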
For all the advancement that MPEG brings to the field, it has not been widely implemented for video conferencing and other live video transmissions. While MPEG decompresses in real time, its compression algorithm is time-consuming even when implemented in hardware. Moreover, most implementations require a user to select a skeletal sequence of I-frames, a time-consuming process which largely confines MPEG compression applications to non-real-time settings. An MPEG-2 standard has been more recently developed for use in the television industry. MPEG-2, for example, handles interlaced video formats and provides other features specific to the television industry.
ClearVideo compression by Iterated Systems is another lossy compression system currently available which provides both spatial and temporal compression of video. Like MPEG-1 and MPEG-2, ClearVideo compresses on a frame-by-frame basis, using a selection of "key frames" (similar to I-frames) and "difference frames" (similar to P- and B-frames). ClearVideo uses fractal compression--a mathematical process of encoding bitmaps as a set of mathematical equations that describe the image in terms of fractal properties--to encode still images, and Iterated Systems states that as a result it requires fewer key frames than its competitors, which yields smaller, more efficient files that require less bandwidth to transmit.
Again, for all the promise and advancement ClearVideo compression offers, the system is not well suited for real-time transmission of video images. While a ClearVideo system may compress well and allow for decompression in real time, it has limited utility for video conferencing and other live applications in its current implementation because its compression technique is slow--taking up to 30 seconds per frame, even when the compressing processor is a high-end Pentium™-type processor. Such a compression time is unacceptable for real-time applications.
Thus, there is a need for an advanced system for the compression, transmission and decompression of video images that operates in real time and within the constraints of the computers used by the public and in the workplace. Such a system would rapidly process incoming video images in real time and compress those images into a data stream that is easily and quickly transferable across available networked communications systems. It would also be necessary that the compressed data be easily decompressed by a receiving computer and used to generate a high quality image. Such an advance would pave the way for the real-time communications desired by business and private users alike. Such an advancement--an easy format in which to store data more compactly than MPEG, ClearVideo or other available video compression techniques--would also lead to better ways to store and access video data.