Throughout history, various systems have been employed for communicating messages over short distances. Optical telegraphs, such as smoke signals, beacons and semaphore networks, date back to ancient times. Of course, such systems require a direct line of sight between the communicating parties and are effective only over relatively short distances. With the emergence of the electrical telegraph in the 1800s, the transmission of communication signals between two parties, even over great distances, became far more practical and cost effective. More recently, with the development of two-way radio communication systems and analog or digital telephone networks, it has become a more-or-less routine matter to communicate with one or more parties located virtually anywhere in the world.
Unfortunately, most current systems that support communication over large distances are limited in that they do not include a visual-communication component. As a result, visual cues including body language, facial expressions and gestures are not conveyed between the communicating parties. Such visual cues are an important and often unconscious aspect of communication between humans. Without these familiar visual cues, it is more difficult to interpret accurately another person's reactions, moods and sincerity.
This limitation is well recognized, and since at least the 1960s there has been an ongoing effort to develop practical ways of including a visual-communication component in addition to audio communication between parties. In fact, this goal has been achieved, with varying degrees of success, using videoconferencing technology and videophones. A videoconference uses a set of interactive telecommunication technologies that allow two or more locations to interact via simultaneous two-way video and audio transmissions. The core technology used in a videoconference system is digital compression of audio and video streams in real time. The other components of a videoconference system include: video input, e.g. a video camera or webcam; video output, e.g. a computer monitor, television or projector; audio input, e.g. microphones; audio output, e.g. loudspeakers associated with the display device or telephone; and data transfer, e.g. an analog or digital telephone network, a LAN or the Internet.
Simple analog videoconferences could be established as early as the invention of the television. Such videoconferencing systems consisted of two closed-circuit television systems connected via cable, radio-frequency links, or mobile links. Attempts at using normal telephony networks to transmit slow-scan video, such as the first systems developed by AT&T, failed mostly due to poor picture quality and the lack of efficient video compression techniques. It was only in the 1980s that digital telephony transmission networks such as ISDN became possible, assuring a minimum bit rate (usually 128 kilobits/s) for compressed video and audio transmission. Finally, in the 1990s, IP (Internet Protocol) based videoconferencing became possible, and more efficient video compression technologies were developed, permitting desktop, or personal computer (PC)-based, videoconferencing.
It is worth noting at this point that businesses and individuals have been slow to adopt IP-based videoconferencing despite its many advantages, even as high-speed Internet service has become more widely available at a reasonable cost. This slow adoption is due at least in part to the typically uncomfortable experience that is associated with IP-based videoconferencing. In particular, the video component is often of poor quality, “choppy”, or not precisely synchronized with the audio component of the communication. Rather than enhancing communication, the video component may actually provide false visual cues and even disorient or nauseate those who are party to the communication. Of course, wider adoption is likely to occur when the video component is improved sufficiently to provide more natural motion and a more life-like representation of the communicating parties. Accordingly, each incremental improvement in the encoding and transmission of video data is an important step toward achieving widespread adoption of videoconferencing.
A more recent development, which is related closely to videoconferencing, is telepresence. Telepresence refers to a set of technologies that allow a person to feel as if they were present, to give the appearance that they were present, or to have an effect, at a location other than their true location. A good telepresence strategy puts the human factors first, focusing on visual collaboration solutions that closely replicate the brain's innate preferences for interpersonal communications and that depart from the unnatural “talking heads” experience of traditional videoconferencing. The cues that contribute to this experience include life-size participants, fluid motion, accurate flesh tones and the appearance of true eye contact. In many telepresence applications there is an implicit requirement for high-resolution video content.
A major obstacle to the widespread adoption of videoconferencing and telepresence is the need to transmit, consistently and in real time, a large amount of video data between two or more remote locations via a communications network. As a result, video encoding techniques are used to reduce the amount of video data that is transmitted. For instance, MPEG algorithms compress data to form small data sets that can be transmitted easily and then decompressed. MPEG achieves its high compression ratio by representing only the changes from one frame to another, instead of each entire frame. For example, in a scene in which a person walks past a stationary background, only the moving region needs to be represented, either using motion compensation, or as refreshed image data, or as a combination of the two, depending on which representation requires fewer bits to represent the picture adequately. The parts of the scene that are not changing need not be sent repeatedly. The video information is then encoded using a technique called the Discrete Cosine Transform (DCT). MPEG uses a type of lossy compression, since some data is removed, but the loss of data is generally imperceptible to the human eye.
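The idea of sending only the regions that change can be illustrated with a minimal sketch. The Python below is not MPEG and performs no motion compensation or DCT; it only compares two small grayscale frames block by block and reports which blocks differ. The function name `changed_blocks`, the 2×2 block size and the pixel values are assumptions chosen purely for illustration.

```python
def changed_blocks(prev, curr, block=2, threshold=0):
    """Return the (row, col) origins of blocks that differ between two frames.

    Frames are lists of rows of integer pixel values (hypothetical format).
    """
    height, width = len(curr), len(curr[0])
    changed = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            # Sum of absolute pixel differences inside this block.
            diff = sum(
                abs(curr[y][x] - prev[y][x])
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            )
            if diff > threshold:
                changed.append((top, left))
    return changed


# 4x4 frames: a bright "object" moves to the right over a static background.
background = 10
prev_frame = [[background] * 4 for _ in range(4)]
curr_frame = [[background] * 4 for _ in range(4)]
prev_frame[0][0] = prev_frame[0][1] = 200
curr_frame[0][2] = curr_frame[0][3] = 200

blocks = changed_blocks(prev_frame, curr_frame)
total_blocks = (4 // 2) * (4 // 2)
print(blocks, f"{len(blocks)} of {total_blocks} blocks need updating")
```

In this toy scene only the two blocks touched by the moving object would need to be sent; the static lower half of the frame is skipped.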
The three major picture (or frame) types found in typical video compression designs are Intra-coded pictures (I-frames), Predicted pictures (P-frames), and Bi-predictive pictures (B-frames). For real-time video communication, however, only I-frames and P-frames are considered. In a motion sequence, individual frames are grouped together (into a group of pictures, or GOP) and played back so that the viewer registers the video's spatial motion. An I-frame, also called a keyframe, is a single frame of digital content that the encoder examines independently of the frames that precede it; the I-frame stores all of the data needed to display that frame. Typically, I-frames are interspersed with P-frames in a compressed video. The more I-frames the video contains, the better its quality will be; however, I-frames contain the most data and therefore increase the network traffic load. P-frames follow I-frames and contain only the data that have changed from the preceding I-frame (such as color or content changes). Because of this, P-frames depend on the I-frames to fill in most of the data. In essence, each frame of video is analyzed to determine regions with motion and regions that are static. When a P-frame is sent, it carries the changed data for the entire frame. Similarly, each I-frame contains data for the entire frame. Thus, both the peak and the average network load are relatively high.
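The dependency of P-frames on an earlier I-frame can be sketched as a toy encoder and decoder. This is an illustration under stated assumptions, not an actual codec: frames are flat lists of pixel values, each P-frame is assumed to store only the pixels that differ from the immediately preceding frame, and the names `encode_gop` and `decode_gop` are hypothetical.

```python
def encode_gop(frames):
    """Encode the first frame as an I-frame and the remaining frames as P-frames."""
    encoded = [("I", list(frames[0]))]  # I-frame: the complete picture
    for prev, curr in zip(frames, frames[1:]):
        # P-frame: only the positions whose value changed since the previous frame.
        delta = {i: v for i, (p, v) in enumerate(zip(prev, curr)) if p != v}
        encoded.append(("P", delta))
    return encoded


def decode_gop(encoded):
    """Rebuild every frame; a P-frame cannot be decoded without a prior I-frame."""
    frames = []
    current = None
    for kind, payload in encoded:
        if kind == "I":
            current = list(payload)
        else:
            if current is None:
                raise ValueError("P-frame received before any I-frame")
            current = list(current)  # copy, then apply the changes
            for index, value in payload.items():
                current[index] = value
        frames.append(current)
    return frames


frames = [[1, 1, 1, 1], [1, 9, 1, 1], [1, 9, 9, 1]]
encoded = encode_gop(frames)
print(encoded)
print(decode_gop(encoded) == frames)
```

The I-frame carries all four pixel values, whereas each P-frame here carries a single changed value, which is why losing or delaying the I-frame leaves the decoder unable to reconstruct the frames that follow it.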
Modern video encoding techniques work extremely well, and are capable of achieving compression ratios in the range of 200:1 to 500:1. Unfortunately, this type of encoding is computationally very expensive and requires extremely powerful processing capabilities at the transmitting end. Dedicated videoconferencing and telepresence systems, which are cost prohibitive in most instances, do have sufficient processing capabilities and are effective for encoding high-resolution video in real time. On the other hand, PC-based videoconferencing systems seldom have sufficient processing capabilities to handle video encoding operations in real time. For instance, using a modern computer with four 2-GHz CPU cores to encode high-resolution video (1920×1080 pixels at 30 fps) introduces an unacceptable latency of 200 ms. Of note, the processing power that is required to decode the encoded video at the receiving end is considerably less.
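To put these figures in perspective, the following back-of-the-envelope calculation estimates the uncompressed bit rate of 1920×1080 video at 30 fps and the bit rates that remain after compression at 200:1 and 500:1. The 24 bits per pixel figure is an assumption (8 bits for each of three color channels) that is not stated in the text above, and the function name `raw_bit_rate` is illustrative.

```python
def raw_bit_rate(width, height, fps, bits_per_pixel=24):
    """Uncompressed video bit rate in bits per second (24 bpp is an assumption)."""
    return width * height * bits_per_pixel * fps


raw = raw_bit_rate(1920, 1080, 30)
print(f"uncompressed: {raw / 1e9:.2f} Gbit/s")

for ratio in (200, 500):
    print(f"{ratio}:1 -> {raw / ratio / 1e6:.2f} Mbit/s")

# Express the 200 ms encoding latency as a number of frame intervals at 30 fps.
frame_interval_ms = 1000 / 30
latency_frames = 200 / frame_interval_ms
print(f"200 ms is about {latency_frames:.0f} frame intervals")
```

Under these assumptions the raw stream is roughly 1.49 Gbit/s, which compression reduces to between about 3 and 7.5 Mbit/s, and a 200 ms encoding delay corresponds to about six frame intervals.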
Another problem that is associated with modern video encoding techniques is the high peak-to-average data bursts caused by sending the I-frames via the communication network. Data bursts occur initially when the videoconference begins and also at intervals throughout the videoconference. The increased network traffic can result in delays in receiving the I-frame data at the receiving end, leading to choppy video and/or packet loss. Decreasing the frequency of I-frame transmission does not reduce the size of the peak data bursts, and additionally degrades video quality.
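The effect of I-frame spacing on the burst profile can be sketched numerically. The frame sizes below (400,000 bits per I-frame and 50,000 bits per P-frame) are hypothetical values chosen only for illustration, the P-frame size is assumed to stay constant regardless of spacing, and `burst_profile` is an illustrative name.

```python
def burst_profile(i_frame_bits, p_frame_bits, gop_length):
    """Return (peak, average, peak/average) in bits per frame.

    Models one I-frame followed by (gop_length - 1) equally sized P-frames.
    """
    average = (i_frame_bits + (gop_length - 1) * p_frame_bits) / gop_length
    peak = max(i_frame_bits, p_frame_bits)
    return peak, average, peak / average


I_BITS = 400_000  # hypothetical I-frame size
P_BITS = 50_000   # hypothetical P-frame size

profiles = {n: burst_profile(I_BITS, P_BITS, n) for n in (10, 30, 100)}
for gop_length, (peak, average, ratio) in profiles.items():
    print(f"GOP {gop_length:>3}: peak {peak} bits, "
          f"average {average:.0f} bits, peak/average {ratio:.2f}")
```

In this simplified model, spacing the I-frames further apart lowers the average load but leaves the peak unchanged, so the peak-to-average ratio grows rather than shrinks.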
It would be advantageous to provide a method and system that overcome at least some of the above-mentioned limitations of the prior art.