Several techniques are known for encoding (i.e., compressing) video to produce a compressed video signal and for decoding such a compressed signal. See ISO/IEC IS 13818-1,2,3: Generic Coding of Moving Pictures and Associated Audio: Systems, Video and Audio ("MPEG-2"). FIG. 1 depicts an encoder 10 and decoder 12 according to an MPEG-1 or MPEG-2 main profile encoding and decoding standard. An original digital video signal V is inputted to the encoder 10. The video signal V is organized into macroblocks. Each macroblock includes a number of luminance blocks and a number of chrominance blocks. Each block is an 8×8 array of pixels.
Some macroblocks are only spatially encoded by spatial encoder 14 of the encoder 10. Other macroblocks are both spatially and temporally encoded using spatial encoder 14 and temporal encoder 16 of the encoder 10. Macroblocks that are only spatially encoded are outputted directly via output A to the spatial encoder 14. (Such output is shown as achieved by a switch 13 which may be implemented using software, control signals, etc.) Temporally encoded macroblocks are inputted to the subtractor 18 of the temporal encoder 16 to produce prediction error macroblocks. Prediction error macroblocks are outputted via output B to the spatial encoder 14. The spatial encoder 14 includes a discrete cosine transformer 20 which converts each block of each macroblock to spatial frequency coefficients. The spatial frequency coefficients are quantized by a quantizer 22 (and scanned according to some predetermined "zig-zag" or "alternate scan" ordering).
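Illustratively, the transform, quantization and scan steps performed by the spatial encoder 14 can be sketched as follows. This is a minimal sketch only: the function names are hypothetical, the single uniform step size is a simplifying assumption (MPEG encoders use quantization matrices and treat the intra DC coefficient specially), and the naive transform is written for readability rather than speed.

```python
import math

N = 8  # block size used throughout the text

def dct_2d(block):
    """Naive 2-D DCT-II of an N x N block (O(N^4), for illustration only)."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out

def quantize(coeffs, step):
    """Uniform quantization with a single (assumed) step size."""
    return [[int(round(value / step)) for value in row] for row in coeffs]

def zigzag_order(n=N):
    """(row, col) positions along anti-diagonals, alternating direction."""
    order = []
    for d in range(2 * n - 1):
        diag = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        order.extend(diag if d % 2 else reversed(diag))
    return order

# A flat block has energy only in the DC coefficient.
flat = [[128] * N for _ in range(N)]
levels = quantize(dct_2d(flat), step=16)
scanned = [levels[r][c] for r, c in zigzag_order()]
```

For the flat block, only the first scanned value is nonzero, which is the property the subsequent run-level coding exploits.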
In temporally encoding a macroblock, a prediction macroblock is selected for each to-be-encoded macroblock, which prediction macroblock is subtracted from the to-be-encoded macroblock in the subtractor 18. The selection of a prediction macroblock may be made as follows. One or more pictures that precede and follow a to-be-encoded picture may be designated as reference pictures from which prediction macroblocks are selected. (Herein "picture" means field or frame as per MPEG-2 parlance.) Such reference pictures are encoded and decoded using the encoder 10 (and may themselves have been temporally encoded). A search is performed in each available reference picture to identify the macroblock therein that most closely matches the to-be-encoded macroblock of the to-be-encoded picture. This best matching macroblock is then selected as the prediction macroblock for the to-be-encoded macroblock. The prediction macroblock may be located in a reference picture that precedes the to-be-encoded picture, a reference picture that follows the to-be-encoded picture or may be an interpolation of multiple available candidate prediction macroblocks, each from a different reference picture. In some pictures, called P-pictures, the prediction macroblock candidates may only originate in one or more preceding pictures (or an interpolation thereof). In other pictures, called B-pictures, the prediction macroblock candidates may be selected from a preceding or following picture (or an interpolation thereof). In yet other pictures, called I-pictures, no prediction is formed. Rather, each macroblock is only spatially encoded. (Such spatially only encoded macroblocks are sometimes referred to as "intra macroblocks", whereas motion compensated temporally encoded macroblocks are sometimes referred to as "inter macroblocks"). In addition, if no adequate prediction macroblock candidate can be found for a macroblock in a P or B picture, the macroblock may be spatially only encoded.
The temporal encoder 16 includes a decoder 24. The decoder 24 includes a dequantizer 26 for dequantizing the coefficients output by the spatial encoder 14. The dequantized coefficients are inverse discrete cosine transformed by an inverse discrete cosine transformer 28 to produce pixel values. If the decoded macroblock was only spatially encoded, the decoded macroblock may be directly stored in a picture memory 30 via output C. If the decoded macroblock is a prediction error macroblock, the appropriate prediction macroblock is retrieved from the picture memory 30 (as described below) and added to the decoded prediction error macroblock in an adder 32. The macroblock of pixels thus formed is stored in the picture memory 30 via output D. Illustratively, only decoded macroblocks of reference pictures are stored in the picture memory 30.
The selection of the prediction macroblock is achieved as follows. The next to-be-encoded macroblock of the currently encoded picture is inputted to a motion compensator 34. The motion compensator 34 also receives from the picture memory 30 pixel data of reference pictures that may be used to predict the next to-be-encoded macroblock. Illustratively, the motion compensator 34 uses a block matching technique to identify the best matching macroblock (or interpolated macroblock) from the reference picture(s). According to such a technique, multiple candidate prediction macroblocks are extracted from the available reference pictures and compared to the to-be-encoded macroblock. Each candidate prediction macroblock is shifted temporally with respect to the to-be-encoded macroblock (because the candidate prediction macroblocks originate from a different picture in time than the to-be-encoded macroblock) and is spatially shifted relative to the to-be-encoded macroblock in increments as small as 1/2 pixel. The candidate prediction macroblock that best matches the to-be-encoded macroblock is selected as the prediction macroblock for temporally encoding the to-be-encoded macroblock. The prediction macroblock is identified by its temporal and spatial shift, referred to as a motion vector MV. The motion vector MV is outputted from the temporal encoder 16. In addition, such a motion vector MV may be saved (e.g., in picture memory 30) for later identification of the prediction macroblock when decoding the picture in decoder 24.
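The block matching search described above may be sketched as follows. The text does not name a matching criterion; the sum of absolute differences (SAD) used here is an assumption, as are the function names. The sketch searches a single reference picture at integer-pixel displacements only and omits the half-pixel refinement and interpolated candidates.

```python
def sad(picture, top, left, block):
    """Sum of absolute differences between `block` and the same-sized
    region of `picture` whose top-left corner is (top, left)."""
    return sum(
        abs(picture[top + r][left + c] - block[r][c])
        for r in range(len(block))
        for c in range(len(block[0]))
    )

def full_search(reference, block, top, left, search_range):
    """Return ((dy, dx), cost) of the best integer-pixel match.

    (top, left) is the position of the to-be-encoded block in its own
    picture; candidates are displaced by up to +/- search_range pixels.
    """
    height, width = len(reference), len(reference[0])
    size_y, size_x = len(block), len(block[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size_y > height or x + size_x > width:
                continue  # candidate falls outside the reference picture
            cost = sad(reference, y, x, block)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost

# Synthetic 16x16 reference picture in which every pixel value is unique.
reference = [[r * 16 + c for c in range(16)] for r in range(16)]
# The to-be-encoded 4x4 block sits at (4, 4) but matches the reference at (5, 6).
block = [row[6:10] for row in reference[5:9]]
mv, cost = full_search(reference, block, top=4, left=4, search_range=3)
```

Here `mv` is the spatial part of the motion vector MV; the subtractor 18 would then form the prediction error from the block found at that displacement.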
The spatially encoded macroblock and prediction error macroblock coefficient data and the motion vectors MV are furthermore entropy encoded by run-level, variable length encoder 36. The data may be stored in a buffer 37 that models the occupancy of a buffer of known size at the decoder. To ensure that the decoder's buffer does not overflow or underflow, the number of bits produced per encoded macroblock or prediction error macroblock may be adjusted using a quantizer adaptor 39. (In addition, pictures may be skipped and stuffing data may be appended before the beginning of selected encoded pictures.) The compressed video signal (bitstream) thus produced is outputted via a channel (which may be a transmission medium or a digital storage medium/record carrier, such as a magnetic disk, optical disc, memory, etc.) to the decoder 12. (For sake of brevity, the encoding of audio data, and the encapsulation of the compressed video and audio signals in a system layer stream, such as a transport stream or program stream, and a channel layer format, have been omitted in this discussion.)
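The run-level stage of the encoder 36 may be illustrated by the following sketch, which converts a scanned list of quantized coefficients into (run, level) pairs. The variable length code table that would map each pair to a codeword is omitted, and the "EOB" string is a hypothetical stand-in for an end-of-block code; the function names are likewise assumptions.

```python
def run_level_encode(scanned):
    """Turn a scanned coefficient list into (run, level) pairs, where `run`
    counts the zeros preceding each nonzero `level`. Trailing zeros are
    replaced by a single end-of-block marker."""
    pairs, run = [], 0
    for level in scanned:
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    pairs.append("EOB")
    return pairs

def run_level_decode(pairs, length=64):
    """Inverse operation: expand (run, level) pairs back to `length` values."""
    scanned = []
    for item in pairs:
        if item == "EOB":
            break
        run, level = item
        scanned.extend([0] * run)
        scanned.append(level)
    scanned.extend([0] * (length - len(scanned)))
    return scanned

# A typical block after quantization: a few low-frequency values, then zeros.
coefficients = [64, 0, 0, -3, 1] + [0] * 59
pairs = run_level_encode(coefficients)
restored = run_level_decode(pairs)
```

The 64 coefficients collapse to three pairs and a marker, and the decoder 38 recovers the original list by performing the inverse operation.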
The decoder 12 has a buffer 33 in which the received compressed video signal is temporarily stored. The bits of the compressed video signal are outputted to a variable length, run-level decoder 38 which performs the inverse operation as the variable length encoder 36 to recover the motion vectors MV and macroblock and prediction error macroblock coefficient data. The motion vectors MV and macroblock coefficient data are inputted to a decoder subcircuit 40 which is analogous to the decoder 24. The decoder subcircuit 40 decodes the video to produce decoded video DV for presentation.
MPEG-2 also provides scalability layers. See B. HASKELL, A. PURI & A. NETRAVALI, DIGITAL VIDEO: AN INTRODUCTION TO MPEG-2, ch. 9, p. 183-229 (1997). FIG. 2 shows a spatial scalability encoder 42 and decoders 44 and 46. The spatial scalability encoder 42 may be constructed simply as follows. A video signal is inputted to a spatial low pass filter or decimator 48 to produce a lower spatial resolution version of the video signal. A lower or base layer encoder 12 encodes the low resolution version of the video signal to produce a lower layer or base layer compressed video signal LLV. The base layer compressed video signal LLV is a fully and independently decodable and presentable video signal.
Next, an enhancement layer compressed video signal ELV is formed as follows. The full resolution version of the video signal V is predictively encoded in the spatial enhancement encoder 51. However, each temporal prediction macroblock produced by the motion compensator 34 of the spatial enhancement encoder 51 is inputted to a subtractor 52. The base layer compressed video signal LLV is decoded in a decoder 12 and interpolated in a spatial interpolator 50 to the full resolution of the original video signal V. This base layer decoded video signal, which is reconstructed from the base layer compressed video signal, contains reconstructed macroblocks which are used as spatial predictors. That is, the reconstructed macroblocks are fed to the subtractor 52 where they are subtracted from corresponding temporal prediction macroblocks produced by the motion compensator 34. (The spatial prediction macroblocks may be weighted by subtractor 52 before they are subtracted from the temporal prediction macroblocks). The prediction error macroblocks thus formed are then spatially encoded as described above to form an enhancement layer compressed video signal ELV.
Note that both the enhancement layer and base layer encoders 10 and 51 are similar to that described above and both form temporal predictions. This means that a spatial scalability encoder 10,51 must have two picture memories 30, 30' (i.e., capacity to store reference pictures for performing block matching at both the base layer and enhancement layer).
Two types of decoders are permissible for the spatial scalability profile encoded video signal. A first type of decoder 44 uses a decoder 12 of similar construction as shown in FIG. 1 to decode only the base layer compressed video signal LLV to produce a lower fidelity decoded base layer video signal DVL. A second type of decoder 46 decodes both the base layer compressed video signal LLV and the enhancement layer compressed video signal ELV. A base layer decoder 12 of the decoder 46 decodes the base layer compressed video signal LLV. A spatial interpolator 50 interpolates the base layer decoded video signal to the full resolution of the original video signal V. An enhancement layer decoder 53 decodes the enhancement layer compressed video signal. An adder 54 selectively adds (weighted) reconstructed macroblocks of the interpolated, decoded base layer video signal to prediction macroblocks reconstructed from the enhancement layer compressed video signal in order to reconstruct an enhanced fidelity enhancement layer video signal DVE.
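The layering principle of FIG. 2 can be reduced to a one-dimensional sketch under strong simplifying assumptions: decimation by pair averaging, interpolation by sample repetition, no temporal prediction, no weighting, and lossless coding of both layers. None of these choices comes from the text, and the function names are hypothetical.

```python
def decimate(samples):
    """Halve the resolution by averaging neighbouring pairs (integer math)."""
    return [(samples[i] + samples[i + 1]) // 2
            for i in range(0, len(samples) - 1, 2)]

def interpolate(samples):
    """Return to full resolution by repeating every sample."""
    return [s for s in samples for _ in range(2)]

def encode_layers(signal):
    """Produce a base layer and an enhancement layer from one signal."""
    base = decimate(signal)                    # lower resolution version
    predictor = interpolate(base)              # spatial predictor at full size
    enhancement = [s - p for s, p in zip(signal, predictor)]
    return base, enhancement

def decode_base_only(base):
    """First type of decoder: the base layer is presentable on its own."""
    return base

def decode_both(base, enhancement):
    """Second type of decoder: interpolated base plus enhancement data."""
    return [p + e for p, e in zip(interpolate(base), enhancement)]

signal = [10, 12, 20, 24, 30, 30, 8, 0]        # even length assumed
base, enhancement = encode_layers(signal)
low_res = decode_base_only(base)
full_res = decode_both(base, enhancement)
```

The base layer alone yields a usable half-resolution signal, while a decoder that receives both layers recovers the full-resolution signal.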
FIG. 3 shows an SNR scalability encoder 56 and decoders 58 and 60. The encoder 56 is very similar to that described above, with the following differences. As before, the spatial encoder has a quantizer 22 which outputs quantized coefficients to a run-level, variable length encoder 36. The quantized coefficient signal is dequantized by a dequantizer 26. The dequantized coefficient signal is subtracted from the original coefficient signal (outputted from the discrete cosine transformer 20) in a subtractor 64. The error signal thus produced is quantized in a second quantizer 22' to produce a quantizer error signal. The quantizer error signal is run-level and variable length encoded in a second run-level, variable length encoder 36'.
The decoder 66 of the temporal encoder 68 of the encoder 56 has a first dequantizer 26 which receives the quantized coefficients outputted from quantizer 22 and dequantizes them. The decoder 66 also has a second dequantizer 26' that receives the quantized error coefficients outputted from quantizer 22' and dequantizes them. These two dequantized coefficient signals are then added together in an adder 70. The rest of the encoder 56 is the same as in FIG. 1.
The encoded signal outputted from the run-level, variable length encoder 36 of encoder 56 is a fully independently decodable base layer compressed video signal LLV. Such a signal can be received at a base layer decoder 60 which has a similar structure as decoder 12.
The encoded signal outputted from the variable length encoder 36' of encoder 56 is an enhancement layer compressed video signal ELV which can only be decoded in conjunction with the base layer compressed video signal LLV. An enhancement layer decoder 58 has two run-level, variable length decoders 38, 38' for run-level and variable length decoding the base layer compressed video signal LLV and the enhancement layer compressed video signal ELV, respectively. These decoded video signals are then fed to dequantizers 26 and 26', respectively which dequantize these signals. Adder 70 then adds the two dequantized signals together prior to inverse discrete cosine transformation. The remainder of the decoding process is similar to before.
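The two-stage quantization underlying SNR scalability may be sketched as follows for a handful of transform coefficients. The step sizes and function names are assumptions chosen for illustration; transform, scanning and entropy coding are omitted.

```python
def quantize(value, step):
    return int(round(value / step))

def dequantize(level, step):
    return level * step

def snr_encode(coeffs, base_step, enh_step):
    """Coarse quantization for the base layer, then quantization of the
    remaining quantizer error with a finer step for the enhancement layer."""
    base_levels = [quantize(c, base_step) for c in coeffs]
    errors = [c - dequantize(q, base_step) for c, q in zip(coeffs, base_levels)]
    enh_levels = [quantize(e, enh_step) for e in errors]
    return base_levels, enh_levels

def decode_base(base_levels, base_step):
    """Base layer decoder: uses the coarse levels only."""
    return [dequantize(q, base_step) for q in base_levels]

def decode_enhanced(base_levels, enh_levels, base_step, enh_step):
    """Enhancement layer decoder: adds the two dequantized signals
    before the inverse transform."""
    return [dequantize(b, base_step) + dequantize(e, enh_step)
            for b, e in zip(base_levels, enh_levels)]

coeffs = [1024, -37, 13, 5]
base_levels, enh_levels = snr_encode(coeffs, base_step=16, enh_step=4)
coarse = decode_base(base_levels, 16)
refined = decode_enhanced(base_levels, enh_levels, 16, 4)

def total_error(decoded):
    return sum(abs(c - d) for c, d in zip(coeffs, decoded))
```

With these assumed values the base layer alone reconstructs the coefficients with a total absolute error of 13, whereas adding the enhancement layer reduces the error to 3.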
MPEG-2 also has a data partitioning profile and a temporal scalability profile. In the data partitioning profile, the bits of selected quantized coefficients are partitioned into a low precision portion and a precision extension portion. The precision extension portion, which serves solely to distinguish close quantization coefficient levels, is formed into an enhancement layer compressed video signal, whereas the remainder of the original encoded video signal forms a base layer compressed video signal. According to the temporal scalability profile, an original video signal is decimated in time to form a lower temporal resolution video signal. The lower temporal resolution video signal is encoded in a base layer encoder similar to encoder 12. The original temporal resolution video signal, and a low fidelity decoded base layer video signal are inputted to an enhancement layer encoder. The decoded reference pictures of the low fidelity decoded base layer video signal are used in addition to the decoded pictures of the enhancement layer compressed video signal for forming predictions.
Each of the scalability layers has been proposed for purposes of providing two levels of resolution or quality using the same bitstream. Base layer decoders can only decode the base layer compressed video signal to produce a lower fidelity decoded base layer video signal. Enhancement layer decoders can decode the base and enhanced layer compressed video signals to produce an enhanced fidelity decoded enhancement layer video signal. Nevertheless, both a base layer decoder and an enhancement layer decoder can decode the same bitstream.
It is desirable to use a computer as a video communication terminal. Low cost cameras are available which can produce high quality color and monochrome digital video. The problem is that the bit rate of such digital video signals far exceeds the maximum data input bit rate of any port on a common personal computer. Conventional solutions to this problem include using a camera with a proprietary interface and video capture card that is connected to the computer bus.
U.S. patent application Ser. Nos. 08/708,388 and 08/792,683 propose alternative solutions. These applications propose a camera with a built in encoder, or an encoder adaptor for a conventional video camera. The camera with encoder, or the encoder adaptor, has an interface that is compliant with the Universal Serial Bus (USB) standard. See Open HCI, Universal Serial Bus Specification v.1.0 Jan. 19, 1996. FIG. 4 shows a system 100 with both types of camera attachment architectures. Illustratively, the system 100 can be used in a real-time, interactive moving picture communication application, a real-time non-interactive picture communication application, a still or moving picture capture application, etc. As shown, a camera 110 is connected to a computer system 120 externally to the housing 156 of the computer system 120. The computer system 120 illustratively includes a CPU bus 122, a system bus 124 (e.g., a PCI bus) and an I/O expansion bus 126 (e.g., an ISA bus). Connected to the CPU bus 122 is at least one processor 128 and a "north" bridge or memory controller 130. The north bridge 130 connects a cache 132 and a main memory 134 to the processors 128 on the CPU bus 122. The north bridge 130 also enables data transfers between devices on the system bus 124 and the memories 132 and 134 or the processors 128. Also connected to the system bus 124 is a graphics adapter 136. A display monitor 138 may be connected to the graphics adapter 136. As shown, an Ethernet adapter 160 may be connected to the system bus 124.
Connected to the I/O expansion bus 126 is a disk memory 140 and interface, such as an IDE interface, a modem 158, and input devices 142 such as keyboard 144 and mouse 146. (Alternatively, the keyboard 144 and mouse 146 may also be connected to the USB hub 150.) Also connected between the system bus 124 and the I/O expansion bus 126 is a south bridge 148 or I/O bridge. The south bridge 148 enables data transfers between devices on the I/O expansion bus 126, such as modem 158, and devices on the USB 200 or devices on the system bus 124. Illustratively, according to the invention, the south bridge 148 also includes a USB hub 150. The USB hub 150 has one or more serial ports 152 that are connected to standard USB compliant connectors 154 to which a connection may be made totally externally to the housing 156 of the computer system. Illustratively, the USB hubs 150, 117, 168, 190 and cables 119 form the USB bus 200.
The camera 110 is shown as including an imaging device 111, such as a tube, CMOS photo sensor or CCD, on which video images are incident. The imaging device 111 converts the image to a motion picture video signal representative thereof. The video signal is converted to digital form in ADC 113. The digital signal outputted from ADC 113 is received at a bit rate reduction circuit 115. The bit rate reduction circuit 115 may be a programmable frame rate/resolution reduction circuit. Advantageously, however, the bit rate reduction circuit is a programmable video encoder. The bit rate reduced video signal is outputted to a USB hub circuit 117. The USB hub circuit 117 has a serial port 118 that can output the video signal as a serial bitstream via cable 119. The cable 119, which is plugged into the connector 154 (externally to the computer housing 156), delivers the video signal to the serial port 152 of the hub circuit 150 in the south bridge 148.
The reduction of the bit rate by the video encoder 115 ensures that the video signal has a sufficiently low bandwidth to be received by the USB serial port 152. Various compression standards such as MPEG-1, MPEG-2, H.263, etc. may be used by the bit rate reduction circuit 115 to encode the video signal.
Note that the USB 200, in particular, the serial ports 118 and 154 of the hubs 150, 117, 168 support bidirectional transfer of signals. In addition to transferring video signals from the hub 117 to the hub 150, data may be transferred from the hub 150 to the hub 117 by interspersing the video signal and the data transfer signal. Such data transfers can be used to program/adjust the video encoder 115. For example, the video encoder 115 can be programmed to encode the video in compliance with a number of compression standards such as H.263, MPEG-1, MPEG-2, JPEG, motion JPEG, etc. Furthermore, within any given standard, different parameters may be adjusted, such as quantization step sizes, inter/intra decision thresholds, group of picture formats, bit rate, etc., and different encoding options, such as arithmetic coding, may be selected.
Advantageously, a microphone 162 receives an audible sound and converts it to an audio signal in real time as the camera 110 receives an image. An ADC 164 digitizes the audio signal and an audio encoder 166 encodes the audio signal. Illustratively, a USB hub circuit 168 receives the compressed audio signal and transmits it in bit serial form from serial port 170 to the hub 117, interspersed with the video signal outputted from the camera 110 and any other data signal transmitted on the USB 200.
The hub 150 receives the bit rate reduced video (and illustratively the compressed audio signal). The received signals may be transferred via south bridge 148, system bus 124, and north bridge 130 into one of the memories 132 or 134. From there, the video and/or audio signal may be processed by the processor 128, e.g., error protected using an error protection code, encoded, if necessary, etc. The video and/or audio signal may then be outputted (in multiplexed form) via north bridge 130, system bus 124, Ethernet adapter 160 and an ethernet network to a far end, remote video conferencing system 100' of similar architecture as the video conferencing system 100 (i.e., having a computer system 120' and camera 110'). Alternatively, or in addition, the compressed video and/or compressed audio signals can be outputted via north bridge 130, system bus 124, south bridge 148, I/O expansion bus 126, modem 158 and a public telephone network to the far end, remote video conferencing system 100'. In another embodiment, the compressed video and/or compressed audio signals received at the hub 150 are outputted directly to the Ethernet adapter 160 or modem 158, both of which can be connected to the USB 200.
A compressed video and/or compressed audio signal may be received from the far end, remote video conferencing system 100' at the near end, local video conferencing system 100 shown in FIG. 4. The compressed video and/or compressed audio signals may be received at the Ethernet adapter 160 or at the modem 158. A compressed video and/or compressed audio signal received at the Ethernet adapter 160 may be transferred via system bus 124 and north bridge 130 to main memory 134 or cache memory 132. Alternatively, if the compressed video and compressed audio signals are received at the modem 158, the compressed video and compressed audio signals are transferred via the I/O expansion bus 126, south bridge 148, system bus 124 and north bridge 130 to the memory 132 or 134. From there, the processor 128 may separate the compressed video and compressed audio signals for further processing such as error correction, decryption, and decoding. Alternatively, a special purpose processor (not shown) may be connected to the system bus 124 for performing at least the video signal decoding. In yet another embodiment, a special processor for performing video decoding may be included with the graphics adapter 136 to which the compressed video signal is directly transferred (i.e., from the modem 158 or Ethernet adapter 160). The decoded video signal is transferred to the graphics adapter 136 (or is present thereat). The graphics adapter 136 outputs the decoded video signal on the display monitor 138. In addition, the decoded audio signal is also received via the graphics adapter 136 and outputted to a loudspeaker contained in the display monitor 138.
In the alternative digital video capture embodiment, a digital or analog video signal produced by a camera 110" is outputted to an adaptor 180. The adaptor 180 has a video encoder 195 with a built in USB hub 190. The USB hub 190 is part of the USB 200. The video encoder 195 encodes a digital version of the received video signal in an analogous fashion as above, and transfers the compressed video signal via the USB hub 190 and USB 200 to the computer system 120.
The system 100 therefore provides an economical and useful manner for providing video communications on a personal computer system 120. In a typical home or business, the communicating systems 100 and 100' typically use modems and the telephone network to communicate. Recent advances enable duplex communication of up to 33.6 Kbits/sec using an ordinary voice connection, assuming that a "clean" (i.e., low noise) circuit is established between the systems 100 and 100' and both systems 100 and 100' have compliant modems with such capability. Sometimes, a single ISDN connection is used to carry the video conference, thereby affording up to 128 Kbits/sec for each communication.
At such low bit rates, a high level of compression is needed to produce a real-time, low latency compressed moving picture video signal. Moving pictures decoded from such compressed video signals have a large amount of humanly perceptible compression artifacts. Such artifacts degrade the video signal and lower its quality.
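The degree of compression involved can be estimated with simple arithmetic. The source format below (352×288 pixels, 12 bits per pixel as in 4:2:0 sampling, 30 pictures per second) is an assumption chosen for illustration and does not come from the text; the 12 Mbit/s figure is the nominal full speed signaling rate of USB 1.x, included only for comparison with the two channel rates mentioned above.

```python
def raw_bit_rate(width, height, bits_per_pixel, frames_per_second):
    """Uncompressed bit rate of a digital video signal, in bits per second."""
    return width * height * bits_per_pixel * frames_per_second

# Assumed source format (not from the text): CIF, 4:2:0 sampling, 30 frames/s.
RAW = raw_bit_rate(352, 288, 12, 30)

CHANNELS = {
    "voice-band modem": 33_600,            # bits per second
    "single ISDN connection": 128_000,
    "USB full speed (nominal)": 12_000_000,
}

def required_ratio(raw, channel):
    """Compression ratio needed for `raw` to fit within `channel`."""
    return raw / channel

for name, rate in CHANNELS.items():
    print(f"{name}: about {required_ratio(RAW, rate):.0f}:1")
```

Under these assumptions the uncompressed signal is roughly 36.5 Mbit/s, so a modest ratio of about 3:1 suffices for the link between camera and computer, whereas the modem channel requires a ratio of more than 1000:1 even before audio and channel overhead are taken into account.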
In addition to presenting (displaying) at the local system 100 decoded pictures of the remotely originating compressed video signal, it is also desirable to present decoded pictures of the locally originating video signal. For example, the display screen of the display monitor 138 may be divided into two areas or may display two windows. The first window or area displays pictures decoded from the remotely originating compressed video signal. The second window or area displays pictures decoded from the locally originating compressed video signal (which locally originating compressed video signal is also transmitted to the remote system 100'). The problem is that the displayed pictures of the locally originating compressed video signal are reconstructed from the very same locally compressed video signal that is transmitted to the remote system 100'. As noted above, the communication channel over which the locally originating compressed video signal is transmitted has a limited bandwidth. As a result, the locally originating compressed video signal must be highly compressed so that fidelity degrading compression artifacts are introduced into the reconstructed pictures. While such compression artifacts (in the pictures reconstructed from the locally originating compressed video signal) must be tolerated at the remote system 100' considering the bandwidth limitations of the communication channel from which the locally originating compressed video signal is received, such channel bandwidth constraints do not exist at the local system 100 vis-a-vis the locally originating compressed video signal. Thus, such degradations in fidelity of locally displayed pictures reconstructed from locally originating compressed video at the local system 100 are disadvantageous and unnecessary.
It is an object of the present invention to overcome the disadvantages of the prior art.