The present invention relates to a method and apparatus for digitally encoding video image data, and is particularly suited for encoding Internet Web pages for transmission and display.
With the ever-increasing popularity of the Internet, a number of systems and devices have appeared in the marketplace that substantially reduce the initial equipment expense required for accessing the Internet. For example, inexpensive dedicated processors are available which enable a user to access the Internet using a telephone line, and download Internet Web pages for display on the user's television set.
Recently, an even more attractive Internet access system has been proposed which completely eliminates the need for a user to have a telephone line and a dedicated processor running a browser application locally at their premises. This system employs a modified cable television (CATV) system that uses the downstream cable channels to transmit Internet-based information to the system users for display on their television sets. Each user is provided with a set top converter box that has been modified to enable entry of data or commands via a keyboard, remote controller or other input device. One or more upstream channels are provided which transmit the entered data or commands to a headend server in the CATV system. The headend server is interfaced to the Internet via an Internet Service Provider (ISP), for example, and includes processing equipment which can simultaneously operate a plurality of resident Internet browser applications, one for each system user requesting Internet access. The headend server therefore contains all of the processing equipment necessary to access the Internet through the ISP, while each user's set top box acts as an input/output device for interfacing the user to the Internet.
In the operation of the system, a user requests Internet access by entering an appropriate command into the set top box that transmits the command through an upstream channel to the headend server. In response, the headend server connects the user to one of the resident browser applications via one of the system's downstream channels.
The Internet-based information, e.g., Web pages, can be transmitted through the downstream channel in a number of ways. In an analog implementation, for example, the Internet data can be inserted into the vertical or horizontal blanking intervals of the conventional analog television signals which are simultaneously transmitted on the selected downstream channel. In an all-digital embodiment, however, the Internet data must be encoded in the same format that is employed for digitally encoding video signals. More particularly, the data must be encoded using standardized procedures for encoding, storing, transporting and displaying continuous video frames that have been specified by the Moving Picture Experts Group (MPEG). Thus, the image bit map generated by the browser application is not rendered at the headend, but instead is further compressed by an MPEG image encoder. It is the compressed image data that is transmitted to a user.
MPEG encoding is a video image compression technique that substantially reduces the amount of motion picture image data that must be transmitted. This data reduction is made possible because spatial redundancy exists within an image frame (intra frame compression). In addition, each succeeding frame in a motion picture video usually contains substantial temporal redundancy, i.e., portions which have either not changed from the previous frame, or have only been moved relative to the previous frame (inter frame compression). When spatial redundancy is removed from a frame, the frame is said to be encoded as an intra-coded frame (I-frame). In an inter frame compression scheme, two different compression algorithms may be employed to generate two kinds of encoded frames. A compressed image frame is called a Predictive-coded frame (P-frame) if only a prior frame is compared and the difference is coded. Another inter frame compression results in a Bidirectionally predictive-coded frame (B-frame) if both a prior frame and a post frame are used for encoding. In these cases, it is not necessary to transmit all of the image data for each frame. Instead, only the difference data representing the portions in the current frame that have changed from the neighboring (previous or later) frame(s) is transmitted. For areas in an image which have been moved relative to the previous frame, it is possible to search for these areas, and then generate a motion vector which instructs a receiving decoder to construct a portion of the next image frame by moving a corresponding portion in the previous image frame a specified displacement and direction.
To encode a sequence of video frames, the first frame is encoded as an intra or I frame where information for all of the pixels in the frame needs to be transmitted since no previous frame information is available. The next frame in the sequence can then be encoded either as a P (predictive) frame or a B (bidirectionally predictive-coded) frame which includes only the difference or motion vector data resulting from the frame comparisons. P or B frames can continue to be used for encoding the succeeding frames in the sequence until a substantial change, such as a scene change, occurs, thus necessitating formation of another I frame. In practice, however, the encoder is programmed to encode I frames at a constant rate, such as once every N frames. The MPEG encoding procedure thus compresses images by suppressing statistical and subjective redundancy both between frames and within each frame. An MPEG decoder is capable of decompressing the coded image close to its original format so that the decompressed image may be displayed on a display device, such as a television or computer monitor.
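The constant-rate I frame schedule described above can be illustrated with a minimal sketch. The function name frame_types and the choice to fill the remaining slots with P frames only are assumptions made for illustration; an actual encoder may also insert B frames or force an I frame on a scene change.

```python
def frame_types(num_frames, n):
    """Return one frame type per frame, forcing an I frame once every n frames.

    Hypothetical helper: all other frames are marked "P" for simplicity.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return ["I" if i % n == 0 else "P" for i in range(num_frames)]


# Ten frames with an I frame every fourth frame.
print("".join(frame_types(10, 4)))  # IPPPIPPPIP
```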
In the Internet Web page display application, only P frames are usually employed for inter frame compression because B frame coding requires comparison with post (later in time) frames which are not available immediately. However, a B frame can be encoded by forward comparison only between the current frame and the prior frame as a special case, and in this instance, can also be employed for Web page inter frame compression.
In the application of MPEG encoding to the previously described CATV system, each user's set top box includes an MPEG decoder for decoding the digital video bit stream received on the downstream channels. This requires that any Internet Web page image data to be transmitted to the set top boxes also be MPEG encoded. An MPEG encoder is thus incorporated in the cable headend to encode the browser generated Web page image data, which usually is a bit map, before it is transmitted on one of the downstream channels to a user's set top box.
In general, however, MPEG encoding of Web page image data is needlessly intensive from a computation standpoint since Web pages do not usually incorporate full motion video, and often appear to be nothing more than a still image. Strictly speaking, though, the Web page is not a still image. Due to the limited viewing size of a display device, the Web page is usually larger than the display device's viewing area. A user may therefore scroll a Web page to move the page horizontally or vertically to view the whole page. Depending on the speed at which the page is scrolling, the images on the display device may thus be considered to be a series of video frames displayed at a variable frame rate. Other Web pages may contain a small animation window in which several localized pictures are alternately displayed at a certain rate. Java applet animations and regional character updates which occur as a user types an e-mail message are other examples of this local animation scenario. In both of these cases, MPEG inter frames may be constructed after the generation of a first, intra frame, to reduce the number of bits needed to represent each frame, thus substantially reducing the required bandwidth in the communication link.
As discussed previously, when an inter frame is generated, motion vectors must be found, coded and transmitted so that the MPEG decoder can reform the frame. A motion vector search is one of the most difficult tasks in designing an MPEG encoder. Since the MPEG committee defined only the syntax and semantics of a compressed frame, but did not define how motion vector searching should be implemented, numerous proprietary motion vector search algorithms were developed by various encoder vendors. For continuous video compression, however, a motion vector search is very complicated and requires a large percentage of the entire encoding computational effort. More particularly, in MPEG encoding, each video frame to be encoded is subdivided into a plurality of 64-pixel (8×8) blocks, and four such blocks covering a 16×16 pixel area are known as a macroblock. During encoding, the MPEG encoder searches for the best match between each macroblock of the present frame to be encoded and the corresponding macroblock in the previous frame. This search for the best match is known as motion estimation.
The existing algorithms for motion estimation fall into two categories: feature/region matching and gradient-based. In the first category, both block matching and hierarchical block matching can be employed for motion estimation. For encoding a continuous video, the encoder has to search the entire screen (exhaustive search) to find the best match because the encoder knows nothing about the motion from frame to frame. In gradient-based motion estimation, the exhaustive search may be avoided at the price of solving linear equations during search.
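To make the cost of an exhaustive block-matching search concrete, the following is a minimal sketch and not any vendor's algorithm. It assumes frames are lists of rows of integer luminance values, uses the sum of absolute differences (SAD) as the matching criterion, and expresses the motion vector as the displacement from the block's position in the present frame to the matching block in the previous frame. The names sad and full_search are hypothetical.

```python
MB = 16  # macroblock edge length in pixels


def sad(prev, cur, px, py, cx, cy, size=MB):
    """Sum of absolute differences between the block of prev at (px, py)
    and the block of cur at (cx, cy)."""
    total = 0
    for r in range(size):
        prow = prev[py + r]
        crow = cur[cy + r]
        for c in range(size):
            total += abs(prow[px + c] - crow[cx + c])
    return total


def full_search(prev, cur, cx, cy, size=MB):
    """Test every block position in prev against the block of cur at (cx, cy).

    Returns (dx, dy, best_sad, candidates_tested).
    """
    height, width = len(prev), len(prev[0])
    best = None
    tested = 0
    for py in range(height - size + 1):
        for px in range(width - size + 1):
            cost = sad(prev, cur, px, py, cx, cy, size)
            tested += 1
            if best is None or cost < best[2]:
                best = (px - cx, py - cy, cost)
    return best + (tested,)


# Small synthetic example: a 12x12 frame with 4x4 blocks to keep it fast.
W = H = 12
prev = [[y * W + x for x in range(W)] for y in range(H)]
# cur is prev scrolled up by 2 rows; the 2 newly exposed rows are filled with 0.
cur = [prev[y + 2][:] if y + 2 < H else [0] * W for y in range(H)]
print(full_search(prev, cur, cx=4, cy=4, size=4))  # (0, 2, 0, 81)
```

Even in this toy case, 81 candidate positions are tested for a single block; the count grows with the frame area, and the search is repeated for every macroblock of every inter frame.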
All of the algorithms require many iterations to complete the motion estimation. After the best match is found, the difference between the matched macroblocks is calculated by comparing the macroblocks. If the difference is small enough, a motion vector is generated which specifies the direction and offset of the motion. Both the difference and the motion vector are encoded and transmitted. If the difference is larger than a threshold, the macroblock of the present frame is instead intra coded, in the same manner as a macroblock of an I frame.
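The per-macroblock decision just described can be sketched as follows. The MacroblockDecision type, the choose_mode function and the numeric threshold of 512 are all hypothetical; the text does not specify a difference measure or a threshold value.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MacroblockDecision:
    mode: str                                  # "inter" or "intra"
    motion_vector: Optional[Tuple[int, int]]   # None for intra macroblocks


def choose_mode(best_vector, best_difference, threshold):
    """Keep the motion vector when the best match is close enough;
    otherwise fall back to intra coding for this macroblock."""
    if best_difference <= threshold:
        return MacroblockDecision("inter", best_vector)
    return MacroblockDecision("intra", None)


# Assumed threshold of 512, chosen only for illustration.
print(choose_mode((0, 2), 10, 512).mode)    # inter
print(choose_mode((3, 1), 9000, 512).mode)  # intra
```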
In view of the foregoing, any video image encoding technique that eliminates the need for motion vector search algorithms would be desirable because of the resulting substantial savings in computation time and intensity.
The present invention provides an encoding technique for encoding low-frame rate video image data, such as Internet Web pages, in which motion vectors are generated without search algorithms by taking advantage of prior knowledge regarding one or more characteristics of the images. In the preferred embodiments of the invention, the image characteristics are provided to an encoder, such as an MPEG encoder, from an image generating application, and relate to movement of or in the images.
More particularly, both embodiments of the invention are designed specifically for use with CATV systems, as discussed previously, which include Internet access capabilities. In these systems, when a user scrolls through a Web page, scrolling input signals are sent by the user's set top box to the browser application in the headend. These signals define the direction of the scrolling and its offset, typically in terms of x and y coordinates. In addition, the Web pages may contain one or more animation windows, the graphical content of which alternates or changes every second or so. The browser application can easily detect whether one or more animation windows are present in the Web page image, and if so, determine the coordinates of the animation window(s). The scrolling coordinate and animation window information can also be employed by the encoder to determine the exact change between a previous image frame and a present image frame that has occurred as a result of the scrolling and/or animation window movement. With this knowledge, a motion vector search is unnecessary, and can be replaced with a set of calculations employing the scrolling coordinates.
In the first preferred embodiment of the present invention, the encoder employs the scrolling coordinates to determine motion estimation for all of the macroblocks in the present frame relative to the previous frame in a single step, and without a multiple iteration search. A comparison between the macroblock of the present frame and the corresponding macroblock of the previous frame determined by the motion estimation indicates whether the changed macroblock is the same as the corresponding macroblock in the previous frame which has been shifted in the direction and amount specified by the scrolling coordinates. If so, the motion vector for this macroblock of the frame has been located, and the motion vector and the difference between the macroblocks are encoded and transmitted. The process is repeated for each macroblock in the frame to generate the resulting inter frame. The resulting motion vector calculation and algorithm using the scrolling coordinates requires much less computation than a full search algorithm.
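The single-step estimation of the first embodiment might be sketched as shown below. This is an illustration under stated assumptions, not the claimed implementation: frames are lists of rows of integer pixel values, the scroll offset is given as the displacement from a block in the present frame to its source in the previous frame, the difference measure is the sum of absolute differences, and a macroblock whose source falls outside the previous frame (newly exposed content) is marked intra. The function name scroll_motion_estimate and the threshold parameter are hypothetical.

```python
def scroll_motion_estimate(prev, cur, scroll_dx, scroll_dy, size=16, threshold=0):
    """Classify each macroblock of cur using a known scroll offset, with no search.

    Returns a dict mapping (mb_col, mb_row) to ("inter", (dx, dy), difference)
    or ("intra", None, None).
    """
    height, width = len(cur), len(cur[0])
    decisions = {}
    for cy in range(0, height - size + 1, size):
        for cx in range(0, width - size + 1, size):
            key = (cx // size, cy // size)
            # One candidate only: the block displaced by the scroll offset.
            px, py = cx + scroll_dx, cy + scroll_dy
            inside = 0 <= px <= width - size and 0 <= py <= height - size
            if inside:
                cost = sum(
                    abs(prev[py + r][px + c] - cur[cy + r][cx + c])
                    for r in range(size)
                    for c in range(size)
                )
                if cost <= threshold:
                    decisions[key] = ("inter", (scroll_dx, scroll_dy), cost)
                    continue
            decisions[key] = ("intra", None, None)
    return decisions


# 8x8 frame, 4x4 blocks; the page content has moved up by 2 rows.
W = H = 8
prev = [[y * W + x for x in range(W)] for y in range(H)]
cur = [prev[y + 2][:] if y + 2 < H else [0] * W for y in range(H)]
result = scroll_motion_estimate(prev, cur, scroll_dx=0, scroll_dy=2, size=4)
print(result[(0, 0)])  # ('inter', (0, 2), 0)
print(result[(0, 1)])  # ('intra', None, None)
```

Each macroblock is compared against exactly one candidate position, so the work per macroblock is a single block comparison rather than one comparison per candidate position in the frame.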
In the second preferred embodiment, the encoder receives animation window or other information from the browser application that indicates that certain portions of an image are continuously changing, and thus should be encoded as an intra frame. If the browser application detects that one or more animation windows are present in the Web page image, it determines the coordinates of the animation window(s), and passes the coordinates to the encoder. The encoder knows that only the portions of the Web page enclosed by the animation window will undergo changes from frame to frame, absent any scrolling operations. Thus, if the encoder receives animation window coordinates from the browser application, the encoder knows that it can encode the present frame of the Web page by encoding only those macroblocks that are contained in the one or more animation windows. These are encoded either as intra macroblocks (no need for motion estimation) or as forward predictive coded macroblocks by performing a motion estimation constrained within the animation window. The remaining macroblocks are encoded as zero motion vector blocks, which means that they have not changed from the previous frame.
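The macroblock classification of the second embodiment can be sketched as follows. The sketch assumes animation windows are reported as (x, y, width, height) rectangles in pixels and, as a conservative choice not stated in the text, treats any macroblock that overlaps a window (even partially) as one that must be coded. The function name classify_macroblocks and the labels "coded" and "zero_mv" are hypothetical.

```python
def classify_macroblocks(frame_w, frame_h, windows, size=16):
    """Map each (mb_col, mb_row) to "coded" if it overlaps an animation window,
    otherwise to "zero_mv" (unchanged from the previous frame)."""
    modes = {}
    for row in range((frame_h + size - 1) // size):
        for col in range((frame_w + size - 1) // size):
            x0, y0 = col * size, row * size
            x1, y1 = x0 + size, y0 + size
            overlaps = any(
                x0 < wx + ww and wx < x1 and y0 < wy + wh and wy < y1
                for (wx, wy, ww, wh) in windows
            )
            modes[(col, row)] = "coded" if overlaps else "zero_mv"
    return modes


# 64x48 frame (4x3 macroblocks) with one 20x10 animation window at (20, 10).
modes = classify_macroblocks(64, 48, [(20, 10, 20, 10)])
coded = sorted(k for k, v in modes.items() if v == "coded")
print(coded)                     # [(1, 0), (1, 1), (2, 0), (2, 1)]
print(len(modes) - len(coded))   # 8
```

In this example only 4 of the 12 macroblocks require intra coding or constrained motion estimation; the other 8 are signalled as zero motion vector blocks.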