Many applications, such as remote access software and screen recording software, often encode the contents of a computer screen in real-time. These applications typically represent the contents of a screen as compactly as possible because of bandwidth or storage constraints.
Software for encoding the contents of a computer screen (the encoder) is naturally complemented by software for decoding and displaying the encoded contents (the decoder) at a different location or later time. The encoder typically acquires the contents of a computer screen in one of two ways: either (1) output events, such as graphics function calls, are intercepted at the library or device driver level, or (2) the effects of output events, such as rendered lines or circles, are read back from the screen as images. In the first case, screen contents are typically encoded as a sequence of output events. In the second case, multiple output events are often encoded by a single image, and the screen contents are represented as a sequence of images.
For example, U.S. Pat. No. 5,241,625 discloses a system for remotely controlling information displayed on a computer screen by intercepting output events such as graphics calls. Graphics commands which drive a computer window system are captured and saved as a stored record or sent to other computers. A message translation program translates the captured messages for playback on a designated computer.
U.S. Pat. No. 5,796,566 discloses a system in which sequences of video screens forwarded from a host CPU to a video controller are stored and subsequently retrieved by a terminal located remote from the host CPU. In particular, display data is captured in a local frame buffer which stores the display data frame by frame. A previous frame or screen of display data is compared with a current frame or screen of display data to determine if a change has occurred. The change is then stored.
U.S. Pat. No. 6,331,855 discloses a system that compares, at a predetermined interval, a portion of the image that is currently displayed in a frame buffer to a corresponding portion of a previously displayed image that is stored in system memory to determine if the previously displayed image has changed. If so, the exact extent of the change is determined and it is stored and/or forwarded to a remote computer.
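The frame comparison described in the two preceding paragraphs can be illustrated with a minimal sketch. It is not taken from either cited patent. It assumes a frame is a list of rows of integer pixel values, and the function name `changed_region` is hypothetical. The sketch returns the smallest rectangle that covers every pixel that differs between the previous and the current frame.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # assumed representation: rows of pixel values


def changed_region(prev: Frame, curr: Frame) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right), inclusive, of the smallest rectangle
    covering all pixels that differ between prev and curr, or None if the
    frames are identical. Both frames are assumed to have the same size."""
    top = left = bottom = right = None
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            if p != c:
                if top is None:
                    # First differing pixel: rows are scanned top-down,
                    # so this row is the top edge of the rectangle.
                    top = bottom = y
                    left = right = x
                else:
                    bottom = y
                    left = min(left, x)
                    right = max(right, x)
    if top is None:
        return None
    return (top, left, bottom, right)
```

Only the pixels inside the returned rectangle would need to be stored or forwarded. A single bounding rectangle is a simplification chosen for brevity; two small changes in opposite corners of the screen would yield a rectangle covering nearly the whole frame.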
Intercepting output events and representing the contents of a screen in terms of these events often leads to reasonably sized representations. This is because such events are typically high-level and thus provide compact descriptions of changes to a screen. However, implementing this method is often not feasible because it is not easily ported to other platforms, requires administrative privileges (e.g., display driver access is often restricted), requires a reboot (e.g., to install a new device driver), and/or lowers the stability of the overall system (e.g., most remote control packages interfere with one another). On the other hand, representing screen contents by a sequence of images typically leads to very large representations. Large representations usually hinder the overall system performance (i.e., cause perceivable delays).
The size of a sequence of images can be substantially reduced by sophisticated data compression. A particularly space-efficient form of data compression is representing whole blocks of pixels by pointers to earlier occurrences of the same block on the screen as it has been encoded. For example, moving a window or scrolling its contents typically produces a sequence of images where each image contains a large block that occurs verbatim on the previous screen.
While encoding blocks that occur verbatim in previous screens by a pointer is highly space-efficient, doing so in a timely manner is computationally demanding because, in the general case, it requires an exhaustive search.
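To make the cost of the general case concrete, the following is a minimal sketch of an exhaustive search for a verbatim block. It assumes images are lists of rows of integer pixel values; the function name `find_verbatim_block` and its parameters are hypothetical and chosen for illustration only.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # assumed representation: rows of pixel values


def find_verbatim_block(reference: Image, current: Image,
                        top: int, left: int,
                        height: int, width: int) -> Optional[Tuple[int, int]]:
    """Search the reference image for an exact copy of the block of the
    current image whose upper-left corner is (top, left).

    Returns the (row, column) of the first verbatim occurrence in the
    reference image, which can serve as the pointer, or None if there is none.
    """
    block = [row[left:left + width] for row in current[top:top + height]]
    ref_height = len(reference)
    ref_width = len(reference[0]) if reference else 0
    # Try every candidate position in the reference image.
    for y in range(ref_height - height + 1):
        for x in range(ref_width - width + 1):
            if all(reference[y + dy][x:x + width] == block[dy]
                   for dy in range(height)):
                return (y, x)
    return None
```

For a reference image of H by W pixels and a block of h by w pixels, the worst case compares on the order of H·W·h·w pixels for a single block, and the block size itself is not known in advance. This is why a naive search is impractical for real-time encoding.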
A related problem is motion-compensated video signal coding where motion estimation is used to predict the current frame and to encode the difference between the current frame and its prediction. Typically, motion vectors are only determined and coded for a subset of pixels such as, for example, a sparse grid of pixels. Motion vectors for the remaining pixels are estimated from the first set of motion vectors by, for example, dividing the frame into blocks and assigning the same motion vector to all pixels in each block. For a video signal, a motion field can be interpolated without adverse effects because pixel levels within a local window are typically smooth.
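The block-wise assignment of motion vectors mentioned above can be sketched as follows. This is an illustration under simple assumptions, not a description of any particular codec: one motion vector per block is given as a grid of (dy, dx) tuples, and the hypothetical function `dense_motion_field` expands it so that every pixel receives the vector of the block that contains it.

```python
from typing import List, Tuple

Vector = Tuple[int, int]  # (dy, dx) displacement


def dense_motion_field(block_vectors: List[List[Vector]], block_size: int,
                       height: int, width: int) -> List[List[Vector]]:
    """Assign to each pixel the motion vector of the block containing it.

    block_vectors[i][j] is the vector of the block covering rows
    i*block_size .. (i+1)*block_size - 1 and the corresponding columns.
    The grid is assumed to be large enough to cover height by width pixels.
    """
    return [
        [block_vectors[y // block_size][x // block_size] for x in range(width)]
        for y in range(height)
    ]
```

Copying one vector to all pixels in a block is the simplest form of interpolation. It is adequate only when neighboring pixels move together, which is the smoothness assumption stated above for video signals.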
For example, U.S. Pat. No. 5,751,362 discloses an apparatus that 1) identifies regions of motion by comparing blocks in a previous frame and a current frame, 2) selects a first set of pixels, i.e., features, from the previous frame using a grid and/or edge detection, 3) determines a first set of motion vectors using, for example, a block matching algorithm (BMA), and 4) estimates motion vectors for all remaining pixels by computing affine transformations for non-overlapping polygons (e.g., triangles). The polygons are obtained by connecting feature points, which have been translated into the current frame, in a predetermined way.
In U.S. Pat. No. 5,751,362 the following BMA is employed. Given a block in the current frame, the BMA finds the best matching block in the previous frame according to a criterion such as, for example, the minimum mean square error. An exhaustive search is far too slow for on-the-fly encoding. The computational burden is substantially reduced by limiting the maximum displacement, iteratively evaluating only a subset of all candidate blocks, and, in each step, proceeding in the direction of a local optimum. These optimizations are based on two assumptions: blocks typically move by only a few pixels, and the distortion between the previous and the current frame is smooth across the search window. While this is typically the case for video signals, screen contents (e.g., graphic computer interfaces, display windows, etc.) are inherently different.
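The kind of reduced search described above can be illustrated with a generic greedy sketch. It is not the specific procedure of the cited patent. It assumes frames are lists of rows of integer pixel values and square blocks; the names `block_mse` and `greedy_block_match` are hypothetical. Starting from zero displacement, the sketch evaluates only the four neighboring displacements in each step and moves to the one with the lowest mean square error, stopping at a local optimum or at the maximum displacement.

```python
from typing import List, Tuple

Frame = List[List[int]]  # assumed representation: rows of pixel values


def block_mse(prev: Frame, curr: Frame, top: int, left: int,
              dy: int, dx: int, size: int) -> float:
    """Mean square error between the block of curr at (top, left) and the
    block of prev displaced by (dy, dx)."""
    total = 0
    for y in range(size):
        for x in range(size):
            diff = curr[top + y][left + x] - prev[top + dy + y][left + dx + x]
            total += diff * diff
    return total / (size * size)


def greedy_block_match(prev: Frame, curr: Frame, top: int, left: int,
                       size: int, max_disp: int) -> Tuple[Tuple[int, int], float]:
    """Return ((dy, dx), error) found by stepping toward a local optimum."""
    height, width = len(prev), len(prev[0])

    def valid(dy: int, dx: int) -> bool:
        # Displacement within the allowed range and block inside prev.
        return (abs(dy) <= max_disp and abs(dx) <= max_disp
                and 0 <= top + dy and top + dy + size <= height
                and 0 <= left + dx and left + dx + size <= width)

    best = (0, 0)
    best_err = block_mse(prev, curr, top, left, 0, 0, size)
    while True:
        center_y, center_x = best
        improved = False
        for ny, nx in ((center_y - 1, center_x), (center_y + 1, center_x),
                       (center_y, center_x - 1), (center_y, center_x + 1)):
            if valid(ny, nx):
                err = block_mse(prev, curr, top, left, ny, nx, size)
                if err < best_err:
                    best, best_err, improved = (ny, nx), err, True
        if not improved:
            return best, best_err
```

The search relies on the error decreasing steadily toward the true displacement. On a smooth gradient this holds, and the loop reaches the exact match in a few steps. On screen contents with sharp edges and large uniform areas, the error surface is generally not smooth, so a search of this kind may stop at a local optimum that is not the verbatim match, and a window that moved by hundreds of pixels lies outside the displacement limit altogether.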
Accordingly, there is a need for a method and apparatus that quickly detects variable-size blocks in an image that also occur verbatim in a reference image by exploiting the distinct characteristics of typical screen contents.