Screen capture tools let a computer user record an image displayed on a visual display unit such as a computer monitor. The user might use the captured screen area (alternatively called a screen area, screen image, screen shot, screen frame, screen region, capture area, capture image, capture shot, etc.) in a help manual or report to show the results displayed on the display unit at a particular time.
For some applications, a user captures a series of screen areas to show how screen content changes. The user might use the series of captured screen areas within an instructional video for job training or remote instruction. Changes in screen content can occur, for example, when windows or menus are opened, closed, moved, resized, or scrolled.
FIG. 1a is a captured screen area (100) of a computer desktop environment according to the prior art. The captured screen area (100) shows the entire desktop, but could instead show only the window (130) or some other portion of the desktop. A cursor graphic (140) overlays the window (130), and several icon graphics (120, 122, 124) overlay the background (110). FIG. 1b shows a captured screen area (101) following the captured screen area (100) of FIG. 1a in a series according to the prior art. Much of the screen content shown in FIGS. 1a and 1b is identical. Screen content such as the background (110) and icon graphics (120, 122, 124) usually does not change from frame to frame. On the other hand, the cursor graphic (140) often changes position and shape as the user manipulates a mouse or other input device, and the position and contents of the window (130) often change as a user moves or resizes the window, types, adds graphics, etc. FIG. 1b shows the cursor graphic (140) and the window (130) changing locations as the user drags the window (130) across the desktop, which in turn changes which portions of the background (110) are exposed.
Screen capture video and other forms of digital video consume large amounts of storage and transmission capacity. A typical screen capture video sequence may include 10 or more frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels). Each pixel is a tiny element of the picture. In raw form, a computer commonly represents a pixel with 24 bits. Thus, the number of bits per second, or bitrate, of a raw digital video sequence can be 5 million bits/second or more.
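The bitrate figure above follows from simple multiplication. The sketch below works one case; the 320x240 capture area and the helper name `raw_bitrate` are illustrative assumptions, not values taken from any particular capture tool.

```python
def raw_bitrate(width, height, frames_per_second, bits_per_pixel=24):
    """Return the uncompressed bitrate in bits per second."""
    return width * height * frames_per_second * bits_per_pixel


# Assumed example: a 320x240 capture area (76,800 pixels) at 10 frames/second.
example_bitrate = raw_bitrate(320, 240, 10)
print(example_bitrate)  # 18432000
```

Even this small capture area yields roughly 18 million bits/second in raw form, well above the 5 million bits/second mentioned above.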
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers often use compression (also called coding or encoding) to reduce the bitrate of digital video. Compression can be lossless, in which case the quality of the video does not suffer but the reduction in bitrate is limited by the complexity of the video. Alternatively, compression can be lossy, in which case the quality of the video suffers but the reduction in bitrate is more dramatic. Decompression reverses compression.
I. Compression Techniques for Camera Video
Numerous techniques have been developed for compressing conventional camera video. Such techniques include intraframe compression techniques (in which a frame is compressed as a still image) and interframe compression techniques (in which a frame is predicted or estimated from one or more other frames). Intraframe compression often involves frequency transformations on data followed by lossy and lossless compression. Interframe compression can include motion estimation.
Motion estimation is a process for estimating motion between frames. In one common technique, an encoder using motion estimation attempts to match a block of pixels in a current frame with a similar block of pixels in a search area in another frame (called the reference frame). When the encoder finds an exact or “close enough” match in the search area in the reference frame, the encoder parameterizes the change in position of the blocks as motion data (such as a motion vector).
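As a minimal sketch of the block matching described above, the following exhaustive search compares one block of the current frame against every candidate position in a search area of the reference frame. The function names, the use of sum of absolute differences as the matching criterion, and the small synthetic frames are assumptions made for illustration only.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
    return total


def estimate_motion(current, reference, bx, by, size, search_range):
    """Full search: return ((mx, my), cost) for the best match in the reference."""
    height, width = len(reference), len(reference[0])
    best = None
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = bx + mx, by + my
            # Skip candidate blocks that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(current, bx, by, reference, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, mx, my)
    cost, mx, my = best
    return (mx, my), cost


# Synthetic frames: the current frame is the reference shifted right 2, down 1.
reference = [[x + 10 * y for x in range(8)] for y in range(8)]
current = [[reference[y - 1][x - 2] if x >= 2 and y >= 1 else 0
            for x in range(8)] for y in range(8)]

vector, cost = estimate_motion(current, reference, bx=3, by=3, size=2, search_range=3)
print(vector, cost)  # (-2, -1) 0
```

The motion vector points from the block's position in the current frame back to the matching block in the reference frame, so content that moved right and down produces a negative vector here.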
Conversely, motion compensation is a process of reconstructing frames from reference frames using motion data. In one common technique, an encoder or decoder reconstructs a current frame by applying motion data for the current frame to a reference frame, creating a predicted frame. The encoder can compress the difference (sometimes called the residual) between the predicted frame and the original version of the current frame using the same techniques as used for intraframe compression (e.g., lossy and lossless compression). The overall bitrate of the camera video depends very much on the bitrate of the residuals, which can predominate in the overall bitrate compared to the bitrate for motion data. The bitrate of residuals is low if the residuals are simple (i.e., due to motion estimation that leads to exact or good matches according to some criteria), or if lossy compression drastically reduces the complexity of the residuals. On the other hand, the bitrate of complex residuals (i.e., those for which motion estimation fails to find good matches) can be higher, depending on the degree of lossy compression applied to reduce the complexity of the residuals.
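The relationship between the predicted block, the residual, and the reconstructed block can be sketched as follows. The helper names and the sample data are hypothetical, and the sketch omits the lossy and lossless coding of the residual that a real encoder would apply.

```python
def predict_block(reference, bx, by, size, motion_vector):
    """Fetch the block the motion vector points to in the reference frame."""
    mx, my = motion_vector
    return [[reference[by + my + dy][bx + mx + dx] for dx in range(size)]
            for dy in range(size)]


def residual_block(current, bx, by, size, predicted):
    """Difference between the original block and its prediction."""
    return [[current[by + dy][bx + dx] - predicted[dy][dx] for dx in range(size)]
            for dy in range(size)]


def reconstruct_block(predicted, residual):
    """Decoder side: add the residual back onto the prediction."""
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(predicted, residual)]


# Current frame: reference shifted right by 1, with one pixel brightened by 3.
reference = [[x + 10 * y for x in range(6)] for y in range(6)]
current = [[reference[y][x - 1] if x >= 1 else 0 for x in range(6)]
           for y in range(6)]
current[2][2] += 3

predicted = predict_block(reference, 2, 2, 2, (-1, 0))
residual = residual_block(current, 2, 2, 2, predicted)
rebuilt = reconstruct_block(predicted, residual)
print(residual)  # [[3, 0], [0, 0]]
print(rebuilt)   # [[24, 22], [31, 32]]
```

A good match leaves a residual that is mostly zeros, which is the "simple" residual case described above; a poor match would leave large values throughout the residual.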
The goal of motion estimation for camera video is usually to minimize the variance of the residual following motion estimation. Variance serves as an estimate of the complexity (and hence compressibility) of the residual for camera video, and minimizing the variance tends to find a match for which fewer bits are needed to code the residual at a given distortion level. An exact match rarely occurs for a block of pixels in camera video and is not essential, since the variance of the residual is approximately minimized at an approximate match. (An approximate match may result in a slightly more complex residual for the block than the best match would, but a slight increase in the complexity of the residual usually does not dramatically increase bitrate. Rather, lossy compression of the residual reduces the slightly increased bitrate without introducing an objectionable amount of distortion upon reconstruction of the block.) Accordingly, distortion measures such as mean absolute difference (an L1-type measure), mean squared error, sum of squared errors, or some other variation of the Euclidean (L2) norm are conventionally used in motion estimation for camera video.
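For reference, the distortion measures named above can be written out directly. This is a minimal sketch operating on flattened blocks of pixel values; the function names and the sample values are assumptions chosen for illustration.

```python
def mean_absolute_difference(block_a, block_b):
    return sum(abs(a - b) for a, b in zip(block_a, block_b)) / len(block_a)


def sum_of_squared_errors(block_a, block_b):
    return sum((a - b) ** 2 for a, b in zip(block_a, block_b))


def mean_squared_error(block_a, block_b):
    return sum_of_squared_errors(block_a, block_b) / len(block_a)


def residual_variance(block_a, block_b):
    """Variance of the residual (differences) around their own mean."""
    diffs = [a - b for a, b in zip(block_a, block_b)]
    mean = sum(diffs) / len(diffs)
    return sum((d - mean) ** 2 for d in diffs) / len(diffs)


# Assumed sample blocks; the differences are -2, 2, 0, -4.
original = [10, 20, 30, 40]
candidate = [12, 18, 30, 44]

print(mean_absolute_difference(original, candidate))  # 2.0
print(sum_of_squared_errors(original, candidate))     # 24
print(mean_squared_error(original, candidate))        # 6.0
print(residual_variance(original, candidate))         # 5.0
```

All four measures shrink smoothly as a candidate block approaches the original, which is why an approximate match is usually acceptable for camera video.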
For camera video, motion estimation may use a hierarchical search to find general motion data and then more precise motion data for a block of pixels. The measure being minimized (e.g., mean squared error or sum of absolute differences) to find a suitable match usually decreases monotonically on the approach to the suitable match. For example, a graph of the distortion measure often has a “bowl” shape: the best match has the minimum distortion (i.e., bottom of the bowl); for other matches, the distortions get worse as the matches get farther away from the best match (i.e., climbing the sides of the bowl). The hierarchical search thus improves search speed by finding good low-precision motion data, and then finding better, higher-precision motion data around it.
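One way to picture such a coarse-to-fine search is sketched below. This is only one possible reading of a hierarchical search (a shrinking step size around the current best candidate, rather than a multi-resolution image pyramid), and the bowl-shaped cost function with its minimum at (5, -3) is a stand-in assumption for a real block distortion measure.

```python
def hierarchical_search(cost, start=(0, 0), initial_step=4):
    """Coarse-to-fine search: test 9 candidates per level, halving the step."""
    best = start
    best_cost = cost(*best)
    step = initial_step
    while step >= 1:
        center = best
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                candidate = (center[0] + dx, center[1] + dy)
                candidate_cost = cost(*candidate)
                # Move only on a strict improvement so ties keep the current best.
                if candidate_cost < best_cost:
                    best, best_cost = candidate, candidate_cost
        step //= 2
    return best, best_cost


def bowl(mx, my):
    """Assumed bowl-shaped distortion surface with its minimum at (5, -3)."""
    return (mx - 5) ** 2 + (my + 3) ** 2


best_vector, best_distortion = hierarchical_search(bowl)
print(best_vector, best_distortion)  # (5, -3) 0
```

With steps of 4, 2, and 1, the search covers displacements of up to 7 in each direction using 27 candidate evaluations plus the starting point, whereas an exhaustive search over the same range would test 225 positions. The shortcut is safe only because the distortion surface slopes consistently toward the best match.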
II. Compression Techniques for Screen Capture Video
Some encoding tools allow coding of screen capture video with any of multiple encoders on a system. The multiple encoders can include, for example, a screen capture encoder that uses lossless compression and conventional video encoders that use a combination of lossy and lossless compression.
Screen capture images often contain a mixture of continuous tone content and palettized content. Continuous tone content includes, for example, photographs or other images with gradually varying colors or tones, and typically uses a range of image tones that appears substantially continuous to the human eye. While it is desirable to encode continuous tone content using only lossless compression if sufficient resources are available, lossy compression can be used (i.e., with a conventional video encoder) to effectively compress continuous tone content at a lower bitrate. The lossy compression, however, can introduce unacceptable distortion in palettized content.
Palettized content includes, for example, icons, toolbars, and command or notepad windows consisting of a flat color background and foreground text of a contrasting color. A color palette typically includes a relatively small set of image colors or tones (e.g., 256 different 24-bit colors). Palettized content often includes areas of perceptually important fine detail—spatially localized, high frequency variations depicting text elements or other image discontinuities. Applying lossy compression to palettized content can result in the loss of perceptually important fine detail. For example, text and sharp edges may be blurred or distorted in the decompressed content. As a result, lossless encoding of palettized content is preferred in many circumstances. Because screen content often includes palettized content, most prior art screen capture encoders use lossless compression to compress screen capture video.
One prior art screen capture encoder uses a lossless encoding algorithm with a pixel map when coding a current frame of screen capture content with interframe compression. The encoder compares pixels at locations (e.g., x, y coordinates) in the current frame with corresponding pixels at the same locations in the previous frame. The pixel map indicates locations at which pixels in the current frame have changed in value and locations at which pixels have not changed in value. For the pixels in the current frame that have not changed in value, the values from the previous frame are used. The encoder then codes the changed pixels (called the intra pixels). In such cases, the number of intra pixels in the current frame is often a good indication of the number of bits needed to code the current frame because coding other data for the current frame (e.g., the map) usually consumes relatively few bits.
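The pixel map approach can be illustrated with a short sketch. The function names and the tiny sample frames are hypothetical, and the sketch stops at producing the map and the list of intra pixels; it does not perform the entropy coding a real encoder would apply to them.

```python
def encode_frame(previous, current):
    """Return (pixel_map, intra_pixels) for the current frame.

    pixel_map[y][x] is True where the pixel changed since the previous frame.
    intra_pixels lists the changed pixel values in raster order.
    """
    pixel_map = []
    intra_pixels = []
    for prev_row, cur_row in zip(previous, current):
        map_row = []
        for prev_value, cur_value in zip(prev_row, cur_row):
            changed = prev_value != cur_value
            map_row.append(changed)
            if changed:
                intra_pixels.append(cur_value)
        pixel_map.append(map_row)
    return pixel_map, intra_pixels


def decode_frame(previous, pixel_map, intra_pixels):
    """Rebuild the current frame from the previous frame, map, and intra pixels."""
    values = iter(intra_pixels)
    return [[next(values) if changed else prev_value
             for prev_value, changed in zip(prev_row, map_row)]
            for prev_row, map_row in zip(previous, pixel_map)]


previous = [[5, 5, 5, 5],
            [5, 9, 9, 5],
            [5, 5, 5, 5]]
current = [[5, 5, 5, 5],
           [5, 9, 7, 5],
           [5, 5, 3, 5]]

pixel_map, intra_pixels = encode_frame(previous, current)
print(intra_pixels)                                            # [7, 3]
print(decode_frame(previous, pixel_map, intra_pixels) == current)  # True
```

Because the decoder recovers every pixel exactly, the scheme is lossless, and the length of `intra_pixels` tracks the coding cost of the frame.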
This interframe compression is efficient when screen content is fairly static since the number of intra pixels is zero or small. On the other hand, this interframe compression can be inefficient when the number of intra pixels is large, and in screen capture video, even small on-screen movements can change the values of large numbers of pixels.
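To make the last point concrete, the sketch below counts intra pixels when a small window moves one pixel to the right. The frame size, window size, and the two fill patterns are assumptions; the checkerboard is a deliberately unfavorable stand-in for fine detail such as text.

```python
def draw_window(width, height, left, top, size, pattern):
    """Draw a size x size window on a zero-valued background."""
    frame = [[0] * width for _ in range(height)]
    for wy in range(size):
        for wx in range(size):
            frame[top + wy][left + wx] = pattern(wx, wy)
    return frame


def count_intra_pixels(previous, current):
    """Number of pixel locations whose value differs between two frames."""
    return sum(p != c
               for prev_row, cur_row in zip(previous, current)
               for p, c in zip(prev_row, cur_row))


def flat(wx, wy):
    return 1


def checkerboard(wx, wy):
    return 1 if (wx + wy) % 2 == 0 else 2


def intra_pixels_after_shift(pattern):
    """Move a 4x4 window one pixel to the right in an 8x8 frame."""
    before = draw_window(8, 8, left=1, top=2, size=4, pattern=pattern)
    after = draw_window(8, 8, left=2, top=2, size=4, pattern=pattern)
    return count_intra_pixels(before, after)


print(intra_pixels_after_shift(flat))          # 8
print(intra_pixels_after_shift(checkerboard))  # 20
```

For the flat window, only the leading and trailing edges change. For the detailed window, a one-pixel move changes 20 pixels, more than the 16 pixels the window itself occupies, even though no new content appeared on screen.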