The concept of recording and using timing information is fundamental to the needs of multimedia applications. Pictures, video, text, graphics, and sound need to be recorded with some understanding of the time associated with each sample of the media stream. This is useful for synchronizing different multimedia streams with each other, for carrying information to preserve the original timing of the media when playing a media stream, for identifying specific locations within a media stream, and for recording the time associated with the media samples to create a scientific or historical record. For example, if audio and video are recorded together but handled as separate streams of media data, then timing information is necessary to coordinate the synchronization of these two (or more) streams.
Typically, a media stream (such as a recorded audio track or recorded video or film shot) is represented as a sequence of media samples, each of which is associated (implicitly or explicitly) with timing information. A good example of this is video and motion picture film recording, which is typically created as a sequence of pictures, or frames, each of which represents the camera view for a particular short interval of time (e.g., 1/24 second for each frame of motion picture film). When this sequence of pictures is played back at the same number of frames per second (known as the “frame rate”) as used in the recording process, an illusion of natural movement of the objects depicted in the scene can be created for the viewer.
Similarly, sound is often recorded by regularly sampling an audio waveform to create a sequence of digital samples (for example, using 48,000 samples per second) and grouping sets of these samples into processing units called frames (e.g., 64 samples per frame) for further processing such as digital compression encoding or packet-network transmission (such as Internet transmission). A receiver of the audio data will then reassemble the frames of audio that it has received, decode them, and convert the resulting sequence of digital samples back into sound using electro-acoustic technology.
FIG. 1 illustrates a conventional system 100 for processing and distributing video content. The video content is captured using a video camera 102 (or any other video capture device) that transfers the captured video content onto video tape or another storage medium. Later, the captured video content may be edited using a video editor 104. A video encoder 106 encodes the video content to reduce the storage space required for the video content or to reduce the transmission bandwidth required to transmit the video content. Various encoding techniques may be used to compress the video content, such as the MPEG-2 (Moving Picture Experts Group 2nd generation) compression format.
The encoded video content is provided to a transmitter 108, which transmits the encoded video content to one or more receivers 110 across a communication link 112. Communication link 112 may be, for example, a physical cable, a satellite link, a terrestrial broadcast, an Internet connection, a physical medium (such as a digital versatile disc (DVD)) or a combination thereof. A video decoder 114 decodes the signal received by receiver 110 using an appropriate decoding technique. The decoded video content is then displayed on a video display 116, such as a television or a computer monitor. Receiver 110 may be a separate component (such as a set top box) or may be integrated into video display 116. Similarly, video decoder 114 may be a separate component or may be integrated into the receiver 110 or the video display 116.
Proper recording and control of timing information is needed to coordinate multiple streams of media samples, such as for synchronizing video and associated audio content. Even the use of media which does not exhibit a natural progression of samples through time will often require the use of timing information in a multimedia system. For example, if a stationary picture (such as a photograph, painting, or document) is to be displayed along with some audio (such as an explanatory description of the content or history of the picture), then the timing of the display of the stationary picture (an entity which consists of only one frame or sample in time) may need to be coordinated with the timing of the associated audio track.
Other examples of the usefulness of such timing information include being able to record the date or time of day at which a photograph was taken, or being able to specify editing or viewing points within media streams (e.g., five minutes after the camera started rolling).
In each of the above cases, a sample or group of samples in time of a media stream can be identified as a frame, or fundamental processing unit. If a frame consists of more than one sample in time, then a convention can be established in which the timing information represented for a frame corresponds to the time of some reference point in the frame such as the time of the first, last or middle sample.
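By way of illustration only, the reference-point convention described above can be sketched as follows. The function name, its parameters, and the choice of exact rational arithmetic are assumptions made for this sketch and are not part of any standard; the 64-sample frame and 48,000 samples per second are the example values used earlier.

```python
from fractions import Fraction


def frame_reference_time(first_sample_index, samples_per_frame, sample_rate,
                         convention="first"):
    """Return the time, in seconds, of a frame's reference point.

    first_sample_index is the index of the frame's first sample, counted
    from the start of the stream (sample 0 is taken to be at time 0).
    """
    # Offset of the reference point from the first sample, in sample periods.
    offsets = {
        "first": Fraction(0),
        "middle": Fraction(samples_per_frame - 1, 2),
        "last": Fraction(samples_per_frame - 1),
    }
    return (first_sample_index + offsets[convention]) / sample_rate


# Second frame of a 48,000 samples-per-second stream with 64 samples per frame.
t_first = frame_reference_time(64, 64, 48_000, "first")
t_middle = frame_reference_time(64, 64, 48_000, "middle")
t_last = frame_reference_time(64, 64, 48_000, "last")
```

Whichever convention is chosen, it must be applied consistently so that timestamps from different frames remain comparable.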
In some cases, a frame can be further subdivided into even smaller processing units, which can be called fields. One example of this is in the use of interlaced-scan video, in which the sampling of alternating lines in a picture is separated so that half of the lines of each picture are sampled as one field at one instant in time, and the other half of the lines of the picture are then sampled as a second field a short time later. For example, lines 1, 3, 5, etc. may be sampled as one field of the picture, and then lines 0, 2, 4, etc. of the picture may be sampled as the second field a short time later (for example, 1/50th of a second later). In such interlaced-scan video, each frame can typically be separated into two fields.
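The separation of an interlaced frame into two fields might be sketched as below. The representation of a frame as a simple list of lines, and the function names, are hypothetical and chosen only for this illustration; the field order follows the example above (odd-numbered lines first).

```python
def split_fields(frame_lines):
    """Split a frame (a list of lines) into the two fields of the example."""
    first_field = frame_lines[1::2]    # lines 1, 3, 5, ...
    second_field = frame_lines[0::2]   # lines 0, 2, 4, ...
    return first_field, second_field


def weave_fields(first_field, second_field):
    """Rebuild the full frame by interleaving the two fields."""
    frame = [None] * (len(first_field) + len(second_field))
    frame[1::2] = first_field
    frame[0::2] = second_field
    return frame


lines = [f"line {n}" for n in range(6)]
first, second = split_fields(lines)
```

Because the two fields are sampled at different instants, each field may need its own timing information even though both belong to the same frame.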
Similarly, one could view a grouping of 64 samples of an audio waveform for purposes of data compression or packet-network transmission to be a frame, and each group of eight samples within that frame to be a field. In this example, there would be eight fields in each frame, each containing eight samples.
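A minimal sketch of this audio grouping is given below, assuming the example values from the text (48,000 samples per second, 64 samples per frame, eight samples per field). The helper name is hypothetical.

```python
from fractions import Fraction

SAMPLE_RATE = 48_000       # samples per second (example value from the text)
SAMPLES_PER_FRAME = 64     # example frame size from the text
SAMPLES_PER_FIELD = 8      # example field size from the text


def split_into_units(samples, unit_size):
    """Group a flat sample sequence into consecutive units of unit_size."""
    return [samples[i:i + unit_size] for i in range(0, len(samples), unit_size)]


samples = list(range(128))   # two frames' worth of placeholder samples
frames = split_into_units(samples, SAMPLES_PER_FRAME)
fields = [split_into_units(frame, SAMPLES_PER_FIELD) for frame in frames]

# Exact duration of one frame and one field, in seconds.
frame_duration = Fraction(SAMPLES_PER_FRAME, SAMPLE_RATE)
field_duration = Fraction(SAMPLES_PER_FIELD, SAMPLE_RATE)
```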
In some methods of using sampled media streams that are well known in the art, frames or fields may consist of overlapping sets of samples or transformations of overlapping sets of samples. Two examples of this behavior are the use of lapped orthogonal transforms [1) Henrique Sarmento Malvar, Signal Processing with Lapped Transforms, Boston, Mass., Artech House, 1992; 2) H. S. Malvar and D. H. Staelin, “The LOT: transform coding without blocking effects,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 553–559, April 1989; 3) H. S. Malvar, Method and system for adapting a digitized signal processing system for block processing with minimal blocking artifacts, U.S. Pat. No. 4,754,492, June 1988.] and audio redundancy coding [1) J. C. Bolot, H. Crepin, A. Vega-Garcia: “Analysis of Audio Packet Loss in the Internet”, Proceedings of the 5th International Workshop on Network and Operating System Support for Digital Audio and Video, pp. 163–174, Durham, April 1995; 2) C. Perkins, I. Kouvelas, O. Hodson, V. Hardman, M. Handley, J. C. Bolot, A. Vega-Garcia, S. Fosse-Parisis: “RTP Payload for Redundant Audio Data”, Internet Engineering Task Force Request for Comments RFC2198, 1997.]. Even in such cases it is still possible to establish a convention by which a time is associated with a frame or field of samples.
In some cases, the sampling pattern will be very regular in time, such as in typical audio processing in which all samples are created at rigidly-stepped times controlled by a precise clock signal. In other cases, however, the time between adjacent samples in a sequence may differ from location to location in the sequence.
One example of such behavior is when sending audio over a packet network with packet losses, which may result in some frames never reaching the decoder while the frames that are received should still be played with their original relative timing. Another example of such behavior is in low-bit-rate videoconferencing, in which the number of frames sent per second is often varied depending on the amount of motion in the scene (since small changes take less data to send than large changes, and the overall channel data rate in bits per second is normally fixed).
If the underlying sampling structure is such that there is understood to be a basic frame or field processing unit sampling rate (although some processing units may be skipped), then it is useful to be able to identify a processing unit as a distinct counting unit in the time representation. If this is incorporated into the design, the occurrence of a skipped processing unit may be recognized by a missing value of the counting unit (e.g., if the processing unit count proceeds as 1, 2, 3, 4, 6, 7, 8, 9, . . . , then it is apparent that count number 5 is missing).
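As a simple illustration of recognizing skipped processing units from the counting unit, consider the following sketch. The function name is hypothetical, and the counts are assumed to arrive in increasing order.

```python
def find_skipped_units(counts):
    """Return the counting-unit values missing between consecutive counts."""
    skipped = []
    for previous, current in zip(counts, counts[1:]):
        # Any integers strictly between two received counts were skipped.
        skipped.extend(range(previous + 1, current))
    return skipped


received = [1, 2, 3, 4, 6, 7, 8, 9]
missing = find_skipped_units(received)
```

With the sequence from the example above, the only value reported as missing is count number 5.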
If the underlying sampling structure is such that the sampling is so irregular that there is no basic processing unit sampling rate, then what is needed is simply a good representation of true time for each processing unit. Normally, however, in such a case there should at least be a common time clock against which the location of the processing unit can be referenced.
In either case (with regular or irregular sampling times), it is useful for a multimedia system to record and use timing information for the samples or frames or fields of each processing unit of the media content.
Different types of media may require different sampling rates. If timing information is always stored with the same precision, a certain amount of rounding error may be introduced by the method used for representing time. It is desirable for the recorded time associated with each sample to be represented precisely in the system with little or no such rounding error. For example, if a media stream operates at 30,000/1001 frames per second (the typical frame rate of North American standard NTSC broadcast video, approximately 29.97 frames per second) and the precision of the time values used in the system is 10^-6 seconds (one microsecond), then although the time values may be very precise in human terms, it may appear to processing elements within the system that the precisely-regular sample timing (e.g., 1001/30,000 seconds per sample) is not precisely regular (e.g., 33,366 clock increment counts between samples, followed by 33,367 increments, then 33,367 increments, and then 33,366 increments again). This can cause difficulties in determining how to properly handle the media samples in the system.
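The irregular increments in this example can be reproduced with a short calculation. The sketch below assumes that each exact sample time is truncated to a whole number of microsecond clock increments; truncation (rather than rounding to nearest) is an assumption chosen because it yields the same pattern as the example, and the names are illustrative only.

```python
FRAME_PERIOD_NUM = 1001        # exact frame period is 1001/30,000 seconds
FRAME_PERIOD_DEN = 30_000
TICKS_PER_SECOND = 1_000_000   # assumed clock resolution of one microsecond


def tick_increments(num_frames):
    """Clock increments between successive samples after truncation."""
    # Exact time of sample n is n * 1001 / 30,000 seconds; integer floor
    # division converts it to whole clock increments without float error.
    ticks = [n * FRAME_PERIOD_NUM * TICKS_PER_SECOND // FRAME_PERIOD_DEN
             for n in range(num_frames + 1)]
    return [later - earlier for earlier, later in zip(ticks, ticks[1:])]


increments = tick_increments(4)
```

Every three frames span exactly 100,100 increments, so the pattern repeats with a period of three frames even though no single increment equals the true frame period.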
Another problem in finding a method to represent time is that the representation may “drift” with respect to true time as would be measured by a perfectly ideal “wall clock”. For example, if the system uses a precisely-regular sample timing of 1001/30,000 seconds per sample and all samples are represented with incremental time intervals of 33,367 increments (of 10^-6 seconds each) between samples, the overall time used for a long sequence of such samples will be somewhat longer than the true time interval: a total of about one frame time per hour, accumulating to more than five minutes of error after a year of duration.
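The accumulated drift in this example can be checked with exact rational arithmetic. The following is a minimal sketch under the assumptions stated in the example (a true period of 1001/30,000 seconds and a stored period of 33,367 microseconds); the function name is hypothetical.

```python
from fractions import Fraction

TRUE_PERIOD = Fraction(1001, 30_000)         # seconds per sample, exact
STORED_PERIOD = Fraction(33_367, 1_000_000)  # seconds per sample as recorded


def drift_seconds(elapsed_true_seconds):
    """Recorded time minus true time after the given true duration."""
    samples = Fraction(elapsed_true_seconds) / TRUE_PERIOD
    return samples * (STORED_PERIOD - TRUE_PERIOD)


SECONDS_PER_YEAR = 365 * 24 * 60 * 60
drift_after_year = drift_seconds(SECONDS_PER_YEAR)
drift_after_year_minutes = float(drift_after_year) / 60
```

Each sample contributes an error of only one third of a microsecond, but the error always has the same sign, so over a year it accumulates to more than five minutes, consistent with the figure given above.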
Thus, “drift” is defined as any error in a timecode representation of sampling times that would (if uncorrected) tend to increase in magnitude as the sequence of samples progresses.
One example of a method of representing timing information is found in the SMPTE 12M design [Society of Motion Picture and Television Engineers, Recommended Practice 12M: 1999] (hereinafter called “SMPTE timecode”). SMPTE timecodes are typically used for television video data with timing specified in the United States by the National Television System Committee (NTSC) television transmission format, or in Europe, by the Phase Alternating Line (PAL) television transmission format.
SMPTE timecode is a synchronization signaling method originally developed for use in the television and motion picture industry to deal with video tape technology. The challenge originally faced with videotape was that there was no “frame accurate” way to synchronize devices for video or sound-track editing. A number of methods were employed in the early days, but because of the inherent slippage and stretching properties of tape, frame accurate synchronization met with limited success. The introduction of SMPTE timecode provided this frame accuracy and incorporated additional functionality. Additional sources on SMPTE include “The Time Code Handbook” by Cipher Digital Inc. which provides a complete treatment of the subject, as well as an appendix containing ANSI Standard SMPTE 12M-1986. Additionally, a text entitled “The Sound Reinforcement Handbook” by Gary Davis and Ralph Jones for Yamaha contains a section on timecode theory and applications.
The chief purpose of SMPTE timecode is to synchronize various pieces of equipment. The timecode signal is formatted to provide a system wide clock that is referenced by everything else. The signal is usually encoded directly with the video signal or is distributed via standard audio equipment. Although SMPTE timecode uses many references from video terminology, it may also be used for audio-only applications.
In many applications, a timecode source provides the signal while the rest of the devices in the system synchronize to it and follow along. The source can be a dedicated timecode generator, or it can be (and often is) a piece of the production equipment that provides timecode in addition to its primary function. An example of this is a multi-track audio tape deck that provides timecode on one track and sound for the production on other tracks. Video tape often makes similar use of a cue track or one of its audio sound tracks to record and play back timecode.
In other applications, namely video, the equipment uses timecode internally to synchronize multiple timecode sources into one. An example would be a video editor that synchronizes with timecode from a number of prerecorded scenes. As each scene is combined with the others to make the final product, their respective timecodes are synchronized with new timecode being recorded to the final product.
SMPTE timecode provides a unique address for each frame of a video signal. This address is an eight digit number, based on the 24 hour clock and the video frame rate, representing Hours, Minutes, Seconds and Frames in the following format: HH:MM:SS:FF
The values of these fields range from 00 to 23 for HH, 00 to 59 for MM, 00 to 59 for SS, and 00 to 24 or 29 for FF (where 24 is the maximum for PAL 25 frame per second video and 29 is the maximum for NTSC 30,000/1001 frame per second video). By convention, the first frame of a day is considered to be marked as 00:00:00:01 and the last is 00:00:00:00 (one frame past the frame marked 23:59:59:24 for PAL and 23:59:59:29 for NTSC). This format represents a nominal clock time or the nominal duration of scene or program material, and makes approximate time calculations easy and direct.
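The HH:MM:SS:FF address format can be illustrated with a small conversion sketch. It covers only straightforward (non-drop) counting with a nominal integer number of frames per second (25 for PAL, 30 for NTSC), maps a frame count of zero to the label 00:00:00:00, and leaves aside the question of which label is treated as the first frame of a day. The function names are hypothetical.

```python
def frames_to_timecode(frame_count, nominal_fps):
    """Format a frame count as an HH:MM:SS:FF label, wrapping at 24 hours."""
    frames_per_day = 24 * 60 * 60 * nominal_fps
    frame_count %= frames_per_day
    total_seconds, ff = divmod(frame_count, nominal_fps)
    total_minutes, ss = divmod(total_seconds, 60)
    hh, mm = divmod(total_minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


def timecode_to_frames(timecode, nominal_fps):
    """Parse an HH:MM:SS:FF label back into a frame count."""
    hh, mm, ss, ff = (int(part) for part in timecode.split(":"))
    return ((hh * 60 + mm) * 60 + ss) * nominal_fps + ff


last_pal_label = frames_to_timecode(24 * 60 * 60 * 25 - 1, 25)
last_ntsc_label = frames_to_timecode(24 * 60 * 60 * 30 - 1, 30)
```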
The frame is the smallest unit of measure within SMPTE timecode and is a direct reference to the individual “picture” of film or video. The frame rate is the number of times per second that pictures are displayed to provide a rendition of motion. There are two standard frame rates (frames/sec) that typically use SMPTE timecode: 25 frames per second and 30,000/1001 frames per second (approximately 29.97 frames per second). The 25 frame per second rate is based on European video, also known as SMPTE EBU (PAL/SECAM color and b&w). The 30,000/1001 frame per second rate (sometimes loosely referred to as 30 frame per second) is based on U.S. NTSC color video broadcasting. Within the 29.97 frame per second use, there are two commonly used methods of applying SMPTE timecode: “Non-Drop” and “Drop Frame”.
A frame counter advances one count for every frame of film or video, allowing the user to time events down to 1/25th, or 1001/30,000th of a second.
SMPTE timecode is also sometimes used for a frame rate of exactly 30 frames per second. However, the user must take care to distinguish this use from the slightly slower 30,000/1001 frames per second rate of U.S. NTSC color broadcast video. (The adjustment factor of 1000/1001 originates from the method by which television signals were adjusted to provide compatibility between modern color video and the previous design for broadcast of monochrome video at 30 frames per second.)
Thus, the SMPTE timecode consists of the recording of an integer number for each of the following parameters for a video picture: Hours, Minutes, Seconds, and Frames. Each increment of the frame counter is understood to represent an increment of time of 1001/30,000 seconds in the NTSC system and 1/25 seconds in the PAL system.
However, since the number of frames per second in the NTSC system (30,000/1001) is not an integer, there is a problem of drift between the SMPTE 12M timecode representation of time and true “wall clock” time. This drift can be greatly reduced by a special frame counting method known as SMPTE “drop frame” counting. Without SMPTE drop frame counting, the drift between the SMPTE timecode's values of Hours, Minutes, and Seconds and the value measured by a true “wall clock” will accumulate more than 86 seconds of error per day. When using SMPTE drop frame counting, the drift accumulation magnitude can be reduced by a factor of about 1,000 (although the drift is still not entirely eliminated, and the remaining drift is still more than two frame sampling periods per day).
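The drift figures above can be checked with a short calculation. The passage does not state the drop frame counting rule itself; the sketch below assumes the commonly described rule in which two frame numbers (00 and 01) are skipped at the start of every minute except each tenth minute. That rule, and all names in the code, are assumptions of this sketch. Drift is measured here as the true duration of one full 24-hour cycle of timecode labels minus 86,400 seconds.

```python
from fractions import Fraction

FRAME_PERIOD = Fraction(1001, 30_000)   # true seconds per NTSC frame
NOMINAL_FPS = 30                        # frame labels per timecode second
SECONDS_PER_DAY = 24 * 60 * 60


def labels_skipped_per_day():
    """Frame numbers skipped in 24 hours under the assumed drop frame rule."""
    minutes_per_day = 24 * 60
    minutes_without_skip = 24 * 6       # minutes 00, 10, 20, 30, 40 and 50
    return 2 * (minutes_per_day - minutes_without_skip)


def drift_per_timecode_day(drop_frame):
    """True duration of a full timecode day minus 86,400 seconds."""
    labels = SECONDS_PER_DAY * NOMINAL_FPS
    if drop_frame:
        labels -= labels_skipped_per_day()
    # Every label that is actually used corresponds to one real frame.
    return labels * FRAME_PERIOD - SECONDS_PER_DAY


non_drop_drift = drift_per_timecode_day(drop_frame=False)
drop_frame_drift = drift_per_timecode_day(drop_frame=True)
residual_in_frames = abs(drop_frame_drift) / FRAME_PERIOD
```

Under these assumptions the non-drop drift exceeds 86 seconds per day, drop frame counting reduces its magnitude by a factor of 1,000, and the residual remains larger than two frame periods, with the opposite sign.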
The SMPTE timecode has been widely used in the video production industry (for example, it is incorporated into the design of many video tape recorders). It is therefore very useful if any general media timecode design is maximally compatible with this SMPTE timecode. If such compatibility can be achieved, this will enable equipment designed for the media timecode to work well with other equipment designed specifically to use the SMPTE timecode.
Within this document, the following terminology is used. A timecode describes the data used for representing the time associated with a media sample, frame, or field. It is useful to separate the data of a timecode into two distinct types: the timebase and the timestamp. The timestamp includes the information that is used to represent the timing for a specific processing unit (a sample, frame, or field). The timebase contains the information that establishes the basis of the measurements units used in the timestamp. In other words, the timebase is the information necessary to properly interpret the timestamps. The timebase for a media stream normally remains the same for the entire sequence of samples, or at least for a very large set of samples.
For example, we may interpret the SMPTE timecode as having a timebase that consists of: (a) knowledge of (or an indication of) whether the system is NTSC or PAL, and (b) knowledge of (or an indication of) whether or not the system uses SMPTE “drop frame” counting in order to partially compensate for drift.
Given this, the timestamps then consist of the representations of the parameters Hours, Minutes, Seconds, and Frames for each particular video frame.
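One possible way to model this separation of timebase and timestamp in software is sketched below. The class and attribute names are hypothetical and are used only to make the distinction concrete; they do not correspond to any standardized data format.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Timebase:
    """Information needed to interpret timestamps; fixed for a stream."""
    system: str        # "NTSC" or "PAL"
    drop_frame: bool   # whether SMPTE drop frame counting is in use

    @property
    def frame_period(self):
        """True seconds represented by one increment of the frame counter."""
        return Fraction(1001, 30_000) if self.system == "NTSC" else Fraction(1, 25)

    @property
    def max_frame_number(self):
        """Largest value the Frames parameter may take."""
        return 29 if self.system == "NTSC" else 24


@dataclass(frozen=True)
class Timestamp:
    """Per-frame timing parameters."""
    hours: int
    minutes: int
    seconds: int
    frames: int

    def is_valid_for(self, timebase):
        return (0 <= self.hours <= 23 and 0 <= self.minutes <= 59
                and 0 <= self.seconds <= 59
                and 0 <= self.frames <= timebase.max_frame_number)


pal = Timebase(system="PAL", drop_frame=False)
ntsc = Timebase(system="NTSC", drop_frame=True)
stamp = Timestamp(hours=1, minutes=2, seconds=3, frames=27)
```

The same timestamp can be valid under one timebase and invalid under another, which is why the timebase must be known before a timestamp can be interpreted.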
Many existing systems transmit all parameters of the timestamp with each frame. Since many of the parameters (e.g., hours and minutes) do not typically change from one frame to the next, transmitting all parameters of the timestamp with each frame sends a significant amount of redundant data, more than is necessary to communicate the current timing information.
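The extent of this redundancy can be estimated with a simple count. The sketch below generates the timestamps for one minute of 25 frame per second video and measures what fraction of the transmitted parameters are unchanged from the previous frame. It is only a measurement of redundancy under these assumptions, not a description of any particular transmission scheme, and the function names are hypothetical.

```python
FIELD_NAMES = ("hours", "minutes", "seconds", "frames")


def timestamps(num_frames, fps):
    """Yield (hours, minutes, seconds, frames) for consecutive frames."""
    for n in range(num_frames):
        total_seconds, ff = divmod(n, fps)
        total_minutes, ss = divmod(total_seconds, 60)
        hh, mm = divmod(total_minutes, 60)
        yield (hh, mm, ss, ff)


def changed_fields(previous, current):
    """Names of the parameters that differ between two timestamps."""
    return [name for name, old, new in zip(FIELD_NAMES, previous, current)
            if old != new]


def redundant_fraction(num_frames, fps):
    """Fraction of parameters that repeat the previous frame's value."""
    stamps = list(timestamps(num_frames, fps))
    pairs = list(zip(stamps, stamps[1:]))
    changed = sum(len(changed_fields(a, b)) for a, b in pairs)
    return 1 - changed / (len(FIELD_NAMES) * len(pairs))


one_minute_pal = redundant_fraction(60 * 25, 25)
```

In this example, roughly three quarters of the parameter values sent over one minute simply repeat the value sent with the preceding frame.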
The systems and methods described herein provide for the communication of timing indicators that convey timing information using a reduced amount of data.