The concept of recording and using timing information is fundamental to the needs of multimedia applications. Pictures, video, text, graphics, and sound need to be recorded with some understanding of the time associated with each sample of the media stream. This timing information is useful for synchronizing different multimedia streams with each other, for preserving the original timing of the media when a media stream is played for a user, for identifying specific locations within a media stream, and for recording the time associated with the media samples in order to create a scientific or historical record. For example, if audio and video are recorded together but handled as separate streams of media data, then timing information is necessary for coordinating the synchronization of these two (or more) streams.
Typically, a media stream (such as a recorded audio track or recorded video or film shot) is represented as a sequence of media samples, each of which is associated (implicitly or explicitly) with timing information. A good example of this is video and motion picture film recording, which is typically created as a sequence of pictures, or frames, each of which represents the camera view for a particular short interval of time (typically 1/24 of a second for each frame of motion picture film). When this sequence of pictures is played back at the same number of frames per second (known as the frame rate) as used in the recording process, an illusion of natural movement of the objects depicted in the scene can be created for the viewer.
Similarly, sound is often recorded by regularly sampling an audio waveform to create a sequence of digital samples (for example, using 48,000 samples per second) and grouping sets of these samples into processing units called frames (e.g., 64 samples per frame) for further processing such as digital compression encoding or packet-network transmission (such as Internet transmission). A receiver of the audio data will then reassemble the frames of audio that it has received, decode them, and convert the resulting sequence of digital samples back into sound using electroacoustic technology.
Proper recording and control of timing information is required for coordinating multiple streams of media samples, such as for synchronizing video and associated audio content. Even the use of media which does not exhibit a natural progression of samples through time will often require the use of timing information in a multimedia system. For example, if a stationary picture (such as a photograph, painting, or document) is to be displayed along with some audio (such as an explanatory description of the content or history of the picture), then the timing of the display of the stationary picture (an entity which consists of only one frame or sample in time) may need to be coordinated with the timing of the associated audio track.
Other examples of the usefulness of such timing information include being able to record the date or time of day at which a photograph was taken, or being able to specify editing or viewing points within media streams (e.g., five minutes after the camera started rolling).
In each of the above cases, a sample or group of samples in time of a media stream can be identified as a frame, or fundamental processing unit. If a frame consists of more than one sample in time, then a convention can be established in which the timing information represented for a frame corresponds to the time of some reference point in the frame such as the time of the first, last or middle sample.
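As a rough illustration of such a convention, the following Python sketch computes the time assigned to an audio frame from the time of its first, last, or middle sample. The 48,000 samples-per-second rate and the 64-sample frame size are taken from the audio example above; the function name `frame_time` and its `reference` parameter are hypothetical names introduced only for this sketch.

```python
from fractions import Fraction

SAMPLE_RATE = 48_000      # samples per second (from the audio example above)
SAMPLES_PER_FRAME = 64    # samples grouped into one frame

def frame_time(frame_index, reference="first"):
    """Return the time, in seconds, assigned to a frame under a chosen convention."""
    first = frame_index * SAMPLES_PER_FRAME      # index of the frame's first sample
    last = first + SAMPLES_PER_FRAME - 1         # index of the frame's last sample
    if reference == "first":
        sample_pos = Fraction(first)
    elif reference == "last":
        sample_pos = Fraction(last)
    elif reference == "middle":
        # Midpoint between the first and last sample; may fall between two samples.
        sample_pos = Fraction(first + last, 2)
    else:
        raise ValueError("reference must be 'first', 'last', or 'middle'")
    # Exact rational arithmetic avoids introducing rounding error here.
    return sample_pos / SAMPLE_RATE
```

Whichever reference point is chosen, what matters is that every component of the system applies the same one, so that times assigned to frames from different streams remain comparable.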
In some cases, a frame can be further subdivided into even smaller processing units, which can be called fields. One example of this is in the use of interlaced-scan video, in which the sampling of alternating lines in a picture is separated so that half of the lines of each picture are sampled as one field at one instant in time, and the other half of the lines of the picture are then sampled as a second field a short time later. For example, lines 1, 3, 5, etc. may be sampled as one field of the picture, and then lines 0, 2, 4, etc. of the picture may be sampled as the second field a short time later (for example, 1/50th of a second later). In such interlaced-scan video, each frame can typically be separated into two fields.
Similarly, one could view a grouping of 64 samples of an audio waveform for purposes of data compression or packet-network transmission to be a frame, and each group of eight samples within that frame to be a field. In this example, there would be eight fields in each frame, each containing eight samples.
In some methods of using sampled media streams that are well known in the art, frames or fields may consist of overlapping sets of samples or transformations of overlapping sets of samples. Two examples of this behavior are the use of lapped orthogonal transforms [1) Henrique Sarmento Malvar, Signal Processing with Lapped Transforms, Boston, Mass., Artech House, 1992; 2) H. S. Malvar and D. H. Staelin, “The LOT: transform coding without blocking effects,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 553–559, April 1989; 3) H. S. Malvar, Method and system for adapting a digitized signal processing system for block processing with minimal blocking artifacts, U.S. Pat. No. 4,754,492, June 1988.] and audio redundancy coding [1) J. C. Bolot, H. Crepin, A. Vega-Garcia: “Analysis of Audio Packet Loss in the Internet”, Proceedings of the 5th International Workshop on Network and Operating System Support for Digital Audio and Video, pp. 163–174, Durham, April 1995; 2) C. Perkins, I. Kouvelas, O. Hodson, V. Hardman, M. Handley, J. C. Bolot, A. Vega-Garcia, S. Fosse-Parisis: “RTP Payload for Redundant Audio Data”, Internet Engineering Task Force Request for Comments RFC2198, 1997.]. Even in such cases it is still possible to establish a convention by which a time is associated with a frame or field of samples.
In some cases, the sampling pattern will be very regular in time, such as in typical audio processing in which all samples are created at rigidly-stepped times controlled by a precise clock signal. In other cases, however, the time between adjacent samples in a sequence may differ from location to location in the sequence.
One example of such behavior is when sending audio over a packet network with packet losses, which may result in some frames not being received by the decoder, while the frames that are received should still be played with their original relative timing. Another example of such behavior is in low-bit-rate videoconferencing, in which the number of frames sent per second is often varied depending on the amount of motion in the scene (since small changes take less data to send than large changes, and the overall channel data rate in bits per second is normally fixed).
If the underlying sampling structure is such that there is understood to be a basic frame or field processing unit sampling rate (although some processing units may be skipped), then it is useful to be able to identify a processing unit as a distinct counting unit in the time representation. If this is incorporated into the design, the occurrence of a skipped processing unit may be recognized by a missing value of the counting unit (e.g., if the processing unit count proceeds as 1, 2, 3, 4, 6, 7, 8, 9, . . . , then it is apparent that count number 5 is missing).
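The gap detection described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the counts arrive as a strictly increasing list of integers; the function name `missing_counts` is hypothetical.

```python
def missing_counts(counts):
    """Return the counting-unit values skipped in a strictly increasing sequence."""
    missing = []
    for previous, current in zip(counts, counts[1:]):
        # A step larger than one means one or more processing units were skipped.
        missing.extend(range(previous + 1, current))
    return missing

print(missing_counts([1, 2, 3, 4, 6, 7, 8, 9]))  # [5]
```

A real system would also need to handle wrap-around of the counter, which this sketch deliberately ignores.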
If the underlying sampling structure is such that the sampling is so irregular that there is no basic processing unit sampling rate, then what is needed is simply a good representation of true time for each processing unit. Normally, however, in such a case there should at least be a common time clock against which the location of the processing unit can be referenced.
In either case (with regular or irregular sampling times), it is useful for a multimedia system to record and use timing information for the samples or frames or fields of each processing unit of the media content.
Different types of media may require different sampling rates. But if timing information is always stored with the same precision, a certain amount of rounding error may be introduced by the method used for representing time. It is desirable for the recorded time associated with each sample to be represented precisely in the system with little or no such rounding error. For example, if a media stream operates at 30,000/1001 frames per second (the typical frame rate of North American standard NTSC broadcast video, approximately 29.97 frames per second) and the time values used in the system have a precision of 10^-6 seconds (one microsecond), then although the time values may be very precise in human terms, it may appear to processing elements within the system that the precisely-regular sample timing (e.g. 1001/30,000 seconds per sample) is not precisely regular (e.g. 33,366 clock increment counts between samples, followed by 33,367 increments, then 33,367 increments, and then 33,366 increments again). This can cause difficulties in determining how to properly handle the media samples in the system.
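The irregular increment pattern in this example can be reproduced with a short Python sketch. It assumes that each frame time is truncated (rounded down) to a whole number of microsecond clock ticks, which is one plausible reading of the example; the names `FRAME_PERIOD`, `TICKS_PER_SECOND`, and `tick_increments` are hypothetical.

```python
from fractions import Fraction

FRAME_PERIOD = Fraction(1001, 30_000)   # exact seconds per frame (NTSC)
TICKS_PER_SECOND = 1_000_000            # one clock tick = 10^-6 seconds

def tick_increments(num_frames):
    """Increments between successive frame times after truncation to whole ticks."""
    # Assumption: each exact frame time is truncated to an integer tick count.
    stamps = [int(n * FRAME_PERIOD * TICKS_PER_SECOND) for n in range(num_frames + 1)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]

print(tick_increments(4))  # [33366, 33367, 33367, 33366]
```

The exact period is 33,366 and two-thirds ticks, so no fixed integer increment can represent it, and the truncated increments repeat with a period of three frames.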
Another problem in finding a method to represent time is that the representation may “drift” with respect to true time as would be measured by a perfectly ideal “wall clock”. For example, if the system uses a precisely-regular sample timing of 1001/30,000 seconds per sample and all samples are represented with incremental time intervals being 33,367 increments between samples, the overall time used for a long sequence of such samples will be somewhat longer than the true time interval: each sample is overstated by one third of a microsecond, which adds up to nearly one second per day (about 26 frame times) and accumulates to more than five minutes of error after a year of duration.
Thus, drift is defined as any error in a timecode representation of sampling times that would (if uncorrected) tend to increase in magnitude as the sequence of samples progresses.
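The accumulation of drift in the 33,367-increment example can be checked with a small Python sketch. It assumes microsecond clock ticks and a fixed increment applied to every sample; the names `TRUE_PERIOD_US`, `ASSUMED_PERIOD_US`, and `drift_seconds` are hypothetical.

```python
from fractions import Fraction

TRUE_PERIOD_US = Fraction(1001, 30_000) * 1_000_000   # exact frame period in microseconds
ASSUMED_PERIOD_US = 33_367                            # fixed increment used by the representation

def drift_seconds(elapsed_seconds):
    """Excess of represented time over true time after elapsed_seconds of true time."""
    # Number of frames that occur during the elapsed true time.
    frames = Fraction(elapsed_seconds * 1_000_000) / TRUE_PERIOD_US
    # Each frame is overstated by the difference between the two periods.
    excess_us = frames * (ASSUMED_PERIOD_US - TRUE_PERIOD_US)
    return float(excess_us / 1_000_000)

per_day = drift_seconds(24 * 60 * 60)
per_year = drift_seconds(365 * 24 * 60 * 60)
print(f"{per_year / 60:.2f} minutes of drift per year")
```

Under these assumptions the per-sample error is only one third of a microsecond, yet the yearly total comes to a little over five minutes, which is why an error that grows with the length of the sequence is treated separately from a bounded rounding error.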
One example of a method of representing timing information is found in the SMPTE 12M design [Society of Motion Picture and Television Engineers, Recommended Practice 12M: 1999] (hereinafter called “SMPTE timecode”). SMPTE timecodes are typically used for television video data with timing specified in the United States by the National Television System Committee (NTSC) television transmission format, or in Europe, by the Phase Alternating Line (PAL) television transmission format.
Background on SMPTE Timecode
SMPTE timecode is a synchronization signaling method originally developed for use in the television and motion picture industry to deal with video tape technology. The challenge originally faced with videotape was that there was no “frame accurate” way to synchronize devices for video or sound-track editing. A number of methods were employed in the early days, but because of the inherent slippage and stretching properties of tape, attempts at frame-accurate synchronization met with limited success. The introduction of SMPTE timecode provided this frame accuracy and incorporated additional functionality. Additional sources on SMPTE include “The Time Code Handbook” by Cipher Digital Inc., which provides a complete treatment of the subject, as well as an appendix containing ANSI Standard SMPTE 12M-1986. Additionally, a text entitled “The Sound Reinforcement Handbook” by Gary Davis and Ralph Jones for Yamaha contains a section on timecode theory and applications.
The chief purpose of SMPTE timecode is to synchronize various pieces of equipment. The timecode signal is formatted to provide a system-wide clock that is referenced by everything else. The signal is usually encoded directly with the video signal or is distributed via standard audio equipment. Although SMPTE timecode uses many references from video terminology, it is sometimes also used for audio-only applications.
In many applications, a timecode source provides the signal while the rest of the devices in the system synchronize to it and follow along. The source can be a dedicated timecode generator, or it can be (and often is) a piece of the production equipment that provides timecode in addition to its primary function. An example of this would be a multi-track audio tape deck that is providing timecode on one track and sound for the production on other tracks. Video tape often makes similar use of a cue track or one of its audio sound tracks to record and play back timecode.
In other applications, namely video, the equipment uses timecode internally to synchronize multiple timecode sources into one. An example would be a video editor that synchronizes with timecode from a number of prerecorded scenes. As each scene is combined with the others to make the final product, their respective timecodes are synchronized with new timecode being recorded to the final product.
SMPTE Time Address
SMPTE timecode provides a unique address for each frame of a video signal. This address is an eight-digit number, based on the 24-hour clock and the video frame rate, representing Hours, Minutes, Seconds and Frames in the following format: