The present invention relates to a system for encoding, sending, and receiving audio and video that includes auxiliary data.
Digitally encoding video, consisting of multiple sequential frames of images, is a common task that traditionally has been performed by dividing each frame into a set of pixels and defining a sample luminance value, and possibly a color sample value, for each pixel. The development of powerful inexpensive computer technology has made it cost effective to develop complex systems for encoding moving pictures to achieve significant data compression. This permits, for example, high definition television (HDTV) to be broadcast within a limited bandwidth. The international standard that has been chosen for the transmission of HDTV is the Moving Picture Experts Group 2 (MPEG-2) standard. The MPEG-2 video compression standard is based on both intra and inter coding of video fields or frames and achieves compression by taking advantage of either spatial or temporal redundancy in the video sequence.
An additional video encoding system has been proposed, designated as the MPEG-4 standard. MPEG-4 is a system of encoding video by which a moving picture composed of a sequence of frames, generally accompanied by audio information, may be transmitted or otherwise sent from a moving picture sending device (“sender”) to a moving picture receiving device (“receiver”) in an efficient and flexible manner. Interactive applications are also anticipated in the MPEG-4 standard, but for ease of description, this patent application describes the case in which a moving picture is sent from the sender to the receiver only. It is to be understood that this patent application includes the case in which control information may be sent from receiver to sender.
MPEG-4 supports the transmission of a “Binary Format for Scene Description” (BIFS) which specifies the composition of objects in a sequence of frames or fields representing a three-dimensional scene. In BIFS a scene is divided into a plurality of “scene elements.” For example, as shown in FIG. 1A, a scene with a presenter, a blackboard, a desk, a globe, and an audio accompaniment is broken up into scene elements in panel 12, and reconstituted into a complete scene, with all the scene elements properly arranged, in panel 14. In other words, each frame of an MPEG-4 based video system is composed of a plurality of scene elements that are arranged according to directives specified in the associated BIFS information. As such, MPEG-4, similar to MPEG-2, is directed to a system for encoding and transmitting video and audio information for viewing a movie, albeit using a different protocol than MPEG-2. An example of a suitable systems decoder for MPEG-4 is shown in FIG. 1B.
Referring to FIG. 2, each scene element 20 is represented by an object to which is assigned a node 26 in a generally hierarchical tree structure (panel 16). The scene elements requiring input streaming data include an object descriptor 14. The tree structure provides a convenient data structure for representing a sequence of video fields or frames. Each object descriptor in turn includes at least one elementary stream descriptor 28 which includes such information as the data rate, location information regarding the associated data stream(s) and decoding requirements for its respective logical channel (described below) for updating information regarding the data object. The data stream sent through a particular pipe for a particular object descriptor is generally referred to as an elementary stream. Such information may include, for example, data describing a change in shape, colorization, brightness, and location. Every elementary stream descriptor includes a decoder type (or equivalent stream type structure) which publishes the format or encoding algorithm used to represent the transmitted elementary stream data. FIGS. 3 and 4 show a simplified object descriptor format and an elementary stream descriptor format, respectively.
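The hierarchy described above can be illustrated with a minimal Python sketch. It is not the MPEG-4 syntax: the class names, field names, and example values are hypothetical, chosen only to mirror the relationships stated in the text (a node per scene element, an object descriptor on elements that need streaming data, and at least one elementary stream descriptor per object descriptor).

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class ElementaryStreamDescriptor:
    es_id: int            # hypothetical identifier for the elementary stream
    decoder_type: str     # format or encoding algorithm of the stream data
    data_rate_bps: int    # data rate of the stream
    location: str         # where the associated data stream can be found


@dataclass
class ObjectDescriptor:
    od_id: int
    es_descriptors: List[ElementaryStreamDescriptor]

    def __post_init__(self) -> None:
        # The text requires at least one elementary stream descriptor.
        if not self.es_descriptors:
            raise ValueError("object descriptor needs an elementary stream descriptor")


@dataclass
class SceneNode:
    name: str
    object_descriptor: Optional[ObjectDescriptor] = None  # only for streamed elements
    children: List["SceneNode"] = field(default_factory=list)


def streamed_nodes(node: SceneNode) -> Iterator[str]:
    """Yield, depth-first, the names of nodes that carry an object descriptor."""
    if node.object_descriptor is not None:
        yield node.name
    for child in node.children:
        yield from streamed_nodes(child)


# Example scene loosely following FIG. 1A; all values are made up.
video = ElementaryStreamDescriptor(1, "video", 1_500_000, "stream://presenter")
audio = ElementaryStreamDescriptor(2, "audio", 64_000, "stream://voice")
scene = SceneNode("scene", children=[
    SceneNode("presenter", ObjectDescriptor(10, [video])),
    SceneNode("desk", children=[SceneNode("globe")]),
    SceneNode("audio", ObjectDescriptor(11, [audio])),
])
print(list(streamed_nodes(scene)))
```

In this sketch the desk and globe are static scene elements with no object descriptor, so only the presenter and audio nodes are reported as requiring streaming data.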
In an MPEG-4 system, both the sender and the receiver can use a respective “Delivery Multimedia Integration Framework” (DMIF) for assigning the data channels and time multiplexing scheme based on requests from the MPEG-4 applications. The DMIF, in effect, acts as an interface between an MPEG-4 application and the underlying transport layer, sometimes referred to herein as a “data pipe.” The pipes generally refer to the physical or logical interconnection between a server and a client. As an object descriptor is sent from the sender to the receiver, the sender DMIF examines each of its elementary stream descriptors 28 and assigns a pipe to each one based on the requirements 30, 32, 34 and 36 included in the elementary stream descriptors 28. For example, an elementary stream with a more exacting quality of stream requirement 34 would be assigned a pipe that would permit high quality transmission. The pipe assignment is sent to the receiving DMIF. For example, depending on the transmission system, each pipe (logical channel) could be a different periodic time portion of a single connection; each pipe could be a different connection through the internet; or each pipe could be a different telephone line. In general, DMIF establishes the connections and ensures that both the sender and the receiver are in synchronization for tying a particular pipe to a particular elementary stream at both the sending and the receiving end.
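The requirement-driven pipe assignment described above might be sketched as follows. This is an illustrative assumption, not the DMIF algorithm: the `Pipe` and `StreamRequirements` types, the integer quality scale, and the greedy matching strategy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Pipe:
    pipe_id: int
    max_rate_bps: int
    quality: int          # hypothetical scale, higher means better transmission


@dataclass(frozen=True)
class StreamRequirements:
    es_id: int
    data_rate_bps: int
    min_quality: int      # stands in for the quality requirement in the descriptor


def assign_pipes(streams: List[StreamRequirements],
                 pipes: List[Pipe]) -> Dict[int, int]:
    """Map each elementary stream ID to the ID of a pipe that meets its needs."""
    # Try the least capable pipes first so better pipes stay available.
    free = sorted(pipes, key=lambda p: (p.quality, p.max_rate_bps))
    assignment: Dict[int, int] = {}
    # Place the most demanding streams first so they are not starved.
    ordered = sorted(streams,
                     key=lambda s: (s.min_quality, s.data_rate_bps),
                     reverse=True)
    for stream in ordered:
        for pipe in free:
            if (pipe.quality >= stream.min_quality
                    and pipe.max_rate_bps >= stream.data_rate_bps):
                assignment[stream.es_id] = pipe.pipe_id
                free.remove(pipe)   # one pipe per elementary stream
                break
        else:
            raise RuntimeError(f"no suitable pipe for stream {stream.es_id}")
    return assignment
```

Under these assumptions, a stream with a stricter quality requirement is matched to a higher-quality pipe, and the resulting mapping is what the sending side would communicate to the receiving side.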
The DMIF also assigns an association tag to each elementary stream. This tag provides a single identification of each elementary stream from the sending DMIF to the receiving DMIF and vice versa. Accordingly, the association tag is a unique end-to-end DMIF pipe identifier. Any channel change or reassignment initiated by the sender or receiver (for an interactive system) would require a different association tag.
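As a rough illustration of the association tag behavior, the sketch below keeps a table in which every assignment or reassignment of an elementary stream to a pipe yields a fresh tag. The class name, method names, and the use of a simple counter for tag values are assumptions made for this example only.

```python
import itertools
from typing import Dict


class AssociationTagTable:
    """Hypothetical table tying each elementary stream to a pipe via a unique tag."""

    def __init__(self) -> None:
        self._next_tag = itertools.count(1)
        self._tag_by_stream: Dict[int, int] = {}   # es_id -> current tag
        self._pipe_by_tag: Dict[int, int] = {}     # tag -> pipe_id

    def assign(self, es_id: int, pipe_id: int) -> int:
        """Bind es_id to pipe_id under a new tag, retiring any earlier tag."""
        old_tag = self._tag_by_stream.pop(es_id, None)
        if old_tag is not None:
            del self._pipe_by_tag[old_tag]
        tag = next(self._next_tag)
        self._tag_by_stream[es_id] = tag
        self._pipe_by_tag[tag] = pipe_id
        return tag

    def pipe_for(self, tag: int) -> int:
        """Look up the pipe currently identified by an association tag."""
        return self._pipe_by_tag[tag]
```

Because a reassignment retires the old tag, a stale tag held by either end no longer resolves to a pipe in this sketch.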
Referring to FIG. 1B, DMIF delivers data to an MPEG-4 application by way of “Access Unit Layer Protocol Data Units” (AL-PDUs). The AL-PDUs configuration descriptor 26 in the elementary stream descriptor 28 (FIG. 4) specifies the AL-PDUs time coordination and required buffer size characteristics. The purpose is to ensure that AL-PDUs will arrive in a timely manner and convey enough data to permit the data object to be fully updated for timely display. The AL-PDUs are stored in decode buffers, are reassembled into access units, are decoded (in accordance with the Decoder Type associated with the elementary stream), and are stored in a configuration buffer, to be placed into the scene by the compositor, which arranges the scene elements of the scene. DMIF also configures the multiplexing and access layer for each elementary stream based on its AL-PDU Configuration Descriptor. When DMIF is not used (typically in the case of a storage network), the association tag is replaced by a channel number identifying a logical or a physical channel containing the elementary stream.
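The reassembly step, in which buffered AL-PDUs are joined into access units before decoding, could look roughly like the sketch below. The `AlPdu` type and its `au_start` flag are hypothetical simplifications; the actual AL-PDU header layout is not given in this text.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class AlPdu:
    au_start: bool    # hypothetical flag: True on the first PDU of an access unit
    payload: bytes


def reassemble(pdus: Iterable[AlPdu]) -> Iterator[bytes]:
    """Join consecutive AL-PDU payloads into complete access units."""
    current: Optional[bytearray] = None
    for pdu in pdus:
        if pdu.au_start:
            if current is not None:
                yield bytes(current)          # previous access unit is complete
            current = bytearray(pdu.payload)
        elif current is not None:
            current.extend(pdu.payload)
        # A PDU seen before any start flag has no access unit to join; drop it.
    if current is not None:
        yield bytes(current)
```

Each yielded access unit would then be handed to the decoder selected by the stream's decoder type.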
The process of sending a video stream begins with a moving picture request from the receiver application (in a broadcast system the request goes no further than the receiver DMIF), followed by a sequence of data from a sender application designed to establish an initial scene configuration. In FIG. 4, one of the decoder types is for scene description and another is for object descriptors. In the initial scene configuration a number of object descriptors are typically established. Each data object descriptor includes at least one elementary stream descriptor.
Another issue faced in an MPEG-4 system is avoiding collision among elementary stream identifiers when two applications run simultaneously on top of the same DMIF session. Each elementary stream ID is unique within an application. Therefore, the underlying DMIF session must be able to distinguish requests from each application in the case where elementary stream ID values collide.
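One way to picture the collision problem is a session that keys its channel table on the pair of application and elementary stream ID rather than on the elementary stream ID alone. The sketch below is only an assumption about how such disambiguation could be done; `DmifSession`, `app_id`, and the channel numbering are hypothetical names introduced for illustration.

```python
from typing import Dict, Tuple


class DmifSession:
    """Hypothetical session shared by several applications."""

    def __init__(self) -> None:
        self._channels: Dict[Tuple[str, int], int] = {}
        self._next_channel = 1

    def open_stream(self, app_id: str, es_id: int) -> int:
        """Allocate a channel for an application's elementary stream."""
        key = (app_id, es_id)   # es_id alone is unique only within one application
        if key in self._channels:
            raise ValueError(f"stream {es_id} already open for {app_id}")
        channel = self._next_channel
        self._next_channel += 1
        self._channels[key] = channel
        return channel

    def channel_for(self, app_id: str, es_id: int) -> int:
        """Return the channel previously allocated for this application's stream."""
        return self._channels[(app_id, es_id)]
```

With this keying, two applications can both use the same elementary stream ID value and still be routed to distinct channels.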
Hamaguchi U.S. Pat. No. 5,276,805 discloses an image filing system in which retrieval data for a first set of images is associated with image data for a first image that is logically associated with the first set of images. For example, a user viewing an x-ray image of a patient's kidney would be able to view a list of a set of x-ray images taken from the same patient and be able to quickly view any of the set of x-ray images upon request. Hamaguchi teaches that the retrieval information is associated with the image as a whole.
Judson U.S. Pat. No. 5,572,643 discusses an already existing technology, known as hypertext, by which an internet browser permits a user to access an internet URL by clicking on a highlighted word in a page accessed on the internet. Judson teaches that the retrieval information is associated only with the text. Cohen et al. U.S. Pat. No. 5,367,621 is quite similar to Judson, permitting a user to click on a word in an on-line book, thereby causing a multimedia object to be displayed. Both Judson and Cohen et al. disclose advances in linking together data, which are unrelated to the encoding and data compression necessary for MPEG-2 and MPEG-4.