Multimedia search, retrieval and filtering applications need a standard reference system to specify regions, locations and sizes along the dimensions of space, time, resolution, and frequency. For example, this requirement is being addressed by MPEG-7, which is standardizing the interface for multimedia content search and retrieval. The objective of MPEG-7 is to improve the way in which audio-visual content is indexed, searched, browsed, filed and filtered in a large number of multimedia storage and retrieval applications.
The audio-visual search and filtering applications need to deal with the dimensions of space, time, resolution, and frequency explicitly, such as when referring to spatial regions of images, temporal units of video, low-resolution versions of images, or frequency bands of audio. The applications also need to be able to refer to regions and locations in space, time, resolution and frequency without concern for the underlying storage and compression formats for the image, video and audio data. The applications need this higher-level interface to insulate them from the specific details of the data bit-streams and view extraction methods. In order to enable this abstraction, a description scheme is needed for specifying location, size and division along the dimensions of space, time, frequency and resolution as they pertain to the image, video and audio data, and for relating the views of the data with the data storage formats.
Independent of the compression and storage formats, digital multimedia content such as images, video and audio has an inherent lattice structure. For example, image data consists of pixels which are samples on a 2-D spatial grid. Similarly, audio and video data consist of samples in time, or in space and time, respectively. Spatial and temporal views, such as spatial quadrants of an image or temporal segments of video, correspond to the data in parts of the lattice. Frequency views, such as low-resolution images, wavelet subbands, or temporal-frequency channels, correspond to segments of the data after having been transformed into the frequency domain. Together, the notions of segmentation in space and in frequency permit the development of a general notion of views of lattice data as follows: a view is a region in multi-dimensional space (including time) and multi-dimensional frequency (including resolution) that has location and size. Since audio-visual applications deal with views routinely, a description scheme is needed to describe views.
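The notion of a view as a region with location and size can be illustrated with a minimal sketch. The names Interval, View and extract_spatial are hypothetical and are introduced only for illustration; they are not part of MPEG-7 or of any format discussed here. The sketch assumes a 2-D image held as a list of rows and shows only the spatial part of a view; the frequency field is carried as a placeholder.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interval:
    """Half-open extent [start, start + size) along one lattice dimension."""
    start: int
    size: int

    @property
    def stop(self) -> int:
        return self.start + self.size


@dataclass(frozen=True)
class View:
    """Hypothetical view descriptor: location and size per dimension."""
    space: Tuple[Interval, ...]            # e.g. (rows, cols) for an image
    frequency: Tuple[Interval, ...] = ()   # placeholder; empty means untransformed data


def extract_spatial(image: List[List[int]], view: View) -> List[List[int]]:
    """Return the samples of a 2-D lattice covered by the view's spatial region."""
    rows, cols = view.space
    return [row[cols.start:cols.stop] for row in image[rows.start:rows.stop]]


# A 4x4 image whose pixel values are simply their raster index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# Spatial quadrants expressed as views (location and size in rows and columns).
upper_left = View(space=(Interval(0, 2), Interval(0, 2)))
lower_right = View(space=(Interval(2, 2), Interval(2, 2)))

quadrant_ul = extract_spatial(image, upper_left)
quadrant_lr = extract_spatial(image, lower_right)
```

An application written against such a descriptor refers only to location and size, and leaves the mapping from the view to stored bytes to a separate layer.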
Previously, methods have been developed for describing regions of images. The TIFF image format provides the ability to specify tiling of images in space, as described in the TIFF Revision 6.0 by the Adobe Developers Association (Jun. 23, 1992). In the TIFF image format, each tile corresponds to a rectangular region of the image. In another approach, the Flashpix image format provides the ability to specify tiling of images at a finite number of resolutions, as detailed by Eastman Kodak Company authors in "FlashPix Format and Architecture White Paper" (Jun. 17, 1996). Each Flashpix tile corresponds to a rectangular region of the image at a particular resolution. More general tilings can have location and size in space and frequency as demonstrated in the space and frequency graph, as taught by J. R. Smith and S.-F. Chang, in "Space and Frequency Adaptive Wavelet Packets", Proc. IEEE Intern. Conf. on Acoustics, Speech and Signal Proc., ICASSP-95, Detroit, Mich., (May 1995) and in "Joint Adaptive Space and Frequency Basis Selection", Proc. IEEE Intern. Conf. on Image Processing, ICIP-97, Santa Barbara, Calif., (October 1997). In the space and frequency graph, each view element is a multi-dimensional rectangular region in space and frequency, representing the image data that belongs to a given spatial region and frequency range.
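The two kinds of tiling described above, rectangular tiles in space and rectangular tiles at a finite number of resolutions, can be sketched as follows. This is an illustrative enumeration only and does not reproduce the TIFF or Flashpix file layouts. The function names tile_grid and multires_tiles are hypothetical, halving the image size at each resolution level is an assumption of the sketch, and edge tiles are clipped for simplicity, whereas an actual format may pad them instead.

```python
from typing import Iterator, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def tile_grid(width: int, height: int, tile: int) -> Iterator[Rect]:
    """Enumerate rectangular tiles covering a width x height image.

    Tiles on the right and bottom edges are clipped to the image bounds
    (an assumption of this sketch).
    """
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield (x, y, min(tile, width - x), min(tile, height - y))


def multires_tiles(width: int, height: int, tile: int,
                   levels: int) -> Iterator[Tuple[int, Rect]]:
    """Enumerate (level, tile) pairs, halving the image size at each level.

    Level 0 is full resolution; the halving rule is assumed for illustration.
    """
    for level in range(levels):
        w = max(1, width >> level)
        h = max(1, height >> level)
        for rect in tile_grid(w, h, tile):
            yield (level, rect)


# Spatial tiling of a 5x4 image with 2x2 tiles.
tiles = list(tile_grid(5, 4, 2))

# Tiling of an 8x8 image with 4x4 tiles at two resolution levels.
pyramid = list(multires_tiles(8, 8, 4, 2))
```

Each enumerated tile carries only a location and a size, and the multi-resolution variant adds a resolution level.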
While these examples illustrate specific types of decompositions of image and video data, they do not provide a general system for describing views produced by decompositions in space and frequency. Such a system is needed to enable interoperability between image and video systems that deal with a diversity of decompositions and views of the data.