1. Field of the Invention
The present invention relates generally to marking multimedia files. More specifically, the present invention relates to applying or inserting tags into multimedia files for indexing and searching, as well as for editing portions of multimedia files, all to facilitate the storing, searching, and retrieving of the multimedia information.
2. Background of the Related Art
1. Multimedia Bookmarks
With the phenomenal growth of the Internet, the amount of multimedia content that can be accessed by the public has virtually exploded. There are occasions where a user who once accessed particular multimedia content needs or desires to access the content again at a later time, possibly at or from a different place. For example, in the case of data interruption due to a poor network condition, the user may be required to access the content again. In another case, a user who once viewed multimedia content at work may want to continue to view the content at home. Most users would want to restart accessing the content from the point where they had left off. Moreover, subsequent access may be initiated by a different user in an exchange of information between users. Unfortunately, multimedia content is represented in a streaming file format so that a user has to view the file from the beginning in order to look for the exact point where the first user left off.
In order to save the time involved in browsing the data from the beginning, the concept of a bookmark may be used. A conventional bookmark marks a document such as a static web page for later retrieval by saving a link (address) to the document. For example, Internet browsers support a bookmark facility by saving an address called a Uniform Resource Identifier (URI) to a particular file. Internet Explorer, manufactured by the Microsoft Corporation of Redmond, Wash., uses the term “favorite” to describe a similar concept.
Conventional bookmarks, however, store only the information related to the location of a file, such as the directory name with a file name, a Uniform Resource Locator (URL), or the URI. The files referred to by conventional bookmarks are treated in the same way regardless of the data formats for storing the content. Typically, a simple link is used for multimedia content also. For example, to link to a multimedia content file through the Internet, a URI is used. Each time the file is revisited using the bookmark, the multimedia content associated with the bookmark is always played from the beginning.
FIG. 1 illustrates a list 108 of conventional bookmarks 110, each comprising positional information 112 and title 114. The positional information 112 of a conventional bookmark is composed of a URI as well as a bookmarked position 106. The bookmarked position is a relative time or byte position measured from the beginning of the multimedia content. The title 114 can be specified by a user or delivered with the content, and it is typically used to help the user recognize the bookmarked URI in a bookmark list 108. With a conventional bookmark that has no bookmarked position, when a user wants to replay the specified multimedia file, the file is played from the beginning each time, regardless of how much of the file the user has already viewed. The user has no choice but to note the last accessed position and to move manually to the last stopped point. If the multimedia file is viewed by streaming, the user must go through a series of buffering delays to find the last accessed position, thus wasting much time. Even with a conventional bookmark that has a bookmarked position, the same problem occurs when the multimedia content is delivered in a live broadcast, since the bookmarked position within the multimedia content is not usually available, and when the user wants to replay one of the variations of the bookmarked multimedia content.
Further, conventional bookmarks do not provide a convenient way of switching between different data formats. Multimedia content may be generated and stored in a variety of formats. For example, video may be stored in the formats such as MPEG, ASF, RM, MOV, and AVI. Audio may be stored in the formats such as MID, MP3, and WAV. There may be occasions where a user wants to switch the play of content from one format to another. Since different data formats produced from the same multimedia content are often encoded independently, the same segment is stored at different temporal positions within the different formats. Since conventional bookmarks have no facility to store any content information, users have no choice but to review the multimedia content from the beginning and to search manually for the last-accessed segment within the content.
Time information may be incorporated into a bookmark to return to the last-accessed segment within the multimedia content. The use of time information only, however, fails to return to exactly the same segment at a later time for the following reasons. If a bookmark incorporating time information was used to save the last-accessed segment during the preview of multimedia content broadcast, the bookmark information would not be valid during a regular full-version broadcast, so as to return to the last-accessed segment. Similarly, if a bookmark incorporating time information was used to save the last-accessed segment during real-time broadcast, the bookmark would not be effective during later access because the later available version may have been edited or a time code was not available during the real-time broadcast.
In many video and audio archiving systems, several differently compressed files, called “variations”, are produced from a single source multimedia content. Many web-casting sites provide multiple streaming files for a single video content with different bandwidths according to each video format. For example, CNN.com provides five different streaming videos for a single video content: two different types of streaming videos with the bandwidths of 28.8 kbps and 80 kbps, both encoded in Microsoft's Advanced Streaming Format (ASF). CNN.com also provides the RM streaming format by RealNetworks, Inc. of Seattle, Wash. (RM), and a streaming video with the smart bandwidth encoded in Apple Computer, Inc.'s QuickTime streaming format (MOV). In this case, the five video files may start and end at different time points from the viewpoint of the source video content, since each variation may be produced by an independent encoding process varying the values chosen for encoding formats, bandwidths, resolutions, etc. This results in mismatches of time points because a specific time point of the source video content may be presented as different media time points in the five video files.
When a multimedia bookmark is utilized, the mismatches of positions cause a problem of mis-positioned playback. Consider a simple case where one makes a multimedia bookmark on a master file of a multimedia content (for example, video encoded in a given format), and tries to play another variation (for example, video encoded in a different format) from the bookmarked position. If the two variations do not start at the same position of the source content, the playback will not start at the bookmarked position. That is, the playback will start at the position that is temporally shifted with the difference between the start positions of the two variations.
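The temporal shift described above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming (hypothetically) that the start offset of each variation relative to the source content is known; the names `Variation`, `source_start_s` and `map_bookmark` are illustrative only and do not come from any prior-art system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variation:
    """One encoded variation of a source content (hypothetical model)."""
    name: str
    source_start_s: float  # source-content time at which this file begins


def map_bookmark(position_s: float, src: Variation, dst: Variation) -> float:
    """Translate a media-time position in `src` to the same instant in `dst`."""
    source_time = src.source_start_s + position_s  # position on the source timeline
    return source_time - dst.source_start_s        # same instant in dst media time


master = Variation("master.mpg", source_start_s=0.0)
stream = Variation("stream.asf", source_start_s=4.5)

bookmark = 120.0                                    # seconds into the master file
naive = bookmark                                    # reusing the raw position as-is
corrected = map_bookmark(bookmark, master, stream)  # position of the same instant
shift = naive - corrected                           # playback error if offsets are ignored
```

In this example the two variations start 4.5 seconds apart on the source timeline, so replaying the raw bookmarked position in the second variation lands 4.5 seconds away from the intended instant. A negative mapped position would indicate that the target variation does not contain the bookmarked instant at all.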
An entire multimedia presentation is often lengthy, and there are frequent occasions when the presentation is interrupted, voluntarily or forcibly, and terminates before finishing. Examples include a user who starts playing a video at work, leaves the office, and desires to continue watching the video at home, or a user who is forced to stop watching the video and log out due to a system shutdown. It is thus necessary to save the termination position of the multimedia file into persistent storage in order to return directly to the point of termination without a time-consuming playback of the multimedia file from the beginning.
The interrupted presentation of the multimedia file will usually resume exactly at the previously saved terminated position. However, in some cases, it is desirable to begin the playback of the multimedia file a certain time before the terminated point, since such rewinding could help refresh the user's memory.
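The rewinding behavior just described amounts to subtracting a margin from the saved position without going past the start of the file. The sketch below is illustrative only; the function name `resume_position` and the 10-second default margin are assumptions, not values taken from the text.

```python
def resume_position(saved_s: float, rewind_s: float = 10.0) -> float:
    """Return the restart point: `rewind_s` seconds before the saved
    termination position, clamped so it never falls before the start."""
    if rewind_s < 0:
        raise ValueError("rewind_s must be non-negative")
    return max(0.0, saved_s - rewind_s)
```

With the assumed default, a presentation terminated at 95 seconds resumes at 85 seconds, while one terminated at 4 seconds simply restarts from the beginning.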
In the prior art, the EPG (Electronic Program Guide) has played a crucial role as a provider of TV programming information. EPG facilitates a user's efforts to search for TV programs that he or she wants to view. However, EPG's two-dimensional presentation (channels vs. time slots) becomes cumbersome as terrestrial, cable, and satellite systems send out thousands of programs through hundreds of channels. Navigation through a large table of rows and columns in order to search for desired programs is frustrating.
One of the features provided by the recent set-top box (STB) is the personal video recording (PVR) that allows simultaneous recording and playback. Such STB usually contains digital video encoder/decoder based on an international digital video compression standard such as MPEG-1/2, as well as the large local storage for the digitally compressed video data. Some of the recent STBs also allow connection to the Internet. Thus, STB users can experience new services such as time-shifting and web-enhanced television (TV).
However, there still exist some problems for the PVR-enabled STBs. The first problem is that even the latest STBs alone cannot fully satisfy users' ever-increasing desire for diverse functionalities. The STBs now on the market are very limited in terms of computing and memory and so it is not easy to execute most CPU and memory intensive applications. For example, the people who are bored with plain playback of the recorded video may desire more advanced features such as video browsing/summary and search. Actually, all of those features require metadata for the recorded video. The metadata are usually the data describing content, such as the title, genre and summary of a television program. The metadata also include audiovisual characteristic data such as raw image data corresponding to a specific frame of the video stream. Some of the description is structured around “segments” that represent spatial, temporal or spatio-temporal components of the audio-visual content. In the case of video content, the segment may be a single frame, a single shot consisting of successive frames, or a group of several successive shots. Each segment may be described by some elementary semantic information using texts. The segment is referenced by the metadata using media locators such as frame number or time codes. However, the generation of such video metadata usually requires intensive computation and a human operator's help, so practically speaking, it is not feasible to generate the metadata in the current STB. Thus, one possible solution for this problem is to generate the metadata in the server connected to the STB and to deliver it to the STB via network. However, in this scenario, it is essential to know the start position of recorded video with respect to the video stream used to generate the metadata in the server/content provider in order to match the temporal position referenced by the metadata to the position of the recorded video.
The second problem is related to a discrepancy between two time instants: the time instant at which the STB starts recording the user-requested TV program, and the time instant at which the TV program is actually broadcast. Suppose, for instance, that a user initiated a PVR request for a TV program scheduled to go on the air at 11:30 AM, but the actual broadcasting time is 11:31 AM. In this case, when the user wants to play the recorded program, the user has to watch the unwanted segment at the beginning of the recorded video, which lasts for one minute. This time mismatch could bring some inconvenience to the user who wants to view only the requested program. However, the time mismatch problem can be solved by using metadata delivered from the server, for example, reference frames or a reference segment representing the beginning of the TV program. The exact location of the TV program can then be found by matching the reference frames against all the recorded frames for the program.
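The matching of reference frames against recorded frames can be sketched as an exhaustive search for the best-aligned offset. This is a minimal illustration under stated assumptions: each frame is reduced to a short list of luminance samples (a hypothetical signature), the sum of absolute differences is used as the matching cost, and the names `frame_distance` and `find_program_start` are invented for this example.

```python
from typing import List, Sequence

Frame = Sequence[int]  # hypothetical compact signature, e.g. a few luminance samples


def frame_distance(a: Frame, b: Frame) -> int:
    """Sum of absolute differences between two equally sized signatures."""
    return sum(abs(x - y) for x, y in zip(a, b))


def find_program_start(recorded: List[Frame], reference: List[Frame]) -> int:
    """Return the index in `recorded` where the reference frames match best."""
    if not reference or len(reference) > len(recorded):
        raise ValueError("reference must be non-empty and no longer than the recording")
    best_index, best_cost = 0, None
    for start in range(len(recorded) - len(reference) + 1):
        cost = sum(frame_distance(recorded[start + k], ref)
                   for k, ref in enumerate(reference))
        if best_cost is None or cost < best_cost:
            best_index, best_cost = start, cost
    return best_index


# Three frames of unwanted lead-in, followed by the requested program.
recorded = [[10, 10, 10], [12, 11, 10], [11, 10, 12],
            [200, 50, 90], [198, 52, 91], [60, 60, 60]]
reference = [[201, 49, 90], [197, 53, 92]]  # delivered from the server as metadata
start = find_program_start(recorded, reference)
```

Here the reference frames align best with the recording at index 3, so playback could skip the first three recorded frames. A practical system would need a signature that is robust to re-encoding, which this sketch does not address.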
2. Search
The rapid expansion of the World Wide Web (WWW) and mobile communications has also brought great interest in efficient multimedia data search, browsing and management. Content-based image retrieval (CBIR) is a powerful concept for finding images based on image contents, and content-based image search and browsing have been tested using many CBIR systems. See, M. Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Q. Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele and Peter Yanker, “Query by image and video content: The QBIC system,” IEEE Computer, Vol. 28, No. 9, pp. 23-32, September, 1995; Carson, Chad et al., “Region-Based Image Querying [Blobworld],” Workshop on Content-Based Access of Image and Video Libraries, Puerto Rico, June, 1997; J. R. Smith and S. Chang, “Visually searching the web for content,” IEEE Multimedia Magazine, Vol. 4, No. 3, pp. 12-20, Summer 1997, also Columbia U. CU/CTR Technical Report 459-96-25; A. Pentland, R. W. Picard and S. Sclaroff, “A Photobook: tools for content-based manipulation of image databases,” in Proc. of SPIE Conf. on Storage and Retrieval for Image and Video Databases-II, No. 2185, pp. 34-47, San Jose, Calif., February, 1994; J. R. Bach, C. Fuller, A. Guppy, A. Hampapur, B. Horowitz, R. Humphrey, R. C. Jain and C. Shu, “Virage image search engine: an open framework for image management,” Symposium on Electronic Imaging: Science and Technology - Storage & Retrieval for Image and Video Databases IV, IS&T/SPIE'96, February, 1996; J. R. Smith and S. Chang, “VisualSEEk: A Fully Automated Content-Based Image Query System,” ACM Multimedia Conference, Boston, Mass., November, 1996; Jing Huang, S. Ravi Kumar, Mandar Mitra, Wei-Jing Zhu and Ramin Zabih, “Image Indexing Using Color Correlograms,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 762-768, June, 1997; and Simone Santini, and Ramesh Jain, “The ‘El Nino’ Image Database System,” in International Conference on Multimedia Computing and Systems, pp. 524-529, June, 1999.
Currently, most content-based image search engines rely on low-level image features such as color, texture and shape. While high-level image descriptors are potentially more intuitive for common users, the derivation of high-level descriptors is still in its experimental stages in the field of computer vision and requires complex vision processing. On the other hand, despite their efficiency and ease of implementation, the main disadvantage of low-level image features is that they are perceptually non-intuitive for both expert and non-expert users, and therefore do not normally represent users' intent effectively. Furthermore, they are highly sensitive to small image variations in feature shape, size, position, orientation, brightness and color. Perceptually similar images are often highly dissimilar in terms of low-level image features. Searches made by low-level features are often unsuccessful, and it usually takes many trials to find images satisfactory to a user.
Efforts have been made to overcome the limitations of low-level features. Relevance feedback is a popular idea for incorporating a user's perceptual feedback in the image search. See, Y. Rui, T. Huang, and S. Mehrotra, “A relevance feedback architecture in content-based multimedia information retrieval systems,” in IEEE Workshop on Content-based Access of Image and Video Libraries, Puerto Rico, pp. 82-89, June, 1997; Yong Rui, Thomas S. Huang, Michael Ortega, and Sharad Mehrotra, “Relevance Feedback: A Power Tool in Interactive Content-Based Image Retrieval,” in IEEE Trans. on Circuits and Systems for Video Technology, Special Issue on Segmentation, Description, and Retrieval of Video Content, pp. 644-655, Vol. 8, No. 5, September, 1998; G. Aggarwal, P. Dubey, S. Ghosal, A. Kulshreshtha, and A. Sarkar, “iPURE: perceptual and user-friendly retrieval of images,” in Proc. of IEEE International Conference on Multimedia and Exposition, Vol. 2, pp. 693-696, July, 2000; Ye Lu, Chunhui Hu, Xingquan Zhu, HongJiang Zhang and Qiang Yang, “A unified framework for semantics and feature based relevance feedback in image retrieval systems,” in Proc. of ACM International Conference on Multimedia, pp. 31-37, October, 2000; H. Muller, W. Muller, S. Marchand-Maillet, and T. Pun, “Strategies for positive and negative relevance feedback in image retrieval,” in Proc. of IEEE Conference on Pattern Recognition, Vol. 1, pp. 1043-1046, September, 2000; S. Aksoy, R. M. Haralick, F. A. Cheikh, and M. Gabbouj, “A weighted distance approach to relevance feedback,” in Proc. of IEEE Conference on Pattern Recognition, Vol. 4, pp. 812-815, September, 2000; I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos, “The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments,” in IEEE Transaction on Image Processing, Vol. 9, pp. 20-37, January, 2000; P. Muneesawang, and Guan Ling, “Multi-resolution-histogram indexing and relevance feedback learning for image retrieval,” in Proc. of IEEE International Conference on Image Processing, Vol. 2, pp. 526-529, January, 2001. A user can manually establish relevance between a query and retrieved images, and the relevant images can be used for refining the query. When the refinement is made by adjusting a set of low-level feature weights, however, the user's intent is still represented by low-level features and their basic limitations still remain.
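To illustrate what adjusting a set of low-level feature weights can look like, the sketch below uses one commonly described heuristic: a feature that varies little across the images the user marked as relevant receives a larger weight. This is an assumption made for illustration and is not claimed to be the exact method of any cited reference; the names `update_weights` and `weighted_distance` are hypothetical.

```python
from statistics import pstdev
from typing import List, Sequence


def update_weights(relevant: List[Sequence[float]], eps: float = 1e-6) -> List[float]:
    """Weight each feature by the inverse of its spread over the relevant images."""
    if not relevant:
        raise ValueError("at least one relevant image is required")
    dims = len(relevant[0])
    raw = [1.0 / (pstdev([v[d] for v in relevant]) + eps) for d in range(dims)]
    total = sum(raw)
    return [w / total for w in raw]  # normalize so that the weights sum to 1


def weighted_distance(query: Sequence[float], image: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted L1 distance between two low-level feature vectors."""
    return sum(w * abs(q - x) for w, q, x in zip(weights, query, image))


# Feature 0 is stable across the relevant images; feature 1 is not.
relevant = [[0.90, 0.10], [0.90, 0.80], [0.90, 0.40]]
weights = update_weights(relevant)
```

After the update, nearly all of the weight falls on feature 0. The refined query is nevertheless still expressed purely in low-level features, which is the limitation noted above.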
Several approaches have been taken to integrate human perceptual responses and low-level features in image retrieval. One notable approach is to adjust an image's feature distance attributes based on human perceptual input. See, Simone Santini, and Ramesh Jain, “The ‘El Nino’ Image Database System,” in International Conference on Multimedia Computing and Systems, pp. 524-529, June, 1999. Another approach, called “blob world,” combines low-level features to derive slightly higher-level descriptions and presents the “blobs” of grouped features to a user to provide a better understanding of feature characteristics. See, Carson, Chad, et al., “Region-Based Image Querying [Blobworld],” Workshop on Content-Based Access of Image and Video Libraries, Puerto Rico, June, 1997. While those schemes successfully reflect a user's intent to some degree, it remains to be seen how grouping of features or feature distance modification can achieve perceptual relevance in image retrieval. A more traditional computer vision approach to the derivation of high-level object descriptors based on generic object recognition has been presented for image retrieval. See, David A. Forsyth and Margaret Fleck, “Body Plans,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 678-683, June, 1997. Due to its limited feasibility for general image objects and its complex processing, its utility is still restricted.
With the rapid proliferation of large image/video databases, there has been an increasing demand for effective methods to search the large image/video databases automatically by their content. For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find best matches to the query image/video from the database.
Several approaches have been made towards the development of fast, effective multimedia search methods. Milanese et al. utilized hierarchical clustering to organize an image database into visually similar groupings. See, R. Milanese, D. Squire, and T. Pun, “Correspondence analysis and hierarchical indexing for content-based image retrieval,” in Proc. IEEE Int. Conf. Image Processing, Vol. 3, Lausanne, Switzerland, pp. 859-862, September, 1996. Zhang and Zhong provided a hierarchical self-organizing map (HSOM) method to organize an image database into a two-dimensional grid. See, H. J. Zhang and D. Zhong, “A scheme for visual feature based image indexing,” in Proc. SPIE/IS&T Conf. Storage Retrieval Image Video Database III, Vol. 2420, pp. 36-46, San Jose, Calif., February, 1995. However, a weakness of HSOM is that it is generally too computationally expensive to apply to a large multimedia database.
In addition, there are other well known solutions using Voronoi diagram, Kd-tree, and R-tree. See, J. Bentley, “Multidimensional binary search trees used for associative searching,” Comm. of the ACM, Vol. 18, No. 9, pp. 509-517, 1975; S. Brin, “Near neighbor search in large metric spaces,” in Proc. 21st Conf. On Very Large Databases (VLDB'95), Zurich, Switzerland, pp. 574-584, 1995. However, it is also known that those approaches are not adequate for the high dimensional feature vector spaces, and thus, they are useful only in low dimensional feature spaces.
Peer to Peer Searching
Peer-to-Peer (P2P) is a class of applications making the most of previously unused resources (for example, storage, content, and/or CPU cycles), which are available on the peers at the edges of networks. P2P computing allows the peers to share the resources and services, or to aggregate CPU cycles, or to chat with each other, by direct exchange. Two of the more popular implementations of P2P computing are Napster and Gnutella. Napster has its peers register files with a broker, and uses the broker to search for files to copy. The broker plays the role of server in a client-server model to facilitate the interaction between the peers. Gnutella has peers register files with network neighbors, and searches the P2P network for files to copy. Since this model does not require a centralized broker, Gnutella is considered to be a true P2P system.
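The difference between the two search models can be sketched in a few lines. This is a simplified illustration, not an implementation of either protocol: the class `Broker` stands in for a centralized index of the Napster kind, and `flood_search` forwards a query from neighbor to neighbor in the Gnutella manner, with a hop limit (`ttl`) included as an assumption of this sketch to bound the search.

```python
from collections import deque
from typing import Dict, List, Set


class Broker:
    """Centralized file index (broker model, simplified)."""

    def __init__(self) -> None:
        self.index: Dict[str, Set[str]] = {}

    def register(self, peer: str, files: List[str]) -> None:
        for name in files:
            self.index.setdefault(name, set()).add(peer)

    def search(self, name: str) -> Set[str]:
        return set(self.index.get(name, set()))


def flood_search(neighbors: Dict[str, List[str]], shared: Dict[str, Set[str]],
                 origin: str, name: str, ttl: int) -> Set[str]:
    """Decentralized query: forward to neighbors until the hop limit runs out."""
    hits: Set[str] = set()
    seen = {origin}
    queue = deque([(origin, ttl)])
    while queue:
        peer, hops_left = queue.popleft()
        if peer != origin and name in shared.get(peer, set()):
            hits.add(peer)
        if hops_left == 0:
            continue  # this peer does not forward the query any further
        for nxt in neighbors.get(peer, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops_left - 1))
    return hits


# A chain of four peers: a - b - c - d; peers c and d hold the file.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
shared = {"c": {"song.mp3"}, "d": {"song.mp3"}}

broker = Broker()
broker.register("c", ["song.mp3"])
broker.register("d", ["song.mp3"])
near = flood_search(neighbors, shared, "a", "song.mp3", ttl=2)
far = flood_search(neighbors, shared, "a", "song.mp3", ttl=3)
```

The broker answers from its complete index in one lookup, whereas the decentralized search only finds peers within reach of the query: with a hop limit of 2 the query from peer a reaches peer c but not peer d.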
3. Editing
In the prior art, video files were edited through video editing software by copying several segments of the input videos and pasting them to an output video. The prior art method, however, confronts two major problems mentioned below.
The first problem of the prior art method is that it requires additional storage to store the new version of an edited video file. Conventional video editing software generally uses the original input video file to create an edited video. In most cases, editors having a large database of videos edit the existing videos to create a new one. In this case, storage is wasted on duplicated portions of the video. The second problem with the prior art method is that entirely new metadata have to be generated for a newly created video. If the metadata are not edited in accordance with the editing of the video, the metadata may not accurately reflect the content, even if the metadata for the specific segment of the input video have already been constructed. Because considerable effort is required to create the metadata of videos, it is desirable to reuse existing metadata efficiently, if possible.
Metadata of a video segment contain textual information such as time information (for example, starting frame number and duration, or starting frame number as well as the finishing frame number), title, keyword, and annotation, as well as image information such as the key frame of a segment. The metadata of segments can form a hierarchical structure where the larger segment contains the smaller segments. Because it is hard to store both the video and their metadata into a single file, the video metadata are separately stored as a metafile, or stored in a database management system (DBMS).
If metadata having a hierarchical structure are used, it becomes possible to browse a whole video, to search for a segment using the keyword and annotation of each segment, and to use the key frames of each segment as a visual summary of the video. In addition to the existing simple playback, the playback and repeated playback of a specific segment are also supported. Therefore, the use of hierarchically-structured metadata is becoming popular.
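A hierarchical segment description of this kind can be modeled as a small tree. The following sketch is illustrative only; the class `Segment` and its field names are hypothetical and merely mirror the items listed above (time information, title, keyword, annotation, key frame, and nested smaller segments).

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Segment:
    title: str
    start_frame: int
    duration: int                       # length in frames
    keywords: List[str] = field(default_factory=list)
    annotation: str = ""
    key_frame: int = 0                  # frame number used for visual summary
    children: List["Segment"] = field(default_factory=list)

    @property
    def end_frame(self) -> int:
        return self.start_frame + self.duration - 1

    def add(self, child: "Segment") -> "Segment":
        """Attach a smaller segment; it must lie inside this segment."""
        if child.start_frame < self.start_frame or child.end_frame > self.end_frame:
            raise ValueError("child segment must lie inside its parent")
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Segment"]:
        """Depth-first traversal, usable for browsing the whole video."""
        yield self
        for child in self.children:
            yield from child.walk()

    def find(self, keyword: str) -> List["Segment"]:
        """Search all nested segments by keyword."""
        return [s for s in self.walk() if keyword in s.keywords]


news = Segment("Evening news", start_frame=0, duration=9000)
sports = news.add(Segment("Sports", 6000, 3000, keywords=["sports"]))
goal = sports.add(Segment("Goal replay", 7200, 300, keywords=["sports", "goal"]))
titles = [s.title for s in news.find("sports")]
```

Playing a specific segment then reduces to seeking to its `start_frame` and stopping at its `end_frame`, without touching the video file itself.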
4. Transcoding
With the advance of information technology, such as the popularity of the Internet, multimedia presentation proliferates into ever increasing kinds of media, including wireless media. Multimedia data are accessed by ever increasing kinds of devices such as hand-held computers (HHCs), personal digital assistants (PDAs), and smart cellular phones. There is a need for accessing multimedia content in a universal fashion from a wide variety of devices. See, J. R. Smith, R. Mohan and C. Li, “Transcoding Internet Content for Heterogeneous Client Devices,” in Proc. ISCASA, Monterey, Calif., 1998.
Several approaches have been made to enable such universal multimedia access (UMA) effectively. One data representation, the InfoPyramid, is a framework for aggregating the individual components of multimedia content with content descriptions, and methods and rules for handling the content and content descriptions. See, C. Li, R. Mohan and J. R. Smith, “Multimedia Content Description in the InfoPyramid,” in Proc. IEEE Intern. Conf. on Acoustics, Speech and Signal Processing, May, 1998. The InfoPyramid describes content in different modalities, at different resolutions and at multiple abstractions. A transcoding tool then dynamically selects from the InfoPyramid the resolutions or modalities that best meet the client capabilities. J. R. Smith proposed a notion of importance value for each of the regions of an image as a hint to reduce the overall data size in bits of the transcoded image. See, J. R. Smith, R. Mohan and C. Li, “Content-based Transcoding of Images in the Internet,” in Proc. IEEE Intern. Conf. on Image Processing, October, 1998; S. Paek and J. R. Smith, “Detecting image Purpose in World-Wide Web Documents,” in Proc. SPIE/IS&T Photonics West, Document Recognition, January, 1998. The importance value describes the relative importance of the region/block in the image presentation compared with the other regions. This value ranges from 0 to 1, where 1 denotes the most important region and 0 the least important. For example, the regions of high importance are compressed with a lower compression factor than the remaining part of the image. The other parts of the image are first blurred and then compressed with a higher compression factor in order to reduce the overall data size of the compressed image.
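The use of an importance value as a compression hint can be illustrated with a simple mapping. The linear rule and the bounds of 5 and 40 below are assumptions chosen for this sketch; the cited work is not stated to use this particular mapping, and the name `compression_factor` is hypothetical.

```python
def compression_factor(importance: float,
                       min_factor: float = 5.0,
                       max_factor: float = 40.0) -> float:
    """Map an importance value in [0, 1] to a compression factor.

    Importance 1 yields the lowest factor (best fidelity); importance 0
    yields the highest factor (smallest data size)."""
    if not 0.0 <= importance <= 1.0:
        raise ValueError("importance must lie between 0 and 1")
    return max_factor - importance * (max_factor - min_factor)


# Hypothetical regions of one image with their importance values.
regions = {"face": 1.0, "caption": 0.5, "background": 0.0}
factors = {name: compression_factor(v) for name, v in regions.items()}
```

Such a mapping orders the regions by fidelity, but nothing in the importance value itself says whether a given factor still preserves the fidelity the author intended.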
When an image is transmitted to a variety of client devices with different display sizes, a scaling mechanism, such as format/resolution change, bit-wise data size reduction, and object dropping, is needed. More specifically, a system should generate a transcoded (e.g., scaled and cropped) image to fit the size of the respective client display. The extent of transcoding depends on the type of objects embedded in the image, such as cards, bridges, faces, and so forth. Consider, for example, an image containing embedded text or a human face. If the display size of a client device is smaller than the size of the image, the spatial resolution of the image must be reduced by sub-sampling and/or cropping to fit the client display. In such a case, users very often have difficulty recognizing the text or the human face due to the excessive resolution reduction. Although the importance value may be used to provide information on which part of the image can be cropped, it does not provide a quantified measure of perceptibility indicating the degree of allowable transcoding. For example, the prior art does not provide quantitative information on the allowable compression factor with which the important regions can be compressed while preserving the minimum fidelity that an author or a publisher intended. The InfoPyramid neither provides quantitative information about how much the spatial resolution of the image can be reduced nor ensures that the user will perceive the transcoded image as the author or publisher initially intended.
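The effect of fitting an image to a small display can be worked through numerically. The sketch below assumes uniform scaling without cropping; the function name `fit_scale`, the image and display sizes, and the 30-pixel caption height are hypothetical values chosen for illustration.

```python
from typing import Tuple


def fit_scale(image: Tuple[int, int], display: Tuple[int, int]) -> float:
    """Largest uniform scale (at most 1) at which the image fits the display."""
    image_w, image_h = image
    display_w, display_h = display
    return min(1.0, display_w / image_w, display_h / image_h)


scale = fit_scale((1600, 1200), (320, 240))
text_height_px = 30                       # hypothetical caption height in the original
scaled_text_px = text_height_px * scale   # caption height after transcoding
```

In this example the image is reduced to one fifth of its size, so a 30-pixel caption shrinks to about 6 pixels. Whether that is still legible is exactly the kind of question the importance value alone does not answer.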
5. Visual Rhythm
Fast Construction of Visual Rhythm
Once the digital video is indexed, more manageable and efficient forms of retrieval may be developed based on the index that facilitate storage and retrieval. Generally, the first step for indexing and retrieving of visual data is to temporally segment the input video, that is, to find shot boundaries due to camera shot transitions. The temporally segmented shots can improve the storing and retrieving of visual data if keywords for the shots are also available. Therefore, a fast and accurate automatic shot detector needs to be developed, as well as an automatic text caption detector to annotate the temporally segmented shots with keywords automatically.
Although abrupt scene changes are relatively easy to detect, special effects such as dissolves and wipes are more difficult to identify. Unfortunately, these special effects are normally used to stress the importance of a scene change (from a content point of view), so they are extremely relevant and should not be missed. Wipe detection, however, has received less attention than dissolve detection. For scene change detection, a matching process between two consecutive frames is required. In order to segment a video sequence into shots, a dissimilarity measure between two frames must be defined. This measure must return a high value only when the two frames fall in different shots. Several researchers have used dissimilarity measures based on the luminance or color histogram, the correlogram, or other visual features to match two frames. However, these approaches usually produce many false alarms, and it is very hard for a human to locate exactly the various types of shot transitions (especially dissolves and wipes) in a given video even when the dissimilarity measure between frames is plotted, for example in a 1-D graph where the horizontal axis represents the time of the video sequence and the vertical axis represents the dissimilarity between the histograms of consecutive frames. These approaches also require a high computational load to handle the different shapes, directions and patterns of various wipe effects. Therefore, it is important to develop a tool that enables a human operator to verify efficiently the results of automatic shot detection, which usually contain many falsely detected and missing shots. Visual rhythm satisfies many of the above conditions.
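A histogram-based dissimilarity measure with a fixed threshold, of the general kind discussed above, can be sketched as follows. The frame representation (a flat list of 8-bit luminance values), the four-bin histogram, the threshold of 0.5 and all function names are assumptions made for this illustration.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 4, levels: int = 256) -> List[float]:
    """Normalized luminance histogram of a frame given as a flat pixel list."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // levels] += 1
    return [c / len(frame) for c in counts]


def dissimilarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Half the L1 distance between histograms: 0 = identical, 1 = disjoint."""
    ha, hb = histogram(a), histogram(b)
    return 0.5 * sum(abs(x - y) for x, y in zip(ha, hb))


def detect_cuts(frames: List[Sequence[int]], threshold: float = 0.5) -> List[int]:
    """Indices i at which a cut is declared between frame i-1 and frame i."""
    return [i for i in range(1, len(frames))
            if dissimilarity(frames[i - 1], frames[i]) > threshold]


dark, dark2 = [10, 20, 30, 40], [12, 22, 28, 41]
bright = [200, 210, 220, 230]
abrupt = [dark, dark2, bright, bright]      # hard cut before index 2
gradual = [[10, 10, 10, 10], [10, 10, 10, 200], [10, 10, 200, 200],
           [10, 200, 200, 200], [200, 200, 200, 200]]  # scene replaced bit by bit

cuts_abrupt = detect_cuts(abrupt)
cuts_gradual = detect_cuts(gradual)
```

The abrupt change is found at index 2, but the gradual transition produces a dissimilarity of only 0.25 at every step and is missed entirely, which illustrates why gradual effects are harder to detect with a frame-to-frame measure.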
Visual rhythm contains distinctive patterns or visual features for many types of video editing effects. In particular, all wipe-like effects manifest as visually distinguishable lines or curves on the visual rhythm, which can be obtained with very little computational time. This enables a human to verify automatically detected shots easily, without actually playing the whole frame sequence, and so to minimize or possibly eliminate false as well as missing shots. Visual rhythm also contains visual features that are readily available for detecting caption text. See, H. Kim, J. Lee and S. M. Song, “An efficient graphical shot verifier incorporating visual rhythm”, in Proceedings of IEEE International Conference on Multimedia Computing and Systems, pp. 827-834, June, 1999.
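How a wipe becomes a visible line can be shown with a toy construction. The sketch below assumes that a visual rhythm is built by sampling pixels along the main diagonal of each frame and placing the samples of successive frames side by side as columns; this sampling scheme is an assumption of the example, since the text above does not define the construction, and the function names are hypothetical.

```python
from typing import List

Frame = List[List[int]]  # rows of luminance values


def diagonal_samples(frame: Frame, length: int) -> List[int]:
    """Sample `length` pixels along the main diagonal of a frame."""
    height, width = len(frame), len(frame[0])
    steps = max(length - 1, 1)
    return [frame[k * (height - 1) // steps][k * (width - 1) // steps]
            for k in range(length)]


def visual_rhythm(frames: List[Frame], length: int) -> List[List[int]]:
    """2-D image in which column t holds the diagonal samples of frame t."""
    columns = [diagonal_samples(f, length) for f in frames]
    return [[col[k] for col in columns] for k in range(length)]  # transpose


def wipe_frame(t: int, size: int = 4) -> Frame:
    """Frame t of a left-to-right wipe: new scene (9) replaces old scene (0)."""
    return [[9 if c < t else 0 for c in range(size)] for _ in range(size)]


rhythm = visual_rhythm([wipe_frame(t) for t in range(5)], length=4)
```

For this five-frame wipe the resulting rows are `[0, 9, 9, 9, 9]`, `[0, 0, 9, 9, 9]`, `[0, 0, 0, 9, 9]` and `[0, 0, 0, 0, 9]`: the boundary between the old and new scene appears as a slanted line, and each frame contributes only one column of samples rather than its full pixel array.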
Detecting Text in Video and Graphic Images
As contents become readily available on wide area networks such as the Internet, archiving, searching, indexing and locating desired content in large volumes of multimedia containing image and video, in addition to the text information, will become even more difficult. One important source of information about image and video is the text contained therein. The video can be easily indexed if access to this textual information content is available. The text provides clear semantics of video and are extremely useful in deducing the contents of video.
There are many methods for segmenting and recognizing text in printed documents. Current video research tackles the text caption recognition problem as a series of sub-problems: (a) identifying the existence and location of text captions in a complex background; (b) segmenting text regions; and (c) post-processing the text regions for recognition using standard optical character recognition (OCR). Most current research focuses on sub-problems (a) and (b) in the raw spatial domain, with a few methods that can be extended to compressed domain processing.
A large number of methods have been studied extensively in recent years to detect text frames in uncompressed images and video. Ohya et al. performed character extraction through local thresholding and detected character candidate regions by evaluating gray level differences between adjacent regions. See, J. Ohya, A. Shio and S. Akamatsu, “Recognizing Characters in Scene Image,” in IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 16, pp. 214-224. Hauptmann and Smith used the spatial context of text and the high contrast of text regions in scene images to merge large numbers of horizontal and vertical edges in spatial proximity to detect text. See, A. Hauptmann, M. Smith, “Text, Speech, and Vision for Video Segmentation: The Informedia Project,” in AAAI Symposium on Computational Models for Integrating Language and Vision, 1995. Shim et al. introduced a generalized region labeling algorithm to find homogeneous regions for text extraction. See, J. Shim, C. Dorai and M. Smith, “Automatic Text Extraction from Video for Content-Based Annotation and Retrieval,” in Proc. ICPR, pp. 618-620, 1998. Manmatha presented an algorithm that detects and segments text as regions of distinctive texture, using a pyramid technique to handle text fonts of different sizes. See, W. Manmatha, “Finding Text in Images,” in Proc. of ACM Int'l Conf. on Digital Libraries, pp. 3-12. Lienhart and Stuber provided a split-and-merge algorithm, based on characteristics of artificial text, to segment text. See, R. Lienhart, “Automatic Text Recognition for Video Indexing,” in Proc. of ACM MM, pp. 11-20. Doermann and Kia used wavelet analysis and employed a multi-frame coherence approach to cluster edges into rectangular shapes. See, L. Doermann, O. Kia, “Automatic Text Detection and Tracking in Digital Video,” in IEEE Trans. on Image Processing, Vol. 9, pp. 147-156. Sato et al. adopted a multi-frame integration technique to separate static text from a moving background. See, T. Sato, T. Kanade and S. Satoh, “Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions,” in Multimedia Systems, Vol. 7, pp. 385-394.
Finally, several compressed domain methods have also been proposed to detect text regions. Yeo and Liu proposed a method that detects text caption events in video by modified scene change detection, but the method cannot handle captions that gradually enter or disappear from frames. See, B. L. Yeo, “Visual Content Highlighting Via Automatic Extraction of Embedded Captions on MPEG Compressed Video,” in SPIE/IS&T Symp. on Electronic Imaging Science and Technology, Vol. 2668, 1996. Zhong et al. examined the horizontal variations of the AC values in the DCT to locate text frames, and examined the vertical intensity variation within the text regions to extract the final text frames. See, Y. Zhong, K. Karu and A. Jain, “Automatic captions localization in compressed video,” in IEEE Trans. on PAMI, 22(4), pp. 385-392. Zhong derived a binarized gradient energy representation directly from the DCT coefficients, subject to constraints on text properties and temporal coherence, to locate text. See, Y. Zhong, “Detection of text captions in compressed domain video,” in Proc. of Multimedia Information Retrieval Workshop, ACM Multimedia'2000, pp. 201-204, November 2000. However, most of the compressed domain methods restrict the detection of text to the I-frames of a video, because it is time-consuming to obtain the AC values in the DCT for inter-frame coded frames.
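By way of illustration only, the use of horizontal AC variation as a cue for text may be sketched as follows. The sketch is not the method of any of the works cited above. It computes the 8x8 DCT directly from pixel values for clarity, whereas a compressed domain method would read the coefficients from the bitstream, and it assumes that summing the magnitudes of the purely horizontal AC coefficients is an acceptable stand-in for "horizontal variation". The function names are hypothetical.

```python
import math
from typing import List, Sequence

N = 8  # block size used by JPEG and MPEG intra coding


def dct_8x8(block: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the 2-D DCT-II of an 8x8 block with orthonormal scaling."""

    def alpha(k: int) -> float:
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):  # vertical frequency index
        for v in range(N):  # horizontal frequency index
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (
                        block[y][x]
                        * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                        * math.cos((2 * x + 1) * v * math.pi / (2 * N))
                    )
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def horizontal_ac_energy(block: Sequence[Sequence[float]]) -> float:
    """Sum the magnitudes of the AC coefficients with zero vertical frequency."""
    coeffs = dct_8x8(block)
    return sum(abs(coeffs[0][v]) for v in range(1, N))
```

In this sketch, a block containing vertical strokes, as character strokes in a horizontal text line often are, yields a large value, while a flat block or a block that varies only from row to row yields a value near zero. Blocks whose value exceeds a chosen threshold could then be treated as text candidates for further verification.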
There is, therefore, a need in the art for a method and system that will enable the tagging of multimedia images for indexing, editing, searching and retrieving. There is also a need in the art to enable the indexing of textual information that is embedded in graphical images or other multimedia data so that the text in the image can also be tagged, indexed, searched and retrieved, as is other textual information. Further, there is also a need in the art for editing multimedia data for display, indexing, and searching in ways the prior art does not provide.