The rapid growth of the Internet and the popularity of digital image capture devices, such as digital video cameras, digital cameras, and digital video recorders, provide more channels for the user to easily obtain multimedia data. As the user acquires more multimedia data, the difficulty of managing that data also increases.
Although several multimedia data management methods and systems exist, most of them use text or speech-to-text conversion to describe, index, and retrieve the multimedia data. Current multimedia data annotation and search technologies can be categorized into four types: text-based annotation and search, speech-to-text-based annotation and search, graphical analysis search, and speech annotation and search.
The text-based annotation and search method is simple, but it requires lengthy text input and is constrained by the system keywords during annotation and search. U.S. Pat. No. 6,833,865 disclosed an embedded metadata engine in digital capture devices. By adding an image content analysis function to the digital image capture device, extra information related to the contents can be generated automatically and stored with the original image data. However, this patent only addresses dynamically generating annotations for images and does not disclose any method for searching images.
The speech-to-text-based method requires a speech recognition device, which introduces language-related constraints. U.S. Pat. No. 6,397,181 disclosed a method and apparatus for voice annotation and retrieval of multimedia data. Speech input is used for annotation, and a speech recognition technique transforms the speech into text. The text annotation is used to generate a reverse index table. The search also uses speech input, from which a search keyword is generated through speech recognition. The reverse index table is then used to find the matching multimedia data.
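The reverse index table idea described above can be illustrated with a minimal sketch. It is not the method of the cited patent; it assumes that speech recognition has already produced a text annotation for each item, and the names build_reverse_index and search, as well as the sample file names and annotations, are hypothetical.

```python
from collections import defaultdict


def build_reverse_index(annotations):
    """Map each lowercase word to the set of media IDs whose annotation contains it.

    `annotations` is assumed to hold text already produced by speech recognition.
    """
    index = defaultdict(set)
    for media_id, text in annotations.items():
        for word in text.lower().split():
            index[word].add(media_id)
    return index


def search(index, keyword):
    """Return the media IDs annotated with `keyword`, in sorted order."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical recognized annotations, for illustration only.
annotations = {
    "img_001.jpg": "birthday party at home",
    "img_002.jpg": "beach trip with family",
    "vid_003.mp4": "family birthday dinner",
}
index = build_reverse_index(annotations)
print(search(index, "birthday"))  # ['img_001.jpg', 'vid_003.mp4']
```

Because both the annotation and the query pass through a recognizer tied to one language, the keywords stored in such a table are only as good as the recognizer's vocabulary, which is the language-related constraint noted above.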
U.S. Pat. No. 6,499,016 disclosed a method for automatically storing and presenting digital images using a speech-based command language. The speech-to-text approach is used for annotation, and the text is used for search. The user can annotate a picture by speech in real time when using a digital camera. Through speech input consisting of a plurality of commands and statements, annotations such as time and place can be sent to the server together with the image. The server uses speech recognition to transform the speech into text for storage. Based on the text annotation, the user can use a keyword to dynamically generate a photo album for viewing.
U.S. Pat. No. 6,813,618 disclosed a system and method for acquisition of related graphical material in a digital graphics album. The patent uses text search to achieve the object of finding a graphic with another graphic. The user can search the network to find the related images.
To use graphical analysis in search, the system requires graphical analysis capability. Although the user does not need to annotate each picture, the user can only search for graphics and must first find a graphic to use as the basis for the search; moreover, it is difficult to analyze graphical contents precisely. The article “An Active Learning Framework for Content-Based Information Retrieval,” IEEE Transactions on Multimedia, Vol. 4, Issue 2, June 2002, pp. 260-268, disclosed a content-based information retrieval technique that constructs an attribute tree for marking the images.
There are several methods of voice search, including directly comparing the search condition with the annotated original voice data, or using voice recognition to transform the voice into N-gram combinations, constructing an index vector from them, and then performing voice indexing. The former requires a large amount of comparison time when the data volume is large, and the latter is restricted by language-dependent characteristics.
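The N-gram index vector approach can be sketched as follows. This is an illustrative assumption rather than any cited system: it supposes a recognizer has already turned each voice annotation into a sequence of recognized units (for example, phoneme-like symbols), counts the bigrams in each sequence, and ranks items by cosine similarity to the query. The names ngram_vector, cosine, and rank, and the sample unit sequences, are hypothetical.

```python
import math
from collections import Counter


def ngram_vector(units, n=2):
    """Count the n-grams in a sequence of recognized units (a sparse index vector)."""
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank(query_units, annotated, n=2):
    """Return media IDs with a nonzero score, best match first."""
    q = ngram_vector(query_units, n)
    scores = [(cosine(q, ngram_vector(u, n)), mid) for mid, u in annotated.items()]
    ordered = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [mid for score, mid in ordered if score > 0]


# Hypothetical recognizer output: one unit sequence per voice annotation.
annotated = {
    "img_001.jpg": ["b", "er", "th", "d", "ey"],
    "img_002.jpg": ["b", "iy", "ch"],
}
query = ["b", "er", "th", "d", "ey"]
print(rank(query, annotated))
```

Comparing fixed-size vectors is faster than matching raw audio against every annotation, but the units themselves come from a recognizer trained for a particular language, which is the language dependence noted above.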
Although all four types of multimedia data annotation and retrieval technologies described above are in use, each has its respective issues as stated above, and all are language-dependent; the user is therefore restricted to certain languages or voices.
It is, therefore, imperative to provide a simplified data management method, a fast mechanism to search for multimedia data, and a voice- and language-independent indexing and searching method.