Many consumers currently have access to vast collections of multimedia content, such as audio and video files, through wide area networks, such as the Internet, and through digital media centers. However, there is a scarcity of tools that consumers can use to easily manage and search such collections. In particular, performing content-based indexing of multimedia content, and then retrieving content based on the generated index, is an ongoing challenge.
Most multimedia content is generated by taking a set of source tracks (such as audio, video or text tracks), applying a set of rendering transforms such as special effects, mixing, fading and text overlay, and then saving the resulting rendered content to a single output multimedia file. The output multimedia file is currently in one of a variety of container formats. Some popular container formats include ASF, MPG, MP3, MP4, WAV and OGG formats. Most of these container file formats support the inclusion of multiple coded streams, including video streams, audio streams and text (e.g., subtitle) streams.
Some current approaches to performing content-based indexing and retrieval involve the use of speech recognition, audio scene analysis and video scene analysis. Speech recognition is employed in an attempt to generate some type of transcription of speech in audio files that can be used for indexing. Audio scene analysis involves analyzing audio content of a multimedia file (other than speech or including speech) to generate scene descriptions that can be used in indexing. Video scene analysis involves employing various types of recognition algorithms to recognize video components that can be used in indexing.
Although the accuracy of each of these technologies is slowly approaching levels that are considered useful in indexing, widespread use of these technologies has been hindered for a number of reasons. The high acoustic and visual complexity of many multimedia files can result in poor index quality. For instance, speech recognition error rates are much higher when recognizing audio files that have background music or sound effects added to them.
The large amount of computing required to perform content-based indexing has also been problematic. Currently, the processing power required is at least an order of magnitude higher than the processing required for text-based indexing. This has hindered searching for speech content using speech recognition in Internet search engines, since the processing power required to index all the audio and video content on the Internet is enormous. It has also limited the use of similar technologies for desktop search functions, since execution of such functions drains CPU processing availability and interferes with the user's ability to use the desktop computer.
Similarly, this type of indexing has conventionally been employed on encoded multimedia files. When indexing is performed at that stage in production of the multimedia content, a relatively large amount of useful information is lost before the indexing analysis is performed. For instance, video compression and audio compression are conventionally performed on multimedia files and lower the quality of the files. This lower quality makes it difficult to perform content-based analysis for indexing.
As a result of these problems, a common current approach to indexing multimedia files is to simply index supporting metadata. For instance, some current systems index metadata such as file names and any other accompanying textual descriptions found in the metadata. Of course, this type of metadata indexing is very limited and thus hinders the quality of the index generated and the quality of content retrieval performed by searching against the index.
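By way of illustration only, the metadata-only approach described above can be sketched as a simple inverted index built from file names and accompanying descriptions. The function names, the tokenization rule and the sample files below are hypothetical and are not drawn from any particular system; the sketch is intended only to show why such an index cannot answer queries about what is actually spoken or shown in a file.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_metadata_index(files):
    """Build an inverted index from metadata only.

    `files` maps a file name to its textual description (hypothetical layout).
    The media content itself is never examined.
    """
    index = defaultdict(set)
    for name, description in files.items():
        for token in tokenize(name) + tokenize(description):
            index[token].add(name)
    return index


def search(index, query):
    """Return the files whose metadata contains every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical sample collection: the second file has no description at all.
files = {
    "vacation_2004.mpg": "Family trip to the beach",
    "meeting.wav": "",
}
index = build_metadata_index(files)
```

In this sketch, a query for "beach" finds the first file because the word appears in its description, whereas a query for a word that is only spoken in the audio of "meeting.wav" returns nothing, since the index never sees the content of the file.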
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.