1. Field of the Invention
This disclosure relates to a system for enabling search of content, and, more particularly, to a system which can enable indexing of, and queries upon, opaque content.
2. Description of the Related Art
Referring to FIG. 1, the World Wide Web (“WWW”) is a distributed database including literally billions of pages accessible through the Internet. Searching and indexing these pages to produce useful results in response to user queries is a constant challenge. A device typically used to search the WWW is a search engine.
A typical prior art search engine 50 is shown in FIG. 1. Pages from the Internet or other source 22 are accessed through the use of a crawler 24. Crawler 24 aggregates documents from source 22 to ensure that these documents are searchable. Many algorithms exist for crawlers, and in most cases these crawlers follow links in known hypertext documents to obtain other documents. The pages retrieved by crawler 24 are stored in a database 36. Thereafter, these documents are indexed by an indexer 26. Indexer 26 builds a searchable index of the documents in database 36. For example, each web page may be broken down into words and the respective locations of each word on the page. Indexer 26 may also analyze the pages and extract textual metadata. The pages are then indexed by the words and/or metadata and their respective locations.
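By way of illustration only, the word-and-location indexing described above can be sketched as a positional inverted index. The sketch below is a minimal example under stated assumptions and is not the implementation of indexer 26: the function name `build_index`, the page identifiers, and the simple word-splitting rule are all hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to {page_id: [positions of that word on the page]}.

    `pages` is a hypothetical dict of page_id -> page text.
    """
    index = defaultdict(dict)
    for page_id, text in pages.items():
        # Assumption: a "word" is any run of letters, digits or underscores.
        words = re.findall(r"\w+", text.lower())
        for position, word in enumerate(words):
            index[word].setdefault(page_id, []).append(position)
    return index


# Hypothetical sample pages standing in for documents held in a database.
pages = {
    "p1": "search engines index web pages",
    "p2": "web crawlers fetch pages",
}
index = build_index(pages)
print(index["web"])
```

A query for a word then reduces to a single dictionary lookup that yields every page containing the word together with the word's locations on that page.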
In use, a user 32 sends a search query to a dispatcher 30. Dispatcher 30 compiles a list of search nodes in search node cluster 28 to execute the query and forwards the query to those selected search nodes. The search nodes in search node cluster 28 search respective parts of the index 34 and return sorted search results, along with a document identifier and a relevance score, to dispatcher 30. Dispatcher 30 merges the received results to produce a final result set, which is displayed to user 32 sorted by relevance score based on a ranking function.
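By way of illustration only, the merging step performed by a dispatcher can be sketched as a merge of already-sorted result lists. The sketch assumes that each search node returns (document identifier, relevance score) pairs sorted by descending score; the function name `merge_results`, the `limit` parameter, and the sample data are hypothetical.

```python
import heapq
from itertools import islice


def merge_results(node_results, limit=10):
    """Merge per-node result lists into one list ordered by descending score.

    Assumption: every list in `node_results` is already sorted by
    descending relevance score, as heapq.merge requires with reverse=True.
    """
    merged = heapq.merge(*node_results, key=lambda hit: hit[1], reverse=True)
    return list(islice(merged, limit))


# Hypothetical results from two search nodes.
node_a = [("d1", 0.9), ("d4", 0.3)]
node_b = [("d2", 0.7), ("d3", 0.5)]
final = merge_results([node_a, node_b])
print(final)
```

Because each node's list is already sorted, the dispatcher does not need to re-sort the combined results; it only needs to interleave them.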
The creation of a comprehensive search engine for Internet multimedia content, including audio, video and photos, has so far been an unachievable task. While search engines for documents and web pages are now very common and have been able to deliver good results, multimedia content does not lend itself to the same techniques used with textual documents and hence remains mostly unindexed and hard to find.
The challenge with multimedia content is that it is very hard to associate multimedia objects with the textual metadata used in indexing. This is unlike textual documents, from which keywords can be extracted and used to index the document for later retrieval. With multimedia content, more often than not, no textual representation of the content is available, and known methods to automatically generate a textual representation (for example, by employing speech-to-text techniques) are very computationally intensive.
Since in many cases the multimedia content on the Internet is enclosed in some textual web page, a possible solution may be to use keywords from that page. However, any given page may enclose many different multimedia objects, and so it is difficult to determine which keywords relate to which object. Other approaches require analysis of the multimedia content itself, such as using computer vision techniques in the case of video and images, or speech recognition in the case of audio, in order to create a textual description of the content. To date, those techniques have been unable to produce a useful multimedia search engine.
Another challenge pertaining to multimedia content relates to its discovery. Unlike the URLs of web pages, the URLs pointing to the actual media are sometimes buried in MACROMEDIA FLASH, JAVASCRIPT, JAVA and other hard-to-analyze code, which makes it virtually impossible to quickly crawl the web and find new media. While a web crawler today discovers new content by simulating a person using a web page and following all the links in the page, crawlers perform this process only by analyzing the HTML (hypertext markup language) code of the page. Even this rapid analysis leaves large portions of the web uncharted and undiscovered because the web grows too fast. Going beyond pure HTML analysis would slow a crawler to the point that it could no longer achieve sufficient coverage.
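By way of illustration only, the limitation of HTML-only link analysis can be sketched as follows. The sketch assumes a crawler that collects only the `href` values of anchor tags; the class name `LinkExtractor`, the function name `extract_links`, and the sample page with its URLs are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, as an HTML-only crawler might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one ordinary link, plus a media URL that appears
# only inside script code.
page = (
    '<a href="/news.html">News</a>'
    '<script>play("http://example.com/clip.flv");</script>'
)
links = extract_links(page)
print(links)
```

In this sketch the ordinary link is found, while the media URL inside the script is never reported, because locating it would require analyzing or executing the script rather than the HTML markup alone.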
Several methods have been attempted to date to extract descriptive textual metadata from multimedia content or to use such readily available textual descriptions (when available). None of them have proven sufficient to build an Internet multimedia search engine.