As the amount of available content grows, so does the need for effective search and retrieval systems. Search and retrieval systems (“search engines”) typically operate by receiving (e.g., from a user) a textual query of terms and/or phrases. The search engine compares the search terms (“keywords”) against an index created from a multitude of content items (e.g., text files, image files such as .jpg and .gif files, video files such as .mpg, .swf, and .avi files, web pages, or other items) and returns an indication of the most relevant content items. The classic example of a search engine is an Internet search engine that uses user-provided keywords to find relevant web pages and returns a listing of the most relevant ones. As is known to one skilled in the art, a web page may comprise a textual content item (e.g., a base hypertext markup language (HTML) document) and other content items (e.g., image files, movie files, sound files) presented by a web browser as a result of processing the HTML document and items referenced therein.
As the amount of digital data increases, search engines are being deployed not only for Internet search, but also for proprietary, personal, or special-purpose databases, such as personal multimedia archives, user-generated content sites, proprietary data stores, workplace databases, and others. For example, personal computers may host search engines to find content items on the entire computer or in special-purpose archives (e.g., a personal music or video collection). User-generated content sites, which host content items, may likewise provide search engines. Search engines process content items by creating indices. Once created, indices allow a search engine to map search terms to relevant content items without the need to rescan all of the content items on each search. Therefore, the quality of search results is heavily dependent on the quality of the index generated. Accordingly, more effective techniques for generating accurate search indices are needed.
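By way of illustration only, the role of an index described above can be sketched with a simple inverted index, which maps each term to the set of content items containing it. The sketch below is a minimal example under stated assumptions and not a description of any particular search engine: the function names (`tokenize`, `build_index`, `search`), the sample file names, and the short text descriptions attached to each content item are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(items):
    """Scan every content item once and map each term to the items containing it.

    `items` maps a content item identifier to text associated with that item
    (an assumption made for this sketch).
    """
    index = defaultdict(set)
    for item_id, text in items.items():
        for term in tokenize(text):
            index[term].add(item_id)
    return index


def search(index, query):
    """Return the items containing every query term, using only the index."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Look up each term's item set; an unknown term yields an empty set.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical content items with hypothetical associated text.
items = {
    "vacation.mpg": "family beach vacation video",
    "report.html": "quarterly sales report",
    "beach.jpg": "sunset at the beach",
}

index = build_index(items)            # content items are scanned here, once
beach_hits = search(index, "beach")   # later searches consult only the index
both_hits = search(index, "beach vacation")
```

In this sketch the content items are read only when `build_index` runs; each call to `search` performs term lookups against the index and never rescans the items. The results can be no better than the index itself: a term missing from an item's associated text cannot lead a search to that item.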