Inexpensive computer and networking technologies have made large quantities of digital content available to Internet and mobile network users, resulting in information overload. As a result, users have access to much more information and entertainment than they can consistently and reliably locate, even via large-scale, centralized public search engines.
Concurrently, significant practical and commercial value has been provided by text and data search technologies, the goal of which is to identify the information of greatest utility to a user within a given content collection, such as the information that is created and managed by large-scale publicly available internet search engines.
The resulting proliferation and commoditization of information search and retrieval technologies have created an increasing number of proprietary commercial data, media and text collections, independently indexed and maintained by content sources. These content sources have limited economic incentive to make their digital content fully accessible for indexing by public search engines, and the public search engines derive more economic benefit from having these sources sign on as advertisers than from providing their users with direct access to the actual content.
Most contemporary search engines are designed to pre-index a collection of resources (e.g., documents, images, web sites), then, in response to a query, examine collections on one computer or a group of computers for content that satisfies the query and return an ordered list of possible matches to the user as a results set. Result item metadata indicating relevance ranking, meaning how closely the content matches the query, may be explicitly returned or may be given implicitly in the order of items in the results set, usually with the most relevant item at the top of the list. Rankings may be based on a numerical similarity scoring value or one of many possible metrics previously computed against the content and stored with the full-text or database index or indexes by the content publisher.
Search engine query and indexing architectures fall into at least three types: centralized indexing, metasearch, and federated search engines. Each type may be used to conduct searches against different types of content collections. For example, centralized indexes may be used to facilitate searches over fully accessible, homogeneous content, such as is found in single enterprise content management systems or the plethora of publicly available, internet-enabled websites.
A metasearch engine may combine results from several external search engines or database indexes. It has colloquially come to mean a search across collections with homogeneous, textual content collection indexes, e.g. multiple internet search engines or bibliographic databases.
A federated search may also combine results from more than one search, with each search typically being conducted over heterogeneous content collections. Such collections may be associated with different types of indexing engines (e.g., mixing content from full-text search engines and databases), may come from different information resources such as different file servers, may hold different content types, or may require access to differing proprietary collections, as when searching multiple sports sites including sports news, sports apparel, and sports team merchandise.
For a metasearch or federated search to be maximally precise, it should find the resources that score highest with respect to the metacollection, not necessarily those that score highest with respect to the individual collections in which they reside. For example, consider a federated search over the combination of two different collections, sports news and technology news. If a query contains the term “computer”, an incorrect implementation would give undue weight to computer-related documents that appear in the sports collection, because a term that is rare within one collection tends to receive an inflated score from that collection's own statistics. The practical impacts of this effect are substantial to the extent that a metacollection is used to cull information from diverse collections, each with a different specialty or focus.
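The effect described above can be illustrated with a minimal sketch, assuming a smoothed inverse document frequency (IDF) weight as the term statistic. The toy documents, the `idf` helper, and the choice of IDF itself are illustrative assumptions and are not drawn from any particular search engine.

```python
import math

def idf(term, docs):
    """Smoothed inverse document frequency of `term` over a list of token lists."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + df)) + 1

# Hypothetical tokenized documents for two collections.
sports = [
    ["team", "wins", "final"],
    ["coach", "praises", "team"],
    ["league", "adopts", "computer", "scoring"],
    ["player", "injury", "update"],
]
tech = [
    ["computer", "chip", "launch"],
    ["new", "computer", "reviewed"],
    ["computer", "security", "flaw"],
    ["phone", "battery", "recall"],
]

local_sports_idf = idf("computer", sports)   # term is rare in the sports collection
local_tech_idf = idf("computer", tech)       # term is common in the technology collection
global_idf = idf("computer", sports + tech)  # single weight over the metacollection

print(local_sports_idf, local_tech_idf, global_idf)
```

With these toy collections, the sports collection assigns "computer" a larger weight than the technology collection does, so merging on local scores would favor the sports document. Computing the weight once over the combined collections gives every document the same term weight.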
In addition to traditional content access via stationary computers, there has been an explosive proliferation of internet access using mobile computing devices such as laptops, personal digital assistants (PDAs), and mobile telephones. This proliferation is markedly changing the nature of content access while content publishers reformat and reorganize their content for mobile access. While a desktop computer user can comfortably search for information, using multiple tries and browsing, mobile computing users are generally limited by small screen and input ergonomics, location-specificity, and their own mobility. Due to these constraints, mobile computing users are less likely to want to receive all possibly relevant results, and more likely to want specific information immediately.
This changing nature of content access plays a large part in increasing the value of information retrieval precision over recall, and new search and retrieval processes emphasize the highest possible precision in the first five to ten entries of the results set. For the same reasons, mobile users also require the shortest path to their desired content. Therefore, search result items should allow the user to directly access the content items of interest rather than providing access to a list of content sources.
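Precision within the first entries of a results set is commonly measured as precision at k. The following is a minimal sketch of that measure alongside recall; the function names, document identifiers, and relevance judgments are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k

def recall(ranked_ids, relevant_ids):
    """Fraction of all relevant items that appear anywhere in the results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids) & set(relevant_ids)) / len(relevant_ids)

# Hypothetical results set and relevance judgments.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

p_at_5 = precision_at_k(ranked, relevant, 5)  # 2 relevant items in the top 5
r = recall(ranked, relevant)                  # 2 of the 3 relevant items returned
```

A mobile-oriented system would tune for a high `p_at_5` even at the cost of a lower `r`.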
Other challenges to federated search functionality may also be present. Different sources may index their content collections using different algorithms or by applying the same algorithms to different sections of text and/or metadata. Thus, ranking statistics calculated locally by each source cannot be compared directly when combining results sets.
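One simple way to make locally calculated scores roughly comparable is to rescale each source's scores to a common range before merging. The sketch below uses min-max normalization, which is only one of several possible approaches and is chosen here for illustration; the source names and raw scores are hypothetical.

```python
def normalize(results):
    """Rescale one source's (item, score) pairs to the range [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If every score is identical, treat all items as equally top-ranked.
    return [(item, (score - lo) / span if span else 1.0) for item, score in results]

def merge(*result_sets):
    """Normalize each results set, then sort the union by normalized score."""
    merged = [pair for results in result_sets for pair in normalize(results)]
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

# Hypothetical sources that score on incompatible scales.
source_a = [("a1", 12.0), ("a2", 9.0), ("a3", 3.0)]
source_b = [("b1", 0.9), ("b2", 0.5), ("b3", 0.4)]

merged = merge(source_a, source_b)
```

Without normalization, every item from `source_a` would outrank every item from `source_b` purely because of the scale of its raw scores. Min-max rescaling removes the scale difference but does not correct for differences in how the scores were computed.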
Different sources may contain overlapping resource collections, which may result in the same content item appearing in the results sets from more than one source. Traditional de-duplication algorithms remove all duplicates based on a metadata field value or set of field values. For example, a news source may remove all duplicate content items that share the same headline, byline, and date values.
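Field-based de-duplication of the kind described above can be sketched as follows. The field names mirror the news example, while the `dedupe` function, the sample items, and the choice to keep the first occurrence of each duplicate are assumptions made for illustration.

```python
def dedupe(items, key_fields=("headline", "byline", "date")):
    """Keep the first item seen for each distinct combination of key fields."""
    seen = set()
    unique = []
    for item in items:
        key = tuple(item.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

# Hypothetical merged results from two overlapping news sources.
results = [
    {"headline": "Team wins final", "byline": "A. Writer", "date": "2008-05-01", "source": "s1"},
    {"headline": "Team wins final", "byline": "A. Writer", "date": "2008-05-01", "source": "s2"},
    {"headline": "Coach resigns", "byline": "B. Writer", "date": "2008-05-02", "source": "s2"},
]

unique_results = dedupe(results)
```

Because the comparison uses only the listed metadata fields, two items with identical key fields are treated as duplicates even when they come from different sources or differ in other fields.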
Various sources may contain similar content but differ in depth of content (extensiveness of the collection) or in response characteristics (latency, percent uptime). These variations can negatively impact the user experience by generating insufficient results or by not responding before system or user-perceived timeouts. Federated searching across multiple content sources improves the chance that the user will get some response to their query within a reasonable time frame.
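The handling of slow or unresponsive sources can be sketched with Python's standard `asyncio` library: all sources are queried concurrently, and only those that answer before a deadline contribute results. The `query_source` coroutine simulates latency with a sleep and is a stand-in for a real network call; the source names, delays, and timeout value are hypothetical.

```python
import asyncio

async def query_source(name, delay, results):
    """Simulated source query; `delay` stands in for network and search latency."""
    await asyncio.sleep(delay)
    return name, results

async def federated_search(sources, timeout):
    """Query all sources concurrently and keep only those that answer in time."""
    tasks = [asyncio.create_task(query_source(*source)) for source in sources]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # abandon sources that missed the deadline
    return dict(task.result() for task in tasks if task in done)

# Hypothetical sources: (name, simulated latency in seconds, results).
sources = [
    ("fast_source", 0.01, ["r1", "r2"]),
    ("slow_source", 5.0, ["r3"]),
]

responses = asyncio.run(federated_search(sources, timeout=0.5))
```

In this sketch the user still receives the results from `fast_source` even though `slow_source` never answers within the deadline.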
Additionally, there may be wide variation in relevance of a content collection to the query. Not all available content sources contain collections sufficiently relevant to warrant inclusion in the metacollection.
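Excluding insufficiently relevant sources before querying can be sketched as a simple filter. The example below scores each source by the fraction of query terms found in a per-source vocabulary summary and keeps sources above a threshold. The overlap measure, the `min_overlap` threshold, and the vocabulary summaries are all illustrative assumptions rather than a prescribed selection method.

```python
def select_sources(query_terms, source_vocab, min_overlap=0.5):
    """Return names of sources whose vocabulary covers enough of the query."""
    query = set(query_terms)
    selected = []
    for name, vocab in source_vocab.items():
        overlap = len(query & vocab) / len(query) if query else 0.0
        if overlap >= min_overlap:
            selected.append(name)
    return selected

# Hypothetical vocabulary summaries describing each source's collection.
source_vocab = {
    "sports_news": {"score", "team", "league", "playoff"},
    "sports_apparel": {"jersey", "team", "shoes"},
    "tech_news": {"computer", "chip", "software"},
}

chosen = select_sources(["team", "playoff"], source_vocab)
```

Here the technology source shares no terms with the query and is left out of the metacollection for this search.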