Web search engines manage increasing amounts of structured data, which is typically represented in the form of objects described by a set of attributes and related to other objects through relations. The objects may be associated with web pages (typically, a search engine may extract a number of objects from a web page), or the search engine may handle only object data, without knowledge of the provenance of the objects. A typical task in these settings is to retrieve an object or a set of relevant objects in response to user queries of some form (keyword queries, structured queries or a hybrid).
Search Engine Background. A search engine allows client devices to search for files of interest in response to queries. The search engine may include a crawler component, an indexer component, an index data store, a search component, a ranking component, a cache, a profile data store to provide persistent storage for one or more user profiles, a logon component, a profile builder, and an application program interface (“API”) that may be used to execute functions for storage, retrieval and manipulation of data in the index data store and profile data store. The search engine and its constituent components may be deployed across the network in a distributed manner whereby key components are duplicated and strategically placed throughout the network for increased performance, e.g., close to the edges of the network.
The term “Boolean search engine” refers to the use of Boolean-style syntax in a query by a user. A Boolean search engine allows the use of Boolean operators (such as AND, OR, NOT, and XOR) to specify the logical relationship between search terms. For example, the search query “college OR university” may return all results with either “college” or “university” or both, while the search query “college XOR university” may return only those results that have either “college” or “university” but not both.
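The operator semantics described above can be sketched with set operations over a toy inverted index. This is an illustrative sketch only: the index contents, the document identifiers, and the helper names (`postings`, `boolean_or`, `boolean_and`, `boolean_xor`, `boolean_not`) are assumptions made for the example and do not describe any particular search engine.

```python
# Hypothetical inverted index: each term maps to the set of ids of the
# documents that contain it. All data here is invented for illustration.
INDEX = {
    "college": {1, 2, 4},
    "university": {2, 3},
}
ALL_DOCS = {1, 2, 3, 4, 5}  # assumed universe of indexed documents


def postings(term):
    """Return the set of document ids containing `term` (empty if unknown)."""
    return INDEX.get(term, set())


def boolean_or(a, b):
    """Documents containing either term, or both (set union)."""
    return postings(a) | postings(b)


def boolean_and(a, b):
    """Documents containing both terms (set intersection)."""
    return postings(a) & postings(b)


def boolean_xor(a, b):
    """Documents containing exactly one of the terms (symmetric difference)."""
    return postings(a) ^ postings(b)


def boolean_not(a):
    """Documents that do not contain the term (complement in ALL_DOCS)."""
    return ALL_DOCS - postings(a)


or_hits = boolean_or("college", "university")
xor_hits = boolean_xor("college", "university")
```

With this data, document 2 contains both terms, so it appears in `or_hits` but is excluded from `xor_hits`.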
In contrast to Boolean search, “semantic search” is a search technique intended to improve the relevance of search results by incorporating an understanding of the contextual meaning of search terms as well as the user's intent. Rather than using Boolean-style syntax to specify the relationship between search terms, semantic search attempts to infer the meaning of each individual word in a natural language search query. Semantic search applies “semantics” (the science of meaning in language) to retrieve information from richly structured data sources such as ontologies.
The search results located during a search of an index performed in response to a query received from a user are then generally ranked. The index has a plurality of index entries, wherein each index entry has a weight. The query may include a plurality of query terms, wherein each query term corresponds to an index entry. Search results are sometimes ranked by scoring each located record according to the number of times portions of information corresponding to each query term occur in each record and the weight of each index entry corresponding to each occurring query term. Proximity of query terms within located records, and/or context or “semantic” information from the Semantic Web (stored with languages such as Resource Description Framework (RDF) and RDF Schema (RDFS), or other variants of Extensible Markup Language (XML) or the like) may also be considered in weighting the score. The score and an identifier of each located record are then stored in a respective entry of a ranking list.
The entries of the ranking list are ordered according to the scores. The information associated with each located record may then be provided to the user in the order of the ranking list. For example, the provided information associated with each located record may be the score of each located record and/or the identifier of each located record.
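The scoring and ordering described above can be sketched as follows. This is a minimal sketch under stated assumptions: the score is taken to be the sum, over query terms, of the occurrence count multiplied by the weight of the corresponding index entry, and proximity and semantic signals are omitted. The function name `rank`, the whitespace tokenization, the tie-breaking rule, and the sample records and weights are all hypothetical.

```python
from collections import Counter


def rank(records, query_terms, index_weights):
    """Return a ranking list of (score, record_id) entries, best first.

    Assumed scoring rule: for each query term, multiply the number of
    occurrences in the record by the weight of the matching index entry,
    then sum over all query terms.
    """
    ranking_list = []
    for record_id, text in records.items():
        counts = Counter(text.lower().split())  # naive whitespace tokenizer
        score = sum(counts[term] * index_weights.get(term, 0.0)
                    for term in query_terms)
        if score > 0:  # keep only records located by the query
            ranking_list.append((score, record_id))
    # Order by descending score; ties are broken by record id (an assumption).
    ranking_list.sort(key=lambda entry: (-entry[0], entry[1]))
    return ranking_list


# Invented sample data.
records = {
    "r1": "college town college radio",
    "r2": "university college",
    "r3": "weather report",
}
weights = {"college": 1.0, "university": 2.0}

ranking = rank(records, ["college", "university"], weights)
```

Here `r2` scores 3.0 (one occurrence of each term) and `r1` scores 2.0 (two occurrences of “college”), so `r2` is listed first; `r3` matches no query term and is not placed in the ranking list.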
User-editable search interfaces have existed before; see, e.g., Google's SearchWiki or Mahalo's user-written search result pages. These examples relate to document search, where de-duplication removes only documents that are exactly the same, and this can easily be fully automated. Editing search results consists of annotating or reordering search results, or writing an entirely new results page. In object search, de-duplication means identifying and merging (or removing) duplicate results that relate to the same real-world object, which is more difficult, and practical implementations using machine learning require training data created by humans. In object search, the content of search results is structured and can be displayed in a way that makes it feasible for the user to edit parts of the Web document. Existing object search engines typically do not allow user feedback. Some search engines allow users to remove sources from object search results. When a source is removed, all the results from that source are removed. These actions are not saved by the engine and are not applied to future searches.
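To make the difference from exact-match document de-duplication concrete, the following sketch merges object results that appear to describe the same real-world object. It is a naive rule-based stand-in, not the machine-learning approach mentioned above: it assumes that two listings are duplicates when their normalized name and phone number agree. The field names (`name`, `phone`, `source`), the helper names, and the sample listings are hypothetical.

```python
import re


def normalize(value):
    """Lowercase the value and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def merge_duplicates(listings):
    """Merge listings that share a normalized (name, phone) key.

    The first listing seen for a key supplies the displayed name and phone;
    later duplicates only contribute their source to the merged result.
    """
    merged = {}
    for item in listings:
        key = (normalize(item["name"]), normalize(item["phone"]))
        if key not in merged:
            merged[key] = {
                "name": item["name"],
                "phone": item["phone"],
                "sources": [item["source"]],
            }
        elif item["source"] not in merged[key]["sources"]:
            merged[key]["sources"].append(item["source"])
    return list(merged.values())


# Invented sample data: the first two listings differ textually but
# are assumed to refer to the same business.
listings = [
    {"name": "Joe's Pizza", "phone": "(555) 010-0000", "source": "provider_a"},
    {"name": "JOES PIZZA", "phone": "555-010-0000", "source": "provider_b"},
    {"name": "Main St Bakery", "phone": "555-010-0001", "source": "provider_a"},
]

deduped = merge_duplicates(listings)
```

An exact-match comparison would keep all three listings, whereas the normalized key collapses the first two into one result that records both sources. A rule this simple misses duplicates with misspelled names or differing phone numbers, which is one reason the task is harder than document de-duplication.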
An example of a vertical search engine is Yahoo! Local, which searches over a curated collection of structured data representing business listings sourced from multiple trusted data providers. Editorial curation is the term given to the human filtering and organizing of content on web sites. These vertical search engines do not, however, have as much coverage as Web search. See FIG. 1. An example of a structured search can be seen in Yahoo! Search when the user clicks on the “Local Business Listings” facet on the left bar after generating a search. This structured search is powered by information extracted by Yahoo! The search results shown in FIG. 1 are typical in that they feature duplicate listings 102-108.