Information retrieval has traditionally required the end user to formulate a query using Boolean operators, either in a query language or via a graphical user interface. Execution of the query provides a search result that is a set of matching documents. This result set has generally been a classical crisp set of which a particular document is either a member or not a member.
Throughout this disclosure the term “document” is used for any searchable object; it could hence mean, for instance, a textual document, a document represented in XML, HTML, SGML, or an office format, a database object such as a record, table, view, or query, or a multimedia object.
A query Q is applied to a document set D (the search space) under the assumption that a certain subset of D, namely P, is an appropriate result for the query Q. The recall is the fraction of P returned in the result set R, i.e. |R ∩ P|/|P|. The precision is the fraction of R that is relevant, i.e. |R ∩ P|/|R|. Typical search systems have precision-recall curves showing a trade-off between precision and recall as depicted graphically in FIG. 1, which shows how increasing precision lowers recall and vice versa. Great precision is only achieved with poor recall and vice versa. The search system is tuned to offer acceptable precision and recall.
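As a minimal sketch of the two definitions above, the following computes precision and recall for a crisp result set. The function name and the document identifiers are hypothetical and chosen only for illustration.

```python
def precision_recall(result_set, relevant_set):
    """Return (precision, recall) for a crisp result set R and relevant set P."""
    R, P = set(result_set), set(relevant_set)
    hits = len(R & P)  # |R ∩ P|
    precision = hits / len(R) if R else 0.0  # |R ∩ P| / |R|
    recall = hits / len(P) if P else 0.0     # |R ∩ P| / |P|
    return precision, recall


# Hypothetical document identifiers, for illustration only.
R = {"d1", "d2", "d3", "d4"}  # documents returned for the query
P = {"d2", "d4", "d5"}        # documents that are appropriate for the query
precision, recall = precision_recall(R, P)
```

Here two of the four returned documents are relevant, and two of the three relevant documents are returned.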
However, with huge content volumes where many documents share the same keywords, the result sets become too large to be efficiently presented to a human user. More recently, information retrieval systems calculate a relevance score as a function of the quality of the match between the query and the document, and also include a priori probabilities that the document is valid for any query (e.g. PageRank from Google). The search result is presented ranked according to this relevance score, showing the details of the documents with the highest relevance scores first, usually in hyperlinked pages of 10-20 documents. The concepts of recall and precision are not as clear-cut as for the crisp result sets above, but they still apply.
Recall refers to getting relevant documents included in the search result, preferably at the top of the first result page. Precision involves not having irrelevant documents on the first result page.
The user interacts with an information retrieval system (a search engine) by analyzing the search result, viewing result documents, and reformulating the query. The search result is often too general, as the user does not generally know the extent of the collection of documents in the system and thus does not make the query specific enough (i.e. having poor precision). A common query reformulation is to make a query refinement, i.e. selecting a subset of the original search result set in order to improve the precision.
Very recently, information retrieval systems have included the concept of result set navigation. As examples of published prior art, see for instance U.S. Pat. Nos. 7,035,864 and 7,062,483, assigned to Endeca Technologies, Inc., and NO patent application No. 20052215, assigned to Fast Search & Transfer ASA. A document is associated with multiple attributes (e.g. price, weight, keywords) where each attribute has none, one, or in general multiple values.
The attribute value distributions are presented as a frequency histogram sorted either on frequency or on value. A navigator is a graphical user interface object that presents the frequency histogram for a given attribute, allowing the user to analyze the result set as well as select an attribute-value pair as a query refinement in a single click. The refinement is instantly executed, and the new result set is presented together with new navigators on the new result set. For example, a search for “skiing” may include a “Country” navigator on the “Country” document attribute (metadata). This navigator contains a value “Norway”, suggesting that there is a substantial number of documents in the result set for “skiing” that are associated with Norway. When the user selects the “Norway” option in the navigator, the system presents the subset of the “skiing” result set that is further limited to documents associated with Norway. In FIG. 2 the query 201 gives a result set 202 together with navigators on document-level metadata 203-205. In the example, a search 201 for surname “Thorsen” and first name “Torstein” allows the user to refine the first name among those in the result set (204) and to constrain the search to a part of the country (203). For each of the refinements, the size of the result set that would be obtained if the refinement were applied is shown.
Navigation includes many concepts of data mining. Traditional data mining is on a static data set. With navigation, data mining is employed on a dynamic per-query result set. Each document attribute represents a dimension/facet in terms of data mining terminology.
Formally, given a query Q, a navigator N on the attribute a having values {v} across a set of documents D has N(Q,a,v) instances of value v. The set of values for attribute a in document d is d(a).
N(Q,a,v) = |{d ∈ D : Q matches d and v ∈ d(a)}|
Both the attribute values v and the document hit count N(Q,a,v) are presented, typically sorted either on the values or document hit count.
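The definition of N(Q,a,v) can be sketched as follows. The representation of documents as dictionaries, the whitespace-based `matches` predicate, and the helper names `navigator` and `refine` are assumptions made for illustration; they are not taken from any particular search system.

```python
from collections import Counter


def navigator(documents, matches, attribute):
    """Count, per value v, the documents d that match Q and have v in d(a)."""
    counts = Counter()
    for d in documents:
        if matches(d):
            # d(a) is treated as a set: a value counts once per document.
            for v in set(d.get(attribute, ())):
                counts[v] += 1
    return counts


def refine(matches, attribute, value):
    """Return a new query predicate: the old query AND attribute == value."""
    return lambda d: matches(d) and value in d.get(attribute, ())


# Hypothetical documents, for illustration only.
docs = [
    {"text": "skiing in the mountains", "country": ["Norway"]},
    {"text": "skiing holidays", "country": ["Norway", "Sweden"]},
    {"text": "sailing trips", "country": ["Norway"]},
    {"text": "cross-country skiing", "country": []},
]
query = lambda d: "skiing" in d["text"].split()

country_nav = navigator(docs, query, "country")
by_frequency = country_nav.most_common()   # sorted on document hit count
by_value = sorted(country_nav.items())     # sorted on attribute value

# Selecting "Norway" in the navigator yields a refined query.
refined = refine(query, "country", "Norway")
refined_hits = [d for d in docs if refined(d)]
```

Selecting a value produces a new predicate, so new navigators can be computed on the refined result set in the same way.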
Navigation is the application of result set aggregation in the context of a query where a result set summary is presented to the user as well as a query modifier that is incorporated in the query when the user selects a particular object in the summary. The presentation is a view of the result set along an attribute dimension and may include a quality indicator in addition to the attribute value, where the quality usually is the number of documents for a given attribute value or attribute value range.
The ideas discussed below incorporate both aggregation in the general case and specifically the application to navigation. The aggregation can be presented without necessarily linking it to query refinements, or it may be the basis for statistical analysis without even being presented. Also, the information retrieval system may choose to automatically select such query refinements based on an analysis of the query, the result set, and the navigators/aggregations associated with the result set.
The document-global attributes (metadata) are either explicit in the document or structured database records or automatically discovered attributes in the unstructured content of a document using techniques from the field of information extraction. In hierarchical structured content (e.g. from XML), sub-document elements can be explicitly associated with attributes.
Automatically extracted information can be associated at the global document level and at the contextual (sub-document) level, e.g. at sentence elements. The sub-document elements can be explicit in the content (e.g. paragraphs in HTML) or automatically detected (e.g. sentence detection). The distinction between attributes and elements is with respect to the visible content flow: the content of elements is visible whereas the attributes are invisible metadata on the elements. For example, the content of sentence elements is visible including entity sub-elements (e.g. person names), but the sentiment attribute on a sentence element should not interfere with the content flow, e.g. phrase search across sentences. Likewise, an entity element contains the original content while an attribute contains the normalized version of the content that is used for search and analysis. For example, the text “yesterday” is wrapped in a date entity with an attribute containing the concrete date value normalized to the ISO 8601 standard as derived from the context.
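The “yesterday” example can be sketched as below. This is only an illustration under stated assumptions: the small lookup table of relative expressions, the function name, the dictionary layout of the entity, and the reference date are all hypothetical, and real information extraction would derive the reference date from the document context.

```python
from datetime import date, timedelta


def normalize_relative_date(surface_text, reference_date):
    """Map a relative date expression to an ISO 8601 calendar date string."""
    # Hypothetical, deliberately tiny lookup table of relative expressions.
    offsets = {"yesterday": -1, "today": 0, "tomorrow": 1}
    days = offsets[surface_text.lower()]
    return (reference_date + timedelta(days=days)).isoformat()


# The element keeps the original, visible content; the attribute carries
# the normalized value used for search and analysis.
entity = {
    "element": "date",
    "content": "yesterday",
    "attributes": {
        # Assumed reference date (e.g. the publication date of the document).
        "normalized": normalize_relative_date("yesterday", date(2006, 3, 1)),
    },
}
```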
The present applicant, viz. Fast Search & Transfer ASA, has recently introduced contextual navigation on sub-document elements, e.g. paragraphs and sentences, as disclosed in the above-mentioned NO patent application No. 20052215. Entities are extracted from e.g. sentences and marked up as sub-elements of the sentence elements or as attributes on the sentence elements. The search system allows e.g. specific sentences to be selected by a query and navigation on the sentence sub-elements/attributes. For example, a query may select sentences containing “Bill Clinton” in a “person_name” sub-element and present a navigator on the “date” sub-element of those sentences. Such navigators are found to be much more relevant than equivalent document-level navigators on entities extracted from unstructured natural language content.
FIG. 3 shows aggregations of persons associated with the query “soccer” at the document X01, paragraph X02, and sentence level X03, clearly showing semantically more correct aggregations at the paragraph and sentence contexts than at the document level.
Sometimes a user will specify a detailed query, and the result set will contain too few (or no) documents (i.e. poor recall). Some search systems allow the user to simply increase the recall, e.g. by enabling lemmatization or stemming that enables matching of alternative surface forms, i.e. matching different tenses of verbs, singular/plural of nouns, etc. Other recall-enhancing measures are enabling synonymy, going from a phrase search to an “all words” search, and going from an “all words” search to an “n of m” (or “any”) search. Spell checking may work either way, improving recall or precision.
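The relaxation from an “all words” search to an “n of m” or “any” search can be sketched as a single threshold on the number of matched query terms. The whitespace tokenization, the function name, and the sample texts are assumptions for illustration only.

```python
def matches_n_of_m(query_terms, document_text, n):
    """True if at least n of the m distinct query terms occur in the text."""
    doc_terms = set(document_text.lower().split())
    found = sum(1 for t in {q.lower() for q in query_terms} if t in doc_terms)
    return found >= n


# Hypothetical query and documents, for illustration only.
terms = ["cheap", "skiing", "norway"]
texts = [
    "cheap skiing in norway",
    "skiing in norway",
    "skiing in sweden",
]

all_words = [t for t in texts if matches_n_of_m(terms, t, n=len(terms))]
two_of_three = [t for t in texts if matches_n_of_m(terms, t, n=2)]
any_word = [t for t in texts if matches_n_of_m(terms, t, n=1)]
```

Lowering n from m towards 1 admits more documents into the result set, which raises recall and typically lowers precision.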
In order to scale for high-volume applications, search solutions have developed from software libraries handling all aspects of the search linked into a single application running on one machine, to distributed search engine solutions where multiple, sometimes thousands of, machines execute the queries received from external clients. This development allows the search engine to run in a separate environment and to distribute the problem in an optimal manner without having external constraints imposed by the application.
The basis for performance, scalability, and fault-tolerance is the partitioning of the searchable documents into partitions handled on separate machines, and the replication of these partitions on other machines. In the search engine, the query is analyzed and then dispatched to some or all of the partitions, the results from each partition are merged, and the final result set is subject to post-processing before being passed on to the search client.
Performance and fault-tolerance are increased by replicating the data on new machines. The search engine scales for more content by adding new partitions.
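The dispatch-and-merge step can be sketched in a single process as follows. This is a simplified illustration, not a description of any particular engine: the partitions are plain dictionaries, the term-count scoring function is hypothetical, and network transport, replication, and post-processing are omitted.

```python
import heapq
from itertools import islice


def search_partition(partition, score):
    """Return the matching hits of one partition, best score first."""
    hits = [(score(text), doc_id) for doc_id, text in partition.items()]
    return sorted((h for h in hits if h[0] > 0), reverse=True)


def distributed_search(partitions, score, top_k):
    """Dispatch the query to every partition and merge the ranked results."""
    per_partition = [search_partition(p, score) for p in partitions]
    # Each input list is sorted in descending order, so a k-way merge
    # with reverse=True yields a globally ranked stream of hits.
    merged = heapq.merge(*per_partition, reverse=True)
    return list(islice(merged, top_k))


# Hypothetical partitions and scoring, for illustration only.
partitions = [
    {"a1": "skiing skiing norway", "a2": "sailing"},
    {"b1": "skiing", "b2": "skiing skiing skiing"},
]
score = lambda text: text.split().count("skiing")

top_hits = distributed_search(partitions, score, top_k=2)
```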
In traditional navigation on document-level attributes, a document having a low relevance score is counted equally with a document having a high relevance score. As the relevance score generally decays exponentially along the result set list, and documents have a fuzzy membership in the result set, navigators may include query refinements where the document count may be largely from the poor relevance hits.
FIG. 4 shows the relevance profile for a sample query on a sample content collection. The non-normalized relevance score has an exponentially falling profile towards a tail level. For this particular query, the tail level is reached around hit number 100. Documents from hit 100 onwards are included in the result set but with a very low effective membership.
In particular, as recall-improving search features are enabled, search precision falls, but generally, the relevance mechanisms in the search engine ensure that only very high-quality new documents are included at the top of the result list. However, precision in the navigators falls more, as every new document included in the result set is included in the navigators. The content of current navigators has a bias towards recall rather than precision, potentially luring users into poor query refinements by only offering the document hit count as a measure of quality.
Clients have limited screen real estate, in particular mobile devices, but even desktops suffer from information overload as too much information is packed into the viewable area. Navigator query refinements giving poor results degrade the user experience by overloading the user with information and wasting screen space that could be better used for other purposes.
The aggregation of navigation data across partitions costs network bandwidth. A partition must return the frequency count for each value in a navigator, as a partition does not know which values are to appear in the final navigator. For navigators having a large value space within the result set, the network bandwidth for distributed aggregation, prior to selecting the top N query refinements to present to the user, is a bottleneck for getting high search throughput. In particular, the inclusion of non-relevant (low frequency) values that will not be presented in the navigator wastes network bandwidth.
FIG. 5 shows a process schematic of distributed aggregation. The content partitions X01 are aggregated by processes X02 operating on the documents within the partitions that match the query. The aggregated results are passed through the network X03 to a global aggregation process X04. The global aggregation process may contain a hierarchical aggregation distributed over multiple aggregation sub-processes. Finally, process X05 presents the navigator. Navigators that have many unique values require substantial bandwidth on network X03.
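The bandwidth cost of this scheme can be illustrated with a small single-process sketch. The function names, the “person” attribute, and the sample data are hypothetical; the network is modelled simply by counting the (value, count) pairs that each partition would have to ship to the global aggregation process.

```python
from collections import Counter


def aggregate_partition(matching_docs, attribute):
    """Local aggregation: one count per distinct value seen in the partition."""
    counts = Counter()
    for d in matching_docs:
        counts.update(set(d.get(attribute, ())))
    return counts


def global_aggregate(partial_counts, top_n):
    """Sum the per-partition counts and keep only the top_n values."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total.most_common(top_n)


# Hypothetical matching documents per partition, for illustration only.
partition_1 = [{"person": ["Ann", "Bob"]}, {"person": ["Ann"]}]
partition_2 = [{"person": ["Bob", "Cid"]}, {"person": ["Bob"]}, {"person": ["Dan"]}]

partials = [aggregate_partition(p, "person") for p in (partition_1, partition_2)]
pairs_shipped = sum(len(c) for c in partials)  # all values cross the network
shown = global_aggregate(partials, top_n=2)    # only these reach the user
```

In this sample, “Bob” is the most frequent value globally although it is not the most frequent value in the first partition, which is why a partition cannot safely truncate its own list before shipping it.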
The aggregation of the navigation data is typically across the full result set. For higher performance, saving network bandwidth as above as well as CPU, it can be performed on the top N ranked hits, where N is a configuration or a per-query parameter (so-called shallow aggregation). In general, a fixed N will not match the relevance score profiles of a wide set of queries in such a way that only “super-relevant” documents are included (cf. the tail level from hit 100 onwards in FIG. 4). It will be impossible to find a general value for N or to infer the value from the query alone. Even if such an N were found, there would be a substantial range of relevance scores within the relevant documents, and all documents are counted equally, independent of relevance score.
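Shallow aggregation and its sensitivity to the choice of N can be sketched as follows. The ranked hit list, its relevance scores, and the function name are hypothetical values chosen for illustration.

```python
from collections import Counter


def shallow_navigator(ranked_hits, attribute, n):
    """Aggregate attribute values over the top n ranked hits only."""
    counts = Counter()
    for d in ranked_hits[:n]:
        # Every hit counts as 1, whatever its relevance score.
        counts.update(set(d.get(attribute, ())))
    return counts


# Hypothetical result list, already sorted by descending relevance score.
ranked_hits = [
    {"score": 9.0, "country": ["Norway"]},
    {"score": 7.5, "country": ["Norway"]},
    {"score": 0.4, "country": ["Sweden"]},
    {"score": 0.3, "country": ["Sweden"]},
    {"score": 0.3, "country": ["Sweden"]},
]

full = shallow_navigator(ranked_hits, "country", n=len(ranked_hits))
top_2 = shallow_navigator(ranked_hits, "country", n=2)
top_3 = shallow_navigator(ranked_hits, "country", n=3)
```

With the full result set the low-scoring tail dominates the navigator; with N=2 the tail is excluded; with N=3 a tail hit is already counted on equal terms with the top hits. A cut-off suited to this score profile would not necessarily suit another query.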
However, as seen from the above, navigation and navigation tools are encumbered with some drawbacks, particularly with regard to applying or refining queries in a manner that ensures an improvement in the quality of the search result and that tackles the problems deriving from the use of inappropriate measures of quality; an obvious example is the case where recall is favoured over precision.