Vast amounts of digital information are available to users through the interconnection of computers and storage by the Internet. Indeed, traditional paper records can be scanned and retained electronically. Emails or instant messages between a company's engineers, planners, or financial officers can have the same or greater significance as formal memorandums or journal articles in a knowledge database. Although available in great quantity, digital information must be accessed in order to be used efficiently and profitably. Typical search methods use queries consisting of a Boolean combination of words and phrases which return documents containing words or phrases that match the search query. Imprecision and ambiguities in the words and phrases affect the effectiveness of searches. Refining a search takes time and skill and a user's search capability usually improves with practice and experience.
However, businesses need fast and reliable search capabilities that are not dependent on a user's skill and experience. Speed and reliability are improved by a search capability that precludes a null result. A null result occurs when the desired information is not present in the database, or when the desired information is present but the search term fails to locate it. Speed and reliability are also improved by a search that is controlled so that users are logically directed to the sought-after information without distraction by non-relevant information. Thus, the record count of documents responsive to a search should never be zero, and the results should include only documents that are relevant to the search.
In addition to speed and reliability, businesses need a search capability that can restrict access to portions of their digital information. For example, not all employees need access to all information. Trade secret information, while part of the total universe of a business's digital information, must be accessible only to those with proper authorization. Moreover, one division of a corporation may not need access to another division's information. Employee files, including health records and financial data, must be protected in accordance with various State and Federal laws. Thus, the need for information access with speed and reliability is affected by an equally important need to restrict access where appropriate.
Personnel responsible for organizing the universe of a business's digital information have found that a controlled search can be achieved by a faceted search using a facet tree. A facet tree is a categorization scheme for hierarchically structured data and content records in which the facets and facet elements of the tree guide the search and a posting list of records is associated with each facet element. A facet is a node in the hierarchical network, and a facet may have any number of facet elements. The total number of records in a facet is the sum of all records in the posting lists of the facet elements beneath the facet in the facet tree.
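The facet tree described above can be illustrated with a minimal sketch. The class names, field names, and sample records below are hypothetical and chosen only for illustration; the sketch assumes that each record appears on exactly one posting list, so that summing posting-list lengths gives the record total for a facet.

```python
from dataclasses import dataclass, field


@dataclass
class FacetElement:
    """A leaf of the facet tree, with its posting list of record IDs."""
    name: str
    posting_list: list[str] = field(default_factory=list)


@dataclass
class Facet:
    """A node of the facet tree; it may hold child facets and facet elements."""
    name: str
    child_facets: list["Facet"] = field(default_factory=list)
    elements: list[FacetElement] = field(default_factory=list)

    def record_count(self) -> int:
        # Sum the posting lists of every facet element beneath this facet.
        own = sum(len(e.posting_list) for e in self.elements)
        return own + sum(c.record_count() for c in self.child_facets)


# Hypothetical sample data: Geography -> Europe -> {France, Italy}
europe = Facet(
    "Europe",
    elements=[
        FacetElement("France", ["doc1", "doc2"]),
        FacetElement("Italy", ["doc3"]),
    ],
)
geography = Facet("Geography", child_facets=[europe])

europe_total = europe.record_count()
geography_total = geography.record_count()
```

In this sketch the count for a facet is computed on demand by walking the tree; a production system might instead cache the totals when records are tagged.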
A faceted search is conducted by a faceted search engine. Selection of a facet (a word or phrase) causes the faceted search engine to present the next set of facets, and so on along the facet tree until a final facet is reached and the facet elements are displayed. When the user selects one of the facet elements, the documents on the associated posting list are displayed. The relationship of the facets to each other and of the facets to the facet elements is created by tagging. In general, a tag is one or more characters attached to a set of data that contain information about the set, including its identification. Tagging means assigning a value to a facet so that it is identified with another facet in a hierarchical relationship, and assigning a value to a facet element so that it is identified with a particular facet.
Current approaches to faceted search are based upon the tagging of data and facet elements in a strict parent/child relationship. Using a faceted search engine, end users choose one facet or facet element at a time. Document or data record counts are displayed in conjunction with the facet elements, such that the end users are aware of the total number of records and also how many data records exist at a given facet element within a facet. The total number of records means the number of records for all of the combined facets and facet elements that have been selected.
Faceted search has the advantage of eliminating null results when end users are searching for records across many combined dimensions. To do this, data records that are tagged to a facet element are also made available at facet levels. By way of example, a user might use a faceted search engine to locate content in geographical, industry, and technology facets by choosing “Europe,” “Retail,” and “Supply Chain” as facets. When one does this, the records in the posting lists identified with the facets of Europe, Retail, and Supply Chain are aggregated and returned. The records returned are the set intersection of the data records rolled up to Europe, to Retail, and to Supply Chain, and the total number of records returned is the size of that intersection.
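The combination of selected facets can be sketched as a set intersection. The function name, the mapping from facet name to rolled-up record IDs, and the sample record IDs are all hypothetical; the sketch only assumes that records tagged to facet elements have already been rolled up to their facets.

```python
def faceted_search(rolled_up: dict[str, set[str]], selections: list[str]) -> set[str]:
    """Return the records common to every selected facet."""
    sets = [rolled_up[name] for name in selections]
    # With no selections there is nothing to intersect, so return an empty set.
    return set.intersection(*sets) if sets else set()


# Hypothetical rolled-up record IDs for three facets.
rolled_up = {
    "Europe": {"r1", "r2", "r3"},
    "Retail": {"r2", "r3", "r4"},
    "Supply Chain": {"r3", "r5"},
}

result = faceted_search(rolled_up, ["Europe", "Retail", "Supply Chain"])
record_count = len(result)
```

Because a faceted interface offers only facets whose rolled-up counts are known in advance, it can avoid presenting a combination whose intersection is empty.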
Another form of faceted search is a conversational search. In a conversational search, end users are presented with a series of questions derived from or tagged to a facet hierarchy, and the answers given by the end users result in faceted responses rendered by a conversational search engine. The response may include search engine results at the record level finding, for example, documents, titles, abstracts, and other information about those documents. The response may also include navigation from the facet that the question was asked about, to a succeeding facet or facet element. In a conversational scheme, the semantic content of the questions matters, and so the order in which the questions are presented is important.
Facets are most often applied to inverted indexes. An inverted index is defined by the National Institute of Standards and Technology (NIST) as an “index into a set of texts of the words in the texts.” NIST further states that “[e]ach index entry gives the word and a list of texts, possibly with locations within the text, where the word occurs.” Thus, when an inverted index is accessed by a search engine, a word can be located and its location in a text displayed. The index is then as large as the total number of individual word occurrences in the texts in the database. In order to prepare the index for faceted search, index terms must be selected as facets or facet elements, and the hierarchical relationship must be established by tagging a facet to another facet, a facet to a facet element, or a facet element to a posting list.
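A minimal sketch of an inverted index in the sense of the NIST definition follows. The function name, the whitespace tokenization, and the sample texts are assumptions made for illustration; real indexes typically apply more careful tokenization and normalization.

```python
from collections import defaultdict


def build_inverted_index(texts: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map each word to the (text ID, word position) pairs where it occurs."""
    index: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for text_id, body in texts.items():
        # Naive tokenization: lowercase and split on whitespace.
        for position, word in enumerate(body.lower().split()):
            index[word].append((text_id, position))
    return dict(index)


# Hypothetical sample texts.
texts = {
    "t1": "Supply chain in Europe",
    "t2": "Retail supply chain",
}
index = build_inverted_index(texts)
supply_entries = index["supply"]
```

Each entry records one occurrence of a word, so the number of stored pairs equals the number of word occurrences across all texts.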
Businesses need to apply policies to tagging to meet requirements for restricted access for security and legal reasons, and also to promote the efficiency of providing results to end users. Automated tagging algorithms exist, such as the Eureka algorithm and software from IBM Research. Other automated tagging algorithms are known. For example, United States Patent Application 2002/0091661, entitled “Method and Apparatus for Automatic Construction of Faceted Terminological Feedback for Document Retrieval” (the '661 application), discloses a method for extracting key informational concepts or “facets” from a collection of documents. In the '661 application, facets are chosen heuristically based on “lexical dispersion.” As used in the '661 application, lexical dispersion is “a measure of the number of different words with which a particular word co-occurs within such syntactic contexts.” The facets and their corresponding lexical constructs provide a structured summary of the contents of a result set.
However, these automated solutions rely on factors such as word frequency in a document and cannot tag documents in accordance with business rules. In order to tag data records and documents in accordance with a stated policy, a person must manually perform the tagging according to specific business rules. While algorithms are capable of approximating human tagging based on semantic categorization, they cannot perform tagging in accordance with policies. Additionally, each time a tagging policy changes, the algorithm would need to be changed. Therefore, although labor intensive, human tagging is necessary to make the policy distinctions that cannot be made by a machine.
Policies that affect tagging can be categorized as restriction policies and efficiency policies. Restriction policies include protection of trade secret and proprietary data, protection of data for legal requirements, permissions on who can create or alter a facet tree, permissions on who can view the original configuration of a facet tree that has been altered, and permissions on who can create or view an altered facet tree. Efficiency policies include rules regarding display of the data based on semantics (the specific facet under focus), display of the data based on end user characteristics, display of the data based on the kind of content, display of the data based upon validation by role (such as authoritative end users, content administrators, auditors, and so on), display of data that has been auto-tagged (shown unaltered with a warning message until validated), display of the facets in different sequences, and display of records based on currency. These two policy areas allow for any rules that a business may need to ensure that only those who need to access the records do so. In addition, these two policy areas allow for rules to ensure that records will be displayed consistent with the business's requirements and best business practice policies.
One example of an efficiency policy is that companies need the ability to apply specific business rules, since some data elements are “tagged to” a child element, and thus should “count toward” the total number of data records for a parent element without actually being “displayed” at the parent data element level. A particular product, for example, may be sold in Europe because it is sold in France. From an end user search perspective, what is needed is an invention that allows for “counts” of search records to be aggregated to the parent level, while disallowing display of the records in a faceted search interface until the end user navigates to the appropriate child nodes. In other words, end users should be able to distinguish between facet elements that belong at different levels when necessary, and also to see all results when necessary.
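The distinction between counting a record at a parent and displaying it there can be sketched as follows. This is an illustration of the stated need, not a described implementation: the `show_at_parent` flag, the `PARENT_OF` mapping, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    tagged_to: str        # facet element the record is tagged to
    show_at_parent: bool  # hypothetical policy flag set by a business rule


# Hypothetical child-to-parent relationships in the facet tree.
PARENT_OF = {"France": "Europe", "Italy": "Europe"}


def count_at(node: str, records: list[Record]) -> int:
    """Count records tagged to the node or to any of its direct children."""
    return sum(
        1 for r in records
        if r.tagged_to == node or PARENT_OF.get(r.tagged_to) == node
    )


def display_at(node: str, records: list[Record]) -> list[str]:
    """List records shown at the node; child records need the policy flag."""
    return [
        r.record_id for r in records
        if r.tagged_to == node
        or (PARENT_OF.get(r.tagged_to) == node and r.show_at_parent)
    ]


records = [
    Record("p1", "France", show_at_parent=False),
    Record("p2", "France", show_at_parent=True),
    Record("p3", "Italy", show_at_parent=True),
]

europe_count = count_at("Europe", records)
europe_display = display_at("Europe", records)
france_display = display_at("France", records)
```

Here record `p1` contributes to the Europe count but is listed only once the end user navigates to France. The sketch handles a single parent level; a deeper hierarchy would need to walk the full chain of ancestors.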
One of the main issues with current faceted search implementations is that they do not allow those who “own” collections of faceted content to display results that are tagged in a case-specific fashion. As explained above, businesses often need to display documents or data records in a fashion that “counts membership” in a parent, while delineating those records so that the records are known not to apply at a particular parent level. In a business context, for example, a service provider may offer services in Florence, Italy. In standard faceted search implementations, records for the service provider are displayed for all of Italy, for the region of Tuscany, and for the city of Florence. Yet the service provider may offer services “only” in Florence, or at any of the specified levels of the given hierarchy. In some contexts, then, an end user searching a collection of service providers may wish to know that a certain number of child records or documents exist below the level either of Italy or of Tuscany, without viewing the child records, or at the very least while viewing the child records in a fashion that delineates their difference from records that “belong” at multiple levels of a hierarchy. Similarly, an owner of content or an administrative role may wish to enforce rules to disallow display of such items, even though they are present in the faceted counts of documents or data records for a given facet element. There is a need for administrative capability to specially configure display of facet results at the facet level by end user characteristic, such as audience or role membership, or by combined facet and end user characteristics. Current approaches do not directly address this problem.
The closest technical solution to the problem is Online Analytical Processing (OLAP) database technology, but such an approach encounters severe performance challenges due to the large number of relational joins required, and imposes design constraints on dimensions that are added to the OLAP database/storage.
Another example of an efficiency policy concerns the ordering of facets. Current approaches to faceted search of web content provide very limited ability to conditionally reorder facets or facet elements, instead providing a single ordering for a given collection of content. While this is less of a problem for basic navigation through sets of contents tagged with facet elements, it becomes a very significant problem in other contexts. In the conversational scheme, facets are more than just navigational mechanisms to narrow a query or whittle down a result set. They are a mechanism by which end users provide data that is then saved and put to various uses. For example, facet sequences may be rendered in a conversational style so that end users may apply metadata to a set of documents or data records. Similarly, facet sequences can be provided to some administrative roles such that compliance with business rules may be verified in a particular order. Facet sequences may be provided to end users as a means of collecting input for problem determination. Facet sequences are also an important navigational alternative to showing all facets at the same level and allowing end users to navigate in any facet order. Because faceted navigation tends to exceed the contents that can be held by a single screen, users with limited screen space may need to scroll excessively. Consider, for example, a sales representative who accesses a corporate intranet via a PDA such as a BlackBerry. A mechanism to present facets in a particular order would provide a much more manageable interface for such an end user, or perhaps any end user with limited screen space. Finally, an end user may need facets only in a particular order. Rather than navigating through the same set of questions again and again, end users in a conversational system should be able to arrange the “prompts” in one or more orderings of their choosing.
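The idea of saving and recalling a facet order can be illustrated with a short sketch. It is only an illustration of the need described above, under assumed names: the `saved_orders` store, its keys, and the `order_facets` function are hypothetical.

```python
def order_facets(facets: list[str], saved_order: list[str]) -> list[str]:
    """Arrange facets by a saved ordering; unlisted facets keep their place at the end."""
    rank = {name: i for i, name in enumerate(saved_order)}
    # sorted() is stable, so facets missing from saved_order retain
    # their original relative order after the ranked ones.
    return sorted(facets, key=lambda f: rank.get(f, len(rank)))


# Hypothetical store of named orderings, e.g. one per end user or device.
saved_orders = {
    "mobile": ["Technology", "Geography"],
    "audit": ["Industry", "Geography", "Technology"],
}

default_facets = ["Geography", "Industry", "Technology"]
mobile_view = order_facets(default_facets, saved_orders["mobile"])
audit_view = order_facets(default_facets, saved_orders["audit"])
```

Keeping orderings as plain named lists makes them cheap to specify, store, and recall; how they would be persisted or tied to roles is left open here.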
Therefore, a need exists for a way to configure the view of facets and for a mechanism that allows for quick specification, saving, and recall of a facet order.