Retrieving information from a database is an increasingly challenging problem, as growing computing power and networking infrastructure allow the aggregation of large amounts of information and widespread access to that information. A goal of the information retrieval process is to allow users to identify materials of interest.
As the number of materials that users may search increases, identifying materials relevant to the search becomes increasingly important, but also increasingly difficult. Challenges posed by the information retrieval process include providing an intuitive, flexible user interface and completely and accurately identifying materials relevant to the user's needs within a reasonable amount of time. Another challenge is to provide an implementation of this user interface that is highly scalable, so that it can readily be applied to the increasing amounts of information and demands to access that information. The information retrieval process comprehends two interrelated technical aspects, namely, information organization and access.
Faceted Classification Systems
One method to address the information organization problem is to use a faceted classification system.
A faceted classification system is a scheme for classifying a collection of materials using a set of facets, where each facet represents a collection of related values or categories. For example, for a collection of materials representing a catalog of books, the facets might include Author, Subject, Year of Publication, etc., and the Author facet might include values like “Herman Melville” and “Mark Twain.”
The values in a facet may be organized hierarchically, with more general topics at the higher levels of the hierarchy, and more specific topics towards the leaves. For example, the Subject facet might include top-level categories such as “Business & Money” and “Computing & Internet.” The “Business & Money” category might include child categories such as “Careers & Employment,” “Management & Leadership,” “Personal Finance,” etc., and the “Computing & Internet” category might include child categories such as “Graphics & Design,” “Operating Systems,” and “Programming.”
Examples of partial facets for a books knowledge base are depicted in FIG. 1. FIG. 1 depicts part of the structure of an example Subject facet 110 and a Format facet 120. The Format facet 120 is an example of a flat facet, where the facet values such as “Hardcover” 130 and “Paperback” 135 do not have hierarchical parent-child relationships. The Subject facet 110 illustrates a facet containing hierarchical facet values, with top-level facet values “Business & Money” 150 and “Computing & Internet” 180. Values in the Subject facet have parent-child relationships, denoted by arrows from parent facet values to child facet values. For example, the “Business & Money” facet value 150 is the parent of the “Careers & Employment” facet value 160, which is in turn the parent of the “Cover Letters, Resumes & Interviews” facet value 170.
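As an illustration only, the facet structure of FIG. 1 can be sketched with a parent-pointer map. The names SUBJECT_PARENT, FORMAT_VALUES, ancestors, and children are hypothetical and are not drawn from any described implementation; only the facet values themselves come from the figure.

```python
# Hypothetical parent-pointer encoding of the Subject facet values named in FIG. 1.
# A parent of None marks a top-level facet value.
SUBJECT_PARENT = {
    "Business & Money": None,
    "Computing & Internet": None,
    "Careers & Employment": "Business & Money",
    "Cover Letters, Resumes & Interviews": "Careers & Employment",
}

# A flat facet such as Format is simply a set of values with no parent links.
FORMAT_VALUES = {"Hardcover", "Paperback"}


def ancestors(value, parent_map):
    """Return the ancestors of a facet value, from its parent up to the top level."""
    chain = []
    current = parent_map[value]
    while current is not None:
        chain.append(current)
        current = parent_map[current]
    return chain


def children(value, parent_map):
    """Return the direct children of a facet value (None selects the top level)."""
    return sorted(v for v, parent in parent_map.items() if parent == value)


if __name__ == "__main__":
    print(ancestors("Cover Letters, Resumes & Interviews", SUBJECT_PARENT))
    print(children(None, SUBJECT_PARENT))
```

Calling children with None lists the top-level values, and calling it again with a selected value lists the next level down, which is one way a hierarchy could be walked one level at a time.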
A faceted classification system assigns a mapping from each object in the collection to the complete set of facet categories that describe that object. Objects can be assigned an arbitrary number of categories from any facet. For example, a book might be assigned multiple Author categories, because books can be written by more than one author. Conversely, a book might be assigned no value from the Illustrator facet, since it may contain no illustrations.
Faceted classification systems result in a more compact and efficiently represented taxonomic schema than traditional single-hierarchy approaches to object classification such as the Library of Congress Classification System. They are also easier to extend as new dimensions of object description become necessary, compared to tree-structured systems such as the Yahoo directory.
Faceted Navigation Systems
While a faceted classification system addresses the information organization problem, it is still necessary to access this information. A faceted navigation system is a computer-implemented system that provides an interactive query refinement interface for locating and retrieving objects from a collection of materials described by a faceted classification scheme.
Typically, a faceted navigation system initially presents the complete set of facet categories that describe any object in the database. The user of a faceted navigation system may select from these facet categories to narrow the set of selected objects. After the user makes a selection, the set of facet categories presented by the system is pruned to only those assigned to the remaining filtered objects. That is, the system only presents categories for which there exists an object described by both that category and all other previously selected categories.
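The pruning behavior described above can be sketched as follows. This is a minimal illustration under assumed data, not the implementation of any particular system: the BOOKS records and the functions select and available_refinements are hypothetical, and a practical system would use indexes rather than a linear scan.

```python
# Hypothetical collection: object id -> facet name -> set of assigned values.
BOOKS = {
    "b1": {"Subject": {"Programming"}, "Format": {"Hardcover"}},
    "b2": {"Subject": {"Programming"}, "Format": {"Paperback"}},
    "b3": {"Subject": {"Personal Finance"}, "Format": {"Paperback"}},
}


def select(collection, selections):
    """Return the ids of objects that carry every selected (facet, value) pair."""
    return {
        obj_id
        for obj_id, facets in collection.items()
        if all(value in facets.get(facet, set()) for facet, value in selections)
    }


def available_refinements(collection, selections):
    """Return facet -> values assigned to at least one remaining object.

    Any value returned here is guaranteed to lead to a non-empty result set
    when added to the current selections.
    """
    refinements = {}
    for obj_id in select(collection, selections):
        for facet, values in collection[obj_id].items():
            refinements.setdefault(facet, set()).update(values)
    return refinements


if __name__ == "__main__":
    # Before any selection, every assigned category is offered.
    print(available_refinements(BOOKS, []))
    # After selecting Format = Hardcover, "Personal Finance" is pruned.
    print(available_refinements(BOOKS, [("Format", "Hardcover")]))
```

In this sketch, a category is offered only if some remaining object carries it, so no offered refinement can produce an empty result.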
Such an interface allows the user to select parametric query refinements incrementally, and in the process to narrow down the set of selected objects, effectively searching the database for some subset of interest. This search process is made more efficient and less frustrating by the removal of invalid facet categories that would lead to empty sets of selected objects, which are an undesirable result in most database search applications.
A faceted navigation system may organize the presentation of facet categories that are part of a hierarchical facet. For example, a faceted navigation system might show only the highest-level facet categories initially available in each facet, and provide controls for the user to expand to lower levels of the hierarchy.
U.S. patent application Ser. No. 09/573,305, entitled “Hierarchical Data-Driven Navigation System and Method for Information Retrieval,” and assigned to the assignee of the present invention, discloses a system and method for implementing a faceted navigation system. The contents of Ser. No. 09/573,305 are incorporated herein by reference.
Limitations of Prior Art
Faceted navigation systems are useful for searching a collection of objects where each object is described by a set of independent facet categories. But they fail to address the need to search databases with more complex structure, where users' constraints must apply to more than one related collection of objects, and the set of matching objects depends on the relationships between those objects and the objects in other collections.
As a simple example, consider a database containing both books and people who contributed to the books as authors. For simplicity, suppose that books are described by such facets as Subject, Year of Publication, and Author, and that people are described by Nationality and Gender. Example objects in this database are depicted in FIG. 2A. FIG. 2A represents the objects as they would be stored to correspond to real-world concepts, with an individual object used to represent each book 210, and a separate object used to represent each author 220.
One shortcoming of the storage approach depicted in FIG. 2A is the inability to perform faceted navigation based on the facet values associated with related objects. For example, a user might wish to navigate books based on the properties of their authors (e.g., search for all books by Romanian authors). But this type of navigation is not possible using the storage approach of FIG. 2A.
To accomplish this task in a faceted navigation system, a system might assign categories of the author to the book objects, as depicted in FIG. 2B. For example, a faceted classification system for books could have the facets Subject, Year of Publication, Author, Author Nationality, and Author Gender. This approach may work for books that have a single author, such as book 230, but becomes problematic for books with more than one author, such as book 240. A search for books by American women will return every book for which at least one co-author is American and at least one co-author is a woman; for some results, such as book 240, these may be different co-authors, which may not have been the intended interpretation of the search. The source of this problem is the many-to-many relationship between books and authors: this type of data relationship, in combination with the limitations of the faceted classification model, causes the system to flatten the information about multiple authors into a single book object, losing the information necessary to answer the query correctly.
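The false match produced by flattening can be reproduced in a few lines. The sketch below uses invented author attributes (the nationalities and genders of the authors in FIG. 2B are not assumed here), and the function names flatten, flat_search, and per_author_search are hypothetical.

```python
# Hypothetical authors and book-author links; values are illustrative only.
AUTHORS = {
    "a1": {"Nationality": "American", "Gender": "Male"},
    "a2": {"Nationality": "Romanian", "Gender": "Female"},
    "a3": {"Nationality": "American", "Gender": "Female"},
}
BOOK_AUTHORS = {
    "single_author_book": ["a3"],
    "coauthored_book": ["a1", "a2"],
}


def flatten(book_authors, authors):
    """Copy author categories onto each book, discarding which author had which."""
    return {
        book: {
            "Author Nationality": {authors[a]["Nationality"] for a in ids},
            "Author Gender": {authors[a]["Gender"] for a in ids},
        }
        for book, ids in book_authors.items()
    }


def flat_search(flat, nationality, gender):
    """Match books whose flattened categories contain both values."""
    return sorted(
        book
        for book, facets in flat.items()
        if nationality in facets["Author Nationality"]
        and gender in facets["Author Gender"]
    )


def per_author_search(book_authors, authors, nationality, gender):
    """Match books having a single author who satisfies both constraints."""
    return sorted(
        book
        for book, ids in book_authors.items()
        if any(
            authors[a]["Nationality"] == nationality
            and authors[a]["Gender"] == gender
            for a in ids
        )
    )


if __name__ == "__main__":
    flat = flatten(BOOK_AUTHORS, AUTHORS)
    print(flat_search(flat, "American", "Female"))
    print(per_author_search(BOOK_AUTHORS, AUTHORS, "American", "Female"))
```

With this data, the flattened search returns both books, while the per-author search returns only the single-author book, because the co-authored book has an American man and a Romanian woman as its authors.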
An alternate approach to providing faceted navigation on books in this schema is to expand the unique book-plus-author combinations into individual records described by the facet categories of the book and a single co-author, as depicted in FIG. 2C. This approach addresses the need to preserve the relationships between the facet categories associated with individual co-authors in order to answer queries correctly. In effect, it de-normalizes the data from its many-to-many form into a one-to-one form. But this approach gives rise to two new problems:
The first problem is that duplicate book results will be returned (250, 260). For example, in the knowledge base depicted by FIG. 2C a search for books on the subject of “Computer Science” would return two results for the book entitled “Algorithmic and Computational Robotics,” one duplicate for each of the two co-authors.
The second problem is that the size of the database is expanded. In this example, since a unique record is required for each book-plus-co-author combination, the size of the database is increased by a factor equal to the average number of co-authors per book.
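Both problems can be observed in a small sketch of the "unique combinations" storage approach. The record layout, the placeholder author identifiers, the second book, and the functions expand, search, and distinct_books are assumptions made for illustration; only the title "Algorithmic and Computational Robotics," its subject, and its two co-authors come from the example above.

```python
# Hypothetical normalized data: one entry per book, plus book-author links.
BOOKS = {
    "robotics": {
        "Title": "Algorithmic and Computational Robotics",
        "Subject": "Computer Science",
    },
    "solo": {"Title": "Placeholder Single-Author Book", "Subject": "Computer Science"},
}
BOOK_AUTHORS = {
    "robotics": ["author_x", "author_y"],  # placeholder ids for the two co-authors
    "solo": ["author_z"],
}


def expand(books, book_authors):
    """Create one record per unique book-plus-author combination."""
    return [
        {"book_id": book_id, "author_id": author_id, **books[book_id]}
        for book_id, author_ids in book_authors.items()
        for author_id in author_ids
    ]


def search(records, subject):
    """Return every expanded record with the given Subject."""
    return [r for r in records if r["Subject"] == subject]


def distinct_books(records):
    """Collapse duplicate results to one entry per book, keeping first-seen order."""
    return list(dict.fromkeys(r["book_id"] for r in records))


if __name__ == "__main__":
    records = expand(BOOKS, BOOK_AUTHORS)
    hits = search(records, "Computer Science")
    print(len(BOOKS), len(records))          # stored books vs. expanded records
    print([r["book_id"] for r in hits])      # the co-authored book appears twice
    print(distinct_books(hits))              # duplicates collapsed per book
```

Here two books expand into three records, and the co-authored book appears once per co-author in the raw results until the results are collapsed by book.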
The first of these problems can be solved with extra query processing to detect and aggregate duplicate records (e.g., using the equivalent of a SQL “GROUP BY” clause). But the second problem can be especially severe in the context of more complex schemas. The increase in database size in the books example may be acceptable; the majority of books are associated with just a single author, and the average number of authors per book in most real-world databases would be two or fewer, so no more than a doubling of the database size would be incurred. But the problem becomes more significant with the example depicted in FIG. 3, which illustrates a database storing information about alumni, the degrees they received, and the gifts they gave to the school.
A faceted navigation system could be used to search the set of alumni based on the facet categories of the gifts they had given and the degrees that they received. For example, it might be desired to locate alumni who had received an MBA in 1995 and who had given a gift of $500 in 2005. As in the books/authors example, flattening all of the gift and degree facet categories onto the alumni records loses information about the data interrelationships. This query would then return results such as an alumnus who gave $500 in 2004 but only $100 in 2005, which is undesirable behavior. And in this case, the approach of creating a record for each unique alumnus-plus-gift-plus-degree combination leads to problematic growth in the size of the database, as the expansion factor is determined by the cross product among the three object types. For example, suppose that the average alumnus received 1.5 degrees and gave 8 gifts. This would lead to roughly 12-fold growth in the size of the database (1.5 × 8 = 12).
More complex examples only exacerbate the problem, with each one-to-many and many-to-many object type relationship contributing an additional multiplicative factor to the size of the database growth factor. In general, the number of records needed for faceted navigation using the “unique combinations” approach grows exponentially in the number of object types with one-to-many and many-to-many interrelationships, making the storage of databases with even a modest number of object types intractable.
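The multiplicative growth can be made concrete with a short calculation. The sketch below is illustrative only: expansion_factor and combination_records are hypothetical names, and multiplying average fan-outs, as in the alumni example, is an approximation that assumes the number of degrees and the number of gifts per alumnus are uncorrelated.

```python
from itertools import product
from math import prod


def expansion_factor(fanouts):
    """Estimate records per primary object as the product of average fan-outs."""
    return prod(fanouts)


def combination_records(alumnus, degrees, gifts):
    """Enumerate one record per alumnus-plus-degree-plus-gift combination."""
    return [(alumnus, degree, gift) for degree, gift in product(degrees, gifts)]


if __name__ == "__main__":
    # Averages from the alumni example: 1.5 degrees and 8 gifts per alumnus.
    print(expansion_factor([1.5, 8]))

    # One hypothetical alumnus with 2 degrees and 8 gifts.
    degrees = ["degree_1", "degree_2"]
    gifts = [f"gift_{i}" for i in range(1, 9)]
    print(len(combination_records("alumnus_1", degrees, gifts)))

    # With k related object types of fan-out 3 each, growth is 3 ** k.
    for k in range(1, 5):
        print(k, expansion_factor([3] * k))
```

Each additional related object type multiplies the record count by its fan-out, so a uniform fan-out of 3 across four related types already yields 81 records per primary object.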