As it is generally known, many situations call for computer software to organize information into categories. For example, FIG. 1 shows a simplified screen shot 10 illustrating a software generated user interface, provided through a Web Browser application, enabling a user to browse a set of project information objects. In the example of FIG. 1, the user has indicated to the system that he or she desires to browse the project categories within the Database API (Application Programming Interface) category, which is contained within the higher level category of Database Environment. In the example categorization of FIG. 1, a project can be classified by database environment, intended audience, operating system, programming language, translations, or user interface. FIG. 1 illustrates that categories may be nested in such systems, as in the nesting of the Database API category within the Database Environment category, and the various sub-categories within Database API.
Many other views of project categorization may be generated using a system such as that shown in FIG. 1. For example, a user may want to filter on one or more categories to which a project might belong, e.g. view all projects that use JDBC or XML-based database APIs, and that are written in Java (Programming Language category). Those skilled in the art will recognize that many other examples of information categorization exist on the World Wide Web (WWW) and elsewhere. These examples include various employment databases, which may allow for filtering of available jobs by employer and location; online shopping Web sites, which may allow for filtering by product, brand, and/or product features; electronic mail (e-mail) systems, which may allow categorization of a single piece of e-mail into several different folders; and others.
As further illustrated by FIG. 1, a useful metaphor for existing software categorizers has been to view each category as a folder. In such systems, each folder in a hierarchy of folders can itself contain sub-folders representing sub-categories. A folder at any level can contain information items, such as projects, e-mail messages, shopping items, etc. Additionally, any information item can appear in more than one folder, and any folder can be a sub-folder under any number of other folders. However, a folder cannot be a sub-folder of one of its own descendants, i.e. cycles are not permitted.
One technical challenge in implementing information item categorizations is the nested nature of the categories. For example, a categorizer for a job database should be able to find all jobs within a coarser category, such as those located in Massachusetts, as readily as it finds all jobs within a sub-category, such as those located in Westford, Mass.
In more general terms, the problem to be solved involves categories forming a directed acyclic graph, with the leaf nodes being the items to be retrieved, and the non-leaf nodes representing the categories. The graph includes an edge from node a to node b if either i) node b is a sub-category of node a or ii) node b is a leaf item under category a. For any query, the system must be able to retrieve all leaf nodes reachable from a given collection of non-leaf nodes.
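The formulation above can be illustrated with a minimal sketch. The graph contents, the dictionary representation, and the function name `reachable_leaves` are assumptions made for illustration only; the breadth-first traversal is simply one way to collect the reachable leaf nodes.

```python
from collections import deque

# Hypothetical category graph: each key is a category (non-leaf node) and
# each value lists its direct children (sub-categories or leaf items).
# Any node without an entry in the mapping is treated as a leaf item.
EDGES = {
    "Massachusetts": ["Westford", "Boston"],
    "Westford": ["job-1", "job-2"],
    "Boston": ["job-3"],
    "Engineering": ["job-1", "job-3"],
}


def reachable_leaves(edges, start_categories):
    """Return the set of leaf items reachable from any of the given categories."""
    seen = set()      # nodes already visited (an item may have several parents)
    leaves = set()
    queue = deque(start_categories)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        children = edges.get(node)
        if children is None:
            leaves.add(node)        # no outgoing edges recorded: a leaf item
        else:
            queue.extend(children)  # a category: keep descending
    return leaves


coarse = reachable_leaves(EDGES, ["Massachusetts"])
fine = reachable_leaves(EDGES, ["Westford"])
```

In this sketch the same function serves the coarse query (Massachusetts) and the fine query (Westford); only the starting node differs.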
Using the above formulation, a straightforward implementation would be to calculate a reachability matrix as the transitive closure of the adjacency matrix of the graph. For example, such a reachability matrix may have a 1 for entry [i, j] if there is at least one path from node i to node j, and a 0 otherwise. Such an approach may be sufficient for fairly static applications, such as online shopping, in which the items or their classifications do not change frequently. However, for more dynamic applications, in which items are re-categorized more frequently, as in a categorization of a user's e-mail messages, this approach does not work as well, since the transitive closure calculation is expensive in terms of resources used.
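As a sketch of the reachability-matrix idea, the following uses Warshall's algorithm, one well-known way to compute a transitive closure. The text does not name a particular algorithm, so this choice, the four-node example graph, and the function name are assumptions. The triple loop runs in time cubic in the number of nodes, which illustrates why recomputing the closure after every re-categorization is costly.

```python
def transitive_closure(adj):
    """Return the reachability matrix of a graph given its 0/1 adjacency matrix.

    reach[i][j] is 1 if at least one path leads from node i to node j.
    """
    n = len(adj)
    reach = [row[:] for row in adj]  # copy so the input is left unchanged
    for k in range(n):               # allow node k as an intermediate step
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = 1
    return reach


# Hypothetical node numbering:
# 0 = Database Environment, 1 = Database API, 2 = JDBC, 3 = a project (leaf)
ADJ = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

REACH = transitive_closure(ADJ)
```

Here `REACH[0][3]` is 1 because the project is reachable from Database Environment through Database API and JDBC, even though no direct edge exists.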
An alternative approach using relational databases might store links to actual items as direct descendants of non-leaf nodes. For example, in a database corresponding to the interface shown in FIG. 1, the following records could be stored (for clarity only leaf-item containments are listed): (projectid-n, database-environment), (projectid-n, database-api), (projectid-n, JDBC). Thus the reachability matrix is stored as relational records, and the system copies only the identifier to avoid duplication of other information. Handling of leaf-node changes using such an approach is relatively easy, but non-leaf nodes are more difficult to change. For example, if the JDBC category were re-parented under a different super-category, the system would have to remove the records (projectid-n, database-environment), (projectid-n, database-api). Also, as the number of filters specified increases, the query to the categorization system becomes more complex (more joins), and performance suffers as a result.
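The growth in joins can be sketched with Python's built-in `sqlite3` module. The table name `containment`, its columns, the sample rows, and the helper `items_in_all` are hypothetical; the sketch only illustrates that each additional filter category adds one more self-join to the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE containment (item_id TEXT, category TEXT)")
# Hypothetical records: every item is linked to each category that contains it,
# directly or through nesting.
conn.executemany(
    "INSERT INTO containment VALUES (?, ?)",
    [
        ("project-1", "database-environment"),
        ("project-1", "database-api"),
        ("project-1", "JDBC"),
        ("project-1", "Java"),
        ("project-2", "database-environment"),
        ("project-2", "database-api"),
        ("project-2", "XML-based"),
    ],
)


def items_in_all(conn, categories):
    """Return the items contained in every listed category, sorted by id."""
    if not categories:
        return []
    sql = "SELECT DISTINCT c0.item_id FROM containment AS c0"
    # One extra self-join per additional filter category.
    for i in range(1, len(categories)):
        sql += f" JOIN containment AS c{i} ON c{i}.item_id = c0.item_id"
    sql += " WHERE " + " AND ".join(
        f"c{i}.category = ?" for i in range(len(categories))
    )
    sql += " ORDER BY c0.item_id"
    return [row[0] for row in conn.execute(sql, list(categories))]


jdbc_and_java = items_in_all(conn, ["JDBC", "Java"])
```

Re-parenting a category in this scheme would additionally require deleting and re-inserting a row for every item beneath it, which the sketch does not attempt.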
In the specific area of e-mail message categorization, Google's Gmail™ offers another approach. In the Gmail system, each piece of e-mail can have one or more “labels” attached to it, thus allowing the same mail item to appear in multiple views. However, a significant shortcoming of the system is that labels cannot be nested. Given that limitation, the operations discussed above are relatively simple and efficient in this model.
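A flat label model of this kind can be sketched as a mapping from each label to a set of message identifiers. This is not a description of how Gmail is implemented; the class `LabelStore` and its methods are hypothetical and serve only to show why the operations stay simple when labels cannot be nested.

```python
from collections import defaultdict


class LabelStore:
    """Flat (non-nested) labels: each label maps to a set of message ids."""

    def __init__(self):
        self._by_label = defaultdict(set)

    def add_label(self, message_id, label):
        self._by_label[label].add(message_id)

    def remove_label(self, message_id, label):
        self._by_label[label].discard(message_id)

    def messages_with(self, *labels):
        """Return the messages carrying every one of the given labels."""
        if not labels:
            return set()
        sets = [self._by_label.get(label, set()) for label in labels]
        return set.intersection(*sets)


store = LabelStore()
store.add_label("msg-1", "work")
store.add_label("msg-1", "urgent")
store.add_label("msg-2", "work")
```

Each change touches a single set and each filter is a set intersection, but nothing in the structure can express that one label is contained in another.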
For the above reasons and others it would be desirable to have a new system for information item categorization that allows for nested categorizations and optionally allows a single category or item to be contained in multiple parent categories, that can handle dynamic categorization changes, and that is simpler and more efficient than previous solutions. The new system should be generally applicable to a variety of applications, and specifically applicable to categorization of e-mail messages.