1. Field of the Invention
The present invention relates to the field of categorization of items in general. More specifically, the present invention relates to the categorization of cases, such as documents, in a topic category within a hierarchical organization of cases.
2. Prior Art
With the increased amounts of data being generated, stored, and processed today, it is increasingly important for organizations to maintain their databases (e.g., collections of documents such as a customer support knowledge base) in an orderly manner. Many organizations rely on hierarchical schemes to organize their databases. A hierarchical organization of data utilizes successive levels of sub-categories which further narrow the scope of a category until a particular case (e.g., a document, file, program, etc.) is identified in the hierarchy. The advantage of such a system is that the hierarchy is easily navigated, even by users who are not expert with a particular database.
One problem with such a system is the reclassification of cases in a database hierarchy after the hierarchy has changed. Organizations may decide that it is necessary to change their classification scheme to better suit their needs. For example, the change may be motivated by a desire to make the database easier to navigate, and may involve creating new categories, merging categories, splitting old categories, or moving cases between categories. The subsequent reclassification of cases can often require as much effort as the original classification process and may be complicated by the fact that a single case can belong in multiple categories.
The worst-case approach to coping with these hierarchy changes is to classify every item into the new hierarchy from scratch, without leveraging any information about the old classification of items. Another method is to manually reclassify only those cases that are affected by the changes to the hierarchy (e.g., moving batches of cases from one category to another). If the changes are simple enough, such as renaming a category or creating a new category with no items in it, no reclassification is needed. However, most changes require much more effort and are difficult to implement. If this reclassification is performed manually, mis-classification becomes a real risk. The database hierarchies can contain millions of cases, and anyone reclassifying cases would require expert-level knowledge of the entire new hierarchy to perform the task correctly. There is a need for a solution that facilitates the migration of cases when changes are made to a hierarchy.
These problems are magnified in organizations that maintain multiple hierarchical databases containing similar information. These organizations may maintain separate hierarchies for a variety of reasons. For example, an organization may find accessing particular data more efficient when a variety of hierarchical schemes are employed rather than just one. In one hierarchy, cases may be organized according to the operating system they pertain to. Another hierarchy may organize cases according to what application they reference. Although such hierarchies are separate, there may be relationships among them. The same solution that helps with changes to a hierarchy can also be applied to use classification in one hierarchy to facilitate classification in another.
FIGS. 1A and 1B illustrate exemplary data hierarchies 100 and 101 used to organize data (e.g., business metrics, transformed data, and raw data) and information in an organization utilizing separate hierarchies. In FIG. 1A, a hierarchical database 100 has a root level directory 105 containing two sub-categories: operating system 1 (110) and operating system 2 (115). Operating system 1 has further sub-categories of hardware 120 and software 125.
In FIG. 1B, database 101 has a root level directory 150 containing two sub-categories: hardware 160 and software 165. Hardware 160 has been further sub-categorized with a category for printers 170. Software 165 has been further divided into categories for applications 175 and operating systems 176. Application 180 is a sub-category of applications 175, while operating system 1 and operating system 2 (190 and 195 respectively) are sub-categories of operating systems 176.
Hierarchy 100 represents a hierarchical scheme currently used by an organization. Hierarchy 101 represents a new hierarchical scheme that the organization is moving to, or one of a number of hierarchies used simultaneously by an organization. The data in both hierarchies is organized utilizing successive levels of sub-categories which further narrow the scope of a category until a particular case is identified in the hierarchy. For example, the user can navigate through hierarchical organization 101 by selecting an item from the top-level menu (e.g., either "hardware" or "software"). The user can then make further selections at each subsequent level of hierarchical organization 101. After selecting "software," a user can then select "applications" or "operating systems." The user can move backwards or forwards (up or down) in hierarchical organization 101; for example, from "operating systems," the user can move back up to "software", or to "operating system 1", or "operating system 2."
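As an illustration only, hierarchy 101 of FIG. 1B could be represented as nested dictionaries, with navigation expressed as a path of category names. The names HIERARCHY_101, children, and move_up are hypothetical and are not part of the described invention; this is a minimal sketch of the navigation just described, not a prescribed data structure.

```python
# Hypothetical encoding of hierarchy 101 (FIG. 1B) as nested dictionaries.
# Each key is a category name; its value holds that category's sub-categories.
HIERARCHY_101 = {
    "hardware": {"printers": {}},
    "software": {
        "applications": {"application": {}},
        "operating systems": {
            "operating system 1": {},
            "operating system 2": {},
        },
    },
}


def children(hierarchy, path):
    """Return the sorted sub-category names available at the given path."""
    node = hierarchy
    for name in path:
        node = node[name]  # raises KeyError if the path does not exist
    return sorted(node)


def move_up(path):
    """Return the path of the parent category (the root is the empty path)."""
    return path[:-1]


# Navigating down from the root, then back up one level.
top_level = children(HIERARCHY_101, ())
under_software = children(HIERARCHY_101, ("software",))
parent = move_up(("software", "operating systems"))
```

A path such as ("software", "operating systems") identifies a category unambiguously, which is convenient when the same name (for example "software") appears at different levels of different hierarchies.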
Accordingly, what is needed is a method of efficiently migrating data from one categorization hierarchy to another hierarchy. A further need exists for a method which meets the above need and allows categorization information to be shared among a plurality of related hierarchies such that the categorization of an item in one hierarchy is leveraged to facilitate the categorization of that item and others in another hierarchy.
The present invention facilitates efficient migration of data from one categorization hierarchy to another hierarchy. It can determine the best category in a new hierarchy for cases previously classified in an old hierarchy and can automatically derive a classifier for the new hierarchy to classify new items. The present invention can be used as a "virtual" classifier by combining classifiers for a plurality of related hierarchies. Classifications made in one categorization hierarchy (e.g., adding, deleting, or moving a document to a different category) are updated across the plurality of related hierarchies and can be used to help classify other documents in the related hierarchies as well.
Embodiments of the present invention are directed to a method of efficiently migrating data from one categorization hierarchy to a new hierarchy. Data, item, document, and/or case refer to any file, document, program, raw or processed data, or any information which may be contained in a data hierarchy. A mapping is created which describes where the cases in one hierarchy will be placed in a new hierarchy. The classifier of the first hierarchy is merged with this mapping to act as a classifier for the second hierarchy. Cases from the first hierarchy are classified in the new hierarchy using this merged mapping. In another embodiment, a training set of classified items is designated from a first hierarchy and mapped to a second hierarchy. Using machine learning, a classifier for the second hierarchy is created and used to classify subsequently migrated cases.
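The first embodiment summarized above, in which the classifier of the first hierarchy is merged with a mapping, can be sketched as simple function composition. Everything named here is an assumption made for illustration: the keyword-based old_classifier is a stand-in for whatever classifier the first hierarchy actually has, and the entries of MAPPING_100_TO_101 are example choices rather than a mapping given in the text.

```python
# Hypothetical mapping from categories of hierarchy 100 (FIG. 1A) to
# categories of hierarchy 101 (FIG. 1B). The specific pairs are illustrative.
MAPPING_100_TO_101 = {
    "operating system 1/hardware": "hardware/printers",
    "operating system 1/software": "software/operating systems/operating system 1",
    "operating system 2": "software/operating systems/operating system 2",
}


def old_classifier(text):
    """Toy stand-in for the first hierarchy's classifier (keyword rules)."""
    lowered = text.lower()
    if "os2" in lowered:
        return "operating system 2"
    if "printer" in lowered or "driver" in lowered:
        return "operating system 1/hardware"
    return "operating system 1/software"


def make_migrated_classifier(classifier, mapping):
    """Compose an old-hierarchy classifier with a category mapping.

    The result classifies a case directly into the new hierarchy, or
    returns None when the old category has no entry in the mapping.
    """
    def classify(case):
        return mapping.get(classifier(case))
    return classify


classify_in_101 = make_migrated_classifier(old_classifier, MAPPING_100_TO_101)
printer_result = classify_in_101("Printer driver missing after update")
os2_result = classify_in_101("OS2 installation fails")
```

Returning None for an unmapped category is one possible design choice; such cases could instead be queued for manual review.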
Migration of data using the present invention requires much less human effort, and is likely to be more accurate than manual reclassification. The accuracy of a classifier induced by machine learning depends directly on the size of the available training set. The present invention provides a way to transfer the old training set to the new hierarchy, reducing the cost and delay of obtaining a new training set large enough to induce an accurate classifier.
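The transfer of a training set can be sketched as relabeling each classified item through the mapping and then inducing a classifier from the relabeled items. The sketch below is a hedged illustration: the function names, the sample cases, and the mapping are hypothetical, and a deliberately simple word-count classifier stands in for whatever machine learning technique is actually used.

```python
from collections import Counter, defaultdict


def transfer_training_set(old_training_set, mapping):
    """Relabel (text, old_category) pairs with their new-hierarchy category.

    Items whose old category has no mapping are dropped from the result.
    """
    return [
        (text, mapping[category])
        for text, category in old_training_set
        if category in mapping
    ]


def train_word_vote_classifier(training_set):
    """Induce a toy classifier: score each category by shared word counts."""
    word_counts = defaultdict(Counter)
    for text, category in training_set:
        word_counts[category].update(text.lower().split())

    def classify(text):
        words = text.lower().split()
        scores = {
            category: sum(counts[word] for word in words)
            for category, counts in word_counts.items()
        }
        return max(scores, key=scores.get)

    return classify


# Illustrative data only; none of these cases come from the source text.
old_training_set = [
    ("printer driver fails to load", "operating system 1/hardware"),
    ("kernel update crashes on boot", "operating system 1/software"),
    ("obsolete internal note", "retired category"),
]
mapping = {
    "operating system 1/hardware": "hardware/printers",
    "operating system 1/software": "software/operating systems/operating system 1",
}

new_training_set = transfer_training_set(old_training_set, mapping)
new_classifier = train_word_vote_classifier(new_training_set)
prediction = new_classifier("printer will not load")
```

Because the relabeled items come from the existing classification, no new manual labeling is needed to obtain this initial training set for the new hierarchy.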
The present invention can act as a virtual classifier for multiple hierarchies in an organization, providing updated categorization information for multiple hierarchical databases. Cases classified in one hierarchy are used to help classify those cases in all of the other hierarchies to which a mapping exists. For example, if a domain expert makes a single classification in one hierarchy, that item can expand the training set used for all related hierarchies, thereby improving the accuracy of the derived classifiers for those hierarchies.
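One way to picture the virtual classifier's bookkeeping is to keep a training set per hierarchy and propagate every new classification through whatever mappings exist. The following is a sketch under stated assumptions: the hierarchy identifiers "100" and "101", the record_classification function, and the mapping contents are all hypothetical.

```python
def record_classification(training_sets, mappings, hierarchy, case, category):
    """Record a classification and propagate it to every mapped hierarchy.

    training_sets: dict of hierarchy name -> list of (case, category) pairs.
    mappings: dict of (source, destination) -> {source category: dest category}.
    """
    training_sets.setdefault(hierarchy, []).append((case, category))
    for (source, destination), mapping in mappings.items():
        if source == hierarchy and category in mapping:
            training_sets.setdefault(destination, []).append(
                (case, mapping[category])
            )


# Illustrative mapping between the two example hierarchies.
mappings = {
    ("100", "101"): {
        "operating system 2": "software/operating systems/operating system 2",
    },
}
training_sets = {}

# A domain expert classifies a single case in hierarchy 100.
record_classification(
    training_sets, mappings, "100", "OS2 boot loader error", "operating system 2"
)
```

After this single call, the training sets for both hierarchies contain the case, so a classifier derived for either hierarchy can benefit from the expert's one decision.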