Embodiments of the invention are generally directed to an approach for managing the classification of assets in a business glossary. More specifically, embodiments of the invention provide a variety of techniques for refining the manual classification or categorization of assets relative to a business glossary based on a set of attributes associated with the assets and the classification of other, similar assets.
Large organizations frequently use a variety of software applications and systems to define and manage a business glossary. The business glossary itself may provide a controlled vocabulary of terms used within the organization (and across sub-organizations). Terms in the business glossary represent the major information concepts in an organization and categories are used to organize terms into hierarchies. The business glossary allows data analysts, business analysts and subject matter experts to create a rich glossary of business terms, hierarchies and relationships. The business glossary links business concepts to technical metadata and can expose these linkages across the entire enterprise using a variety of user interfaces.
For example, a web-based tool may include a user interface for creating, managing, and sharing the controlled vocabulary of the business glossary. In addition to maintaining the controlled vocabulary, such an interface may provide a classification scheme along with a taxonomy of terms and categories and allow a steward to assign terms to business assets. A “steward” generally refers to a person within the organization with responsibility for a given information asset, typically a subject matter expert tasked with managing a group of terms. This assignment is often manual, with the steward relying on domain knowledge to perform the task.
However, it is well known that manual classification often results in naive assignments based on any appropriate class (term/category) that a steward identifies. That is, rather than examining all of the existing classes present in the glossary, a steward may assign assets to classifications on a “first best fit” basis. While this results in an accurate classification, the classification may be unnecessarily general for the classified asset and inconsistent with the classifications of similar assets. For example, a steward could assign an asset of a delivery truck to an asset classification of “vehicle” or “vehicle-truck,” even though a more specific term of “vehicle-truck-delivery” exists in the business glossary. Further, when two organizations merge (or one organization splits into smaller units), new assets may need to be classified, terms in distinct business glossaries may need to be merged and reconciled, etc.
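The delivery truck example above can be made concrete with a small sketch. The following Python is illustrative only and is not taken from any embodiment: the parent map `PARENT`, the extra term “vehicle-car,” and the helper names `descendants` and `is_most_specific` are all hypothetical. It shows how a “first best fit” assignment to “vehicle-truck” could be flagged as overly general because a more specific term exists beneath it in the hierarchy.

```python
# Hypothetical glossary: each term maps to its parent term (None for a root).
# The terms mirror the delivery truck example; "vehicle-car" is an added
# illustrative sibling and does not come from the source text.
PARENT = {
    "vehicle": None,
    "vehicle-truck": "vehicle",
    "vehicle-truck-delivery": "vehicle-truck",
    "vehicle-car": "vehicle",
}


def descendants(term, parent=PARENT):
    """Return every term strictly below `term` in the hierarchy, sorted."""
    found = []
    for candidate in parent:
        # Walk upward from the candidate's parent until we hit `term` or a root.
        node = parent[candidate]
        while node is not None:
            if node == term:
                found.append(candidate)
                break
            node = parent[node]
    return sorted(found)


def is_most_specific(term, parent=PARENT):
    """True when no more specific term exists beneath `term`."""
    return not descendants(term, parent)


# A steward's "first best fit" choice for a delivery truck asset.
assigned = "vehicle-truck"
if not is_most_specific(assigned):
    print(f"'{assigned}' has more specific terms: {descendants(assigned)}")
```

A check of this kind only reports that more specific terms exist; deciding which of them, if any, actually fits the asset would still require additional information about the asset.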
Various attempts have also been made to automatically assign assets to a particular class. Automatic classification mechanisms typically rely on external descriptions of the asset, above and beyond what already exists in a glossary, and then apply natural language processing techniques to extract features that may be useful in classification. Another approach has been to train a classifier using the existing manual classifications as a training dataset. However, the training itself relies on the manual assignments, which is often problematic for the reasons mentioned above.