The invention relates to a taxonomy system for enterprise data management and analysis.
The concept of taxonomy evolved from the life sciences. In the scientific community, taxonomies were conceived as a way to organize and categorize life forms into a structured and controlled hierarchy. In this approach, a plant or animal is placed in a single spot describing its hierarchical relationship to other plants and animals.
Classification or categorization of animals and plants is the arrangement of entities in a hierarchical series of classes. A class is defined as a collection of similar entities, where the similarity consists of the entities having attributes or traits in common. Corporate taxonomy, sometimes referred to as business taxonomy, is a hierarchy of categories used to classify documents, digital assets and other information. Taxonomies can cover virtually any type of physical or conceptual entity (products, processes, knowledge fields, human groups, etc.).
The application of taxonomy as a means to organize business content is a complex issue. Unlike the categorizing of life forms, categorizing business documents can be ambiguous. A document could be placed in multiple categories depending on the business context or expertise of the user. This added level of classification complexity in the business setting makes the design and construction of taxonomy structures more challenging. Corporate taxonomies are increasingly used in information systems, particularly content management and knowledge management systems, as a way to allow instant access to the right information within exponentially growing volumes of enterprise data.
Enterprise data management is the development and execution of policies, practices and procedures that properly manage enterprise data. The typical enterprise has data stored in a variety of data sources. Some aspects of data management are: security and risk management, legal discovery, storage resource management (SRM), information lifecycle management (ILM) and content-based archiving. In addition, some companies have their own internal management policies. A policy is an action or a set of actions to be performed on data. For instance, a policy action could be deletion of a category of data. Another aspect of data management is data auditing. Data auditing allows enterprises to validate compliance with federal regulations and ensures that data management objectives are being met.
Data analysis is the process of studying, culling, and summarizing data to extract useful information. It is used by businesses, financial organizations, academic institutions and government agencies to acquire meaningful insights from data and predict future trends. It helps to cut costs, improve revenue, reduce risks and enhance performance.
In many aspects of enterprise data management it is not enough to find a document that contains a term. What is needed is the relationship between documents and the ability to track changes in the relationship when new documents are added or existing documents are modified. For instance, when investigating payment plans, it is important to find documents that include modifications to payment plans and related contracts. Management policies should be applied to all related documents. When a policy is to retain, restore, or protect documents about payment plans, the policy should be applied to all related documents. As another example, when categorizing documents under “international commerce”, there is a need to find all documents that contain terms such as foreign trade, import, export and free trade. The relationship between documents also helps an enterprise to perform data analysis that leads to good business decisions. For instance, by finding documents related to payment plans, an enterprise can analyze the data and modify contracts to maximize revenue generation or reduce risks. A solution to the problem of determining the relationship between documents and their inter-dependencies is to provide a taxonomy that organizes and classifies enterprise data in a systematic and structured manner.
Categorization of data is based on metadata or full text search. Metadata consists of information that characterizes data. Sometimes it is referred to as “data about data”. Categorization rules specify how data is classified into different groups. Data categorization methods based on metadata group data according to information extracted from its metadata. A few examples of such information are: the time a document was last accessed, its owner, its type and its size. There are many methods for accessing and extracting information from metadata. Some methods utilize file system utilities. File system utilities can only extract file system metadata, sometimes referred to as file system attributes. Document parsers, sometimes called filters, are used for extracting metadata from documents, such as Microsoft Word, Microsoft PowerPoint and PDF files. Examples of such metadata are: document type, publisher, author, subject, etc. The three top commercial parsers being used now are: Stellent, KeyView and iFilter. Some software developers write their own parsers or use open source parsers such as Apache POI. Results of extracting metadata are stored in a repository accessible during classification. Classification based on full text utilizes a search index built by a search engine. Full text search is used to identify documents that contain specific terms, phrases or a combination of both. The result of the search is used to categorize data. One of the widely used open source search engines is Lucene.
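As an illustration of metadata-based categorization, the following minimal Python sketch reads file system attributes with the standard `os.stat` call and applies a few categorization rules to them. The record layout (`FileMetadata`), the category names, and the thresholds (10 MB, 365 days) are assumptions made for this example only; they do not come from any particular product, and no document parser is involved.

```python
import os
from dataclasses import dataclass
from typing import List


@dataclass
class FileMetadata:
    """File system attributes of one file (hypothetical record layout)."""
    path: str
    extension: str
    size_bytes: int
    last_accessed: float  # seconds since the epoch


def extract_metadata(path: str) -> FileMetadata:
    """Read file system metadata only; the document content is not parsed."""
    info = os.stat(path)
    extension = os.path.splitext(path)[1].lower()
    return FileMetadata(path, extension, info.st_size, info.st_atime)


def categorize(meta: FileMetadata, now: float,
               stale_after_days: int = 365) -> List[str]:
    """Apply illustrative categorization rules to one metadata record."""
    categories = []
    # Rule on document type, derived from the file extension.
    if meta.extension in {".doc", ".docx", ".ppt", ".pptx", ".pdf"}:
        categories.append("office-documents")
    # Rule on size (assumed threshold of 10 MB).
    if meta.size_bytes > 10 * 1024 * 1024:
        categories.append("large-files")
    # Rule on the time the file was last accessed.
    if now - meta.last_accessed > stale_after_days * 86400:
        categories.append("stale")
    return categories
```

In a fuller system the extracted records would be written to a repository and the rules evaluated against that repository during classification, rather than against the file system directly.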
There are many classification technologies for building corporate taxonomies. The most common ones are: Rule-based, Bayesian, Linguistic and Semantic, Support Vector Machine and neural networks. Bayesian, Linguistic and Semantic, Support Vector Machine, neural networks, and similar methods are collectively referred to as non-rule-based classification methods.
Rule-based systems define the criteria by which a document is classified. The rule measures how well a given document meets the criteria for membership in a particular class. The rule can be applied to the content or the metadata. Boolean logic is used to combine words and phrases using the words AND, OR, and NOT (otherwise known as Boolean operators) to define classification more accurately. For example, a rule could specify that all documents whose content includes the terms (“San Francisco” OR “Los Angeles” OR “San Diego”) AND “California” be included in a category called “Cities, California”. Besides the content of documents, rules can be applied to metadata. For instance, a rule may specify that only Microsoft Word and Excel documents owned by marketing OR sales, and created during the current fiscal year, be included in a category called “Marketing and Sales, Current Fiscal Year”. Rules must be carefully articulated and made as unambiguous as possible. A user may elect to provide a complex query all at once, or through a series of interactions with the system, thus potentially broadening or narrowing the scope of the query.
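The “Cities, California” rule can be sketched in Python as a composition of small Boolean predicates. This is an illustrative sketch only: it assumes the rule is read as (any of the three city names) AND “California”, it uses naive case-insensitive substring matching instead of a search index, and the helper names `term`, `any_of`, `all_of`, `negate` and `classify` are hypothetical.

```python
from typing import Callable, Dict, List

# A rule is any function that maps document text to True or False.
Rule = Callable[[str], bool]


def term(phrase: str) -> Rule:
    """Rule that matches when the phrase occurs in the text (case-insensitive)."""
    needle = phrase.lower()
    return lambda text: needle in text.lower()


def any_of(*rules: Rule) -> Rule:
    """Boolean OR over the given rules."""
    return lambda text: any(rule(text) for rule in rules)


def all_of(*rules: Rule) -> Rule:
    """Boolean AND over the given rules."""
    return lambda text: all(rule(text) for rule in rules)


def negate(rule: Rule) -> Rule:
    """Boolean NOT of the given rule."""
    return lambda text: not rule(text)


# ("San Francisco" OR "Los Angeles" OR "San Diego") AND "California"
cities_california = all_of(
    any_of(term("San Francisco"), term("Los Angeles"), term("San Diego")),
    term("California"),
)


def classify(text: str, rules: Dict[str, Rule]) -> List[str]:
    """Return every category whose rule the text satisfies."""
    return [category for category, rule in rules.items() if rule(text)]
```

Making the grouping explicit in code removes the ambiguity of operator precedence that a flat chain of OR and AND terms would otherwise leave open.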
Bayesian probability theory is a means of quantifying uncertainty. The theorem defines a rule for refining a hypothesis by factoring in additional evidence and background information, and leads to a number representing the degree of probability that the hypothesis is true. It uses statistical models from words in training sets, and uses pattern analysis to assign the probability of correlation.
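One common way to apply this idea to documents is a naive Bayes classifier, sketched below in Python. The sketch is an assumption about how such a classifier might look, not a description of any specific product: it treats words as independent, tokenizes by whitespace, and uses add-one (Laplace) smoothing so that unseen words do not zero out a category. The class name and category labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


class NaiveBayes:
    """Minimal multinomial naive Bayes text classifier (illustrative only)."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = {}  # label -> word frequencies
        self.doc_counts: Counter = Counter()       # label -> training documents
        self.vocabulary = set()

    def train(self, label: str, text: str) -> None:
        """Add one labeled training document to the statistical model."""
        words = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def posteriors(self, text: str) -> Dict[str, float]:
        """Return P(label | text) for every trained label."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.word_counts.items():
            # Prior: background information about how common the label is.
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + len(self.vocabulary)
            for word in words:
                # Likelihood of each word as evidence, with add-one smoothing.
                score += math.log((counts[word] + 1) / denominator)
            log_scores[label] = score
        # Normalize in a numerically stable way so the values sum to 1.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}
```

After training on a few labeled documents per category, `posteriors` returns a number between 0 and 1 for each category, which plays the role of the degree of probability described above.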
The linguistic and semantic approach is language dependent. Documents are clustered or grouped according to the meaning of words, using thesauri, custom dictionaries (e.g. a dictionary of abbreviations), parts-of-speech analyzers, rule-based and probabilistic grammar, recognition of idioms, verb chain recognition, noun phrase identifiers, and linguistic or semantic clustering.
A Support Vector Machine (SVM) calculates the maximum separation, in multiple dimensions, of one document from another. Each document is represented as a vector. SVM continuously analyzes documents and separates them into either the relevant space or the irrelevant space.
A neural network utilizes artificial intelligence to build an interconnected system of processing elements, each with a limited number of inputs and outputs. Rather than being programmed, these systems learn to recognize patterns. Neural networks are an information processing technique based on the way biological nervous systems, such as the brain, process information.
Bayesian, Linguistic and Semantic, Support Vector Machine and neural networks classification methods often require complex computation and could be referred to as computationally intensive classification methods. The complexity is exacerbated by the intricacy and size of enterprise data. In this document, the terms “computationally intensive classification” and “non-rule-based classification” are used interchangeably.
Because of data explosion, some enterprises and organizations have hundreds of millions of documents of different formats stored in hundreds of terabytes of storage. Classification of voluminous data is very costly because of the computational resources needed and the time it takes to process all data. The time required to perform data classification can run into weeks, in particular when computationally intensive classification methods are used. For this reason many existing solutions process only a subset of the data, namely the subset that staff members in the information technology (IT) department think contains all relevant information. This subset is usually a small percentage, sometimes on the order of 5% of the total data. The assumption here is that excluded data does not include relevant information. This assumption is difficult to validate and is risky. Many times enterprises were fined by regulators because of their failure to comply with government regulations related to protection of all private data. The failure is due to the fact that the subset of data processed does not include all data that needs protection. In conclusion, processing a subset of the data results in inaccurate classification, which in turn results in imprecise and incomplete taxonomies.
During the last few years, many enterprises have adopted cloud computing. Cloud computing has many definitions. According to the National Institute of Standards and Technology (NIST), cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Cloud computing adoption is accelerating. For this reason it is imperative that data in a cloud computing environment be included in taxonomies.
The procedure for building taxonomies can be divided into two approaches: rule-based classification and non-rule-based classification. The non-rule-based classification is sometimes referred to as machine learning. In the first approach, a domain expert decides on the structure of the taxonomy tree and assigns a rule-based classification to each node of the tree. Once a rule is set up, large numbers of incoming documents are automatically gauged against the rule for membership of a taxonomy node. In machine learning based classification, an administrator presents the system with positive and negative sample documents for membership in a node. Positive sample documents are expected to be included in a node. Negative sample documents are not included. From these samples the system learns and builds a model, which is then used to decide whether a subsequently presented document belongs to a node. The effort for training can be significant, as the administrator may have to provide several hundred to thousands of documents for training the system. A variety of mathematical models and algorithms are used in automatic classifiers based on machine learning, including several flavors of Bayesian inference, Bayesian networks, neural networks, and Support Vector Machines.
The complexity of enterprise data and the information explosion have reached the point where no information architect within an organization can build a comprehensive taxonomy that fully describes the data in a manner similar to life science taxonomy. The problem is exacerbated by the ambiguity of languages. Even when using automated classification methods, information architects lack an adequate grasp of all the various themes and topics represented in enterprise data and for this reason cannot be relied on to extract a hierarchy of concepts and label them into a comprehensive taxonomy. Even within a single organization, it is extremely difficult to build a single taxonomy tree that covers all enterprise data and captures important relationships between all documents. Such trees are non-existent.
Because of the prohibitive difficulty of building a single taxonomy tree that covers all enterprise data, in particular when data is voluminous, it is time for a new pragmatic approach that includes all data. The approach is based on defining taxonomy themes to solve different aspects of enterprise data management and data analysis. For each theme, a taxonomy is built by classifying data related to the theme. When data is voluminous, irrelevant data that is not related to the theme and does not impact the accuracy of the taxonomy is filtered out first. This results in less data to be processed by the taxonomy.
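The theme-based approach can be pictured as a two-stage pipeline: an inexpensive relevance filter runs over all data first, and only the documents that pass it are handed to the more costly classifier for that theme. The Python sketch below is a hedged illustration under stated assumptions; the `Document` and `Theme` types, the `build_taxonomy` function, and the idea of expressing the filter and the classifier as plain callables are hypothetical choices for this example and are not prescribed by the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Document:
    name: str
    text: str


@dataclass
class Theme:
    """One taxonomy theme (hypothetical structure)."""
    name: str
    is_relevant: Callable[[Document], bool]  # cheap filter applied to all data
    classify: Callable[[Document], str]      # costlier step, returns a node label


def build_taxonomy(theme: Theme,
                   documents: Iterable[Document]) -> Dict[str, List[str]]:
    """Filter out data unrelated to the theme, then classify the remainder."""
    taxonomy: Dict[str, List[str]] = {}
    for doc in documents:
        if not theme.is_relevant(doc):
            continue  # irrelevant data never reaches the classifier
        node = theme.classify(doc)
        taxonomy.setdefault(node, []).append(doc.name)
    return taxonomy


# Usage with an assumed "payment plans" theme and keyword-based callables.
payment_theme = Theme(
    name="payment plans",
    is_relevant=lambda d: "payment" in d.text.lower(),
    classify=lambda d: "modifications" if "amend" in d.text.lower() else "plans",
)
corpus = [
    Document("a.doc", "Payment plan for customer X"),
    Document("b.doc", "Amendment to the payment plan"),
    Document("c.doc", "Cafeteria menu for next week"),
]
result = build_taxonomy(payment_theme, corpus)
```

Because every document passes through the filter, no data is excluded by guesswork, while the expensive classification step sees only the theme-related portion.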