The present invention relates to a method within the area of information mining within a multitude of documents stored on computer systems. More particularly, the invention relates to a computerized method of generating a content taxonomy of a multitude of electronic documents.
Organizations generate and collect large volumes of data, which they use in daily operations. Yet many companies are unable to capitalize fully on the value of this data because information implicit in the data is not easy to discern. Operational systems record transactions as they occur, day and night, and store the transaction data in files and databases. Documents are produced and placed in shared files or in repositories provided by document management systems. The growth of the Internet, and its increased worldwide acceptance as a core channel both for communication among individuals and for business operations, has multiplied the sources of information and therefore the opportunities for obtaining competitive advantages. Business Intelligence Solutions is the term that describes the processes that together are used to enable improved decision making. Information mining is the process of data mining and/or text mining. It uses advanced technology to glean valuable insights from these sources, enabling the business user to make the right business decisions and thus to obtain the competitive advantages required to thrive in today's competitive environment. Information mining in general generates previously unknown, comprehensible, and actionable information from any source, including transactions, documents, e-mail, web pages, and others, and uses it to make crucial business decisions.
Data is the raw material. It can be a set of discrete facts about events, in which case it is most usefully described as structured records of transactions, usually of numeric or literal type. But documents and Web pages are also a source of unstructured data, delivered as a stream of bits which can be decoded as words and sentences of text in a certain language. Industry analysts estimate that unstructured data represents 80% of an enterprise's information, compared to 20% for structured data; it comprises data from different sources, such as text, image, video, and audio. Text is, however, the predominant variety of unstructured data.
The IBM Intelligent Miner Family is a set of offerings that enables the business professional, and in general any knowledge worker, to use the computer to generate meaningful information and useful insights from both structured data and text. Although the general problems to solve (e.g. clustering, classification) are similar for the different data types, the technology used in each case is different, because it needs to be optimized for the media involved, the user needs, and the best use of the computing resources. For that reason, the IBM Intelligent Miner Family is comprised of two specialized products: the IBM Intelligent Miner for Data, and the IBM Intelligent Miner for Text.
Information mining has been defined as the process of generating previously unknown, comprehensible, and actionable information from any source. This definition exposes the fundamental differences between information mining and the traditional approaches to data analysis, such as query and reporting and online analytical processing (OLAP) for structured data, and full text search for textual data. In essence, information mining is distinguished by the fact that it is aimed at the discovery of information and knowledge, without a previously formulated hypothesis. By definition, the information discovered through the mining process must have been previously unknown, that is, it is unlikely that the information could have been hypothesized in advance. For structured data, the interchangeable terms "data mining" and "knowledge discovery in databases" describe a multidisciplinary field of research that includes machine learning, statistics, database technology, rule based systems, neural networks, and visualization. "Text mining" technology is also based on different approaches of the same technologies; moreover it exploits techniques of computational linguistics.
Both data mining and text mining share key concepts of knowledge extraction, such as the discovery of which features are important for clustering, that is, finding groups of similar objects that differ significantly from other objects. They also share the concept of classification, which refers to determining the class to which a given database record belongs, in the case of data mining, or to which a document belongs, in the case of text mining. The classification schema can be discovered automatically through clustering techniques (the machine finds the groups or clusters and assigns to each cluster a generalized title or cluster label that becomes the class name). In other cases the taxonomy can be provided by the user, and the process is called categorization.
Many of the technologies and tools developed in information mining are dedicated to the task of discovery and extraction of information or knowledge from text documents, called feature extraction. The basic pieces of information in text (such as the language of the text, or company names or dates mentioned) are called features. Information extraction from unconstrained text is the extraction of the linguistic items that provide representative or otherwise relevant information about the document content. These features are used to assign documents to categories in a given scheme, group documents by subject, focus on specific parts of information within documents, or improve the quality of information retrieval systems. The extracted features can also serve as meta data about the analyzed documents. Extracting implicit data from text can be interesting for many reasons; for instance:
to highlight important information e.g. to highlight important terms in documents. This can give a quick impression whether the document is of any interest.
to find names of competitors e.g. when doing a case study in a certain business area one can do a names extraction on the documents that one has received from different sources and then sort them by names of competitors.
to find and store key concepts. This could replace a text retrieval system where huge indexes are not appropriate but only a few key concepts of the underlying document collection should be stored in a database.
to use related topics for query refinement e.g. store the key concepts found in a database and build an application for query refinement on top of it. Thus topics that are related to the users' initial queries can be suggested to help them refine their queries.
Feature extraction from texts, and the harvesting of crisp and vague information, require sophisticated knowledge models, which tend to become domain specific. A recent research prototype has been disclosed by J. Mothe, T. Dkaki, B. Dousset, "Mining Information in Order to Extract Hidden and Strategic Information", Proceedings of Computer-Assisted Information Searching on Internet, RIAO97, pp 32-51, June 1997.
A further technology of major importance in information mining is dedicated to the task of clustering of documents. Within a collection of objects a cluster could be defined as a group of objects whose members are more similar to each other than to the members of any other group. In information mining clustering is used to segment a document collection into subsets, the clusters, with the members of each cluster being similar with respect to certain interesting features. For clustering no predefined taxonomy or classification schemes are necessary. This automatic analysis of information can be used for several different purposes:
to provide an overview of the contents of a large document collection;
to identify hidden structures between groups of objects e.g. clustering allows related documents to be connected by hyperlinks;
to ease the process of browsing to find similar or related information e.g. to get an overview of documents;
to detect duplicate and almost identical documents in an archive.
Typically, the goal of cluster analysis is to determine a set of clusters, or a clustering, in which the inter-cluster similarity is minimized and intra-cluster similarity is maximized. In general, there is no unique or best solution to this task. A number of different algorithms have been proposed that are more or less appropriate for different data collections and interests. Hierarchical clustering works especially well for textual data. In contrast to flat or linear clustering, where the clusters have no genuine relationship, the clusters in a hierarchical approach are arranged in a clustering tree where related clusters occur in the same branch of the tree. Clustering algorithms have a long tradition. Examples and overviews of clustering algorithms may be found in M. Iwayama, T. Tokunaga, "Cluster-Based Text Categorization: A Comparison of Category Search Strategies", in: Proceedings of SIGIR 1995, pp 273-280, July 1995, ACM, or in Y. Maarek, A. J. Wecker, "The Librarian's Assistant: Automatically organizing on-line books into dynamic bookshelves", in: Proceedings of RIAO '94, Intelligent Multimedia, IR Systems and Management, N.Y., 1994.
A further technology of major importance in information mining is dedicated to the task of categorization of documents. In general, to categorize objects means to assign them to predefined categories or classes from a taxonomy. The categories may be overlapping or distinct, depending on the domain of interest. For text mining, categorization can mean to assign categories to documents or to organize documents with respect to a predefined organization. Categorization in the context of text mining means to assign documents to preexisting categories, sometimes called topics or themes. The categories are chosen to match the intended use of the collection and have to be trained beforehand. By assigning documents to categories, text mining can help to organize them. While categorization cannot replace the kind of cataloging a librarian does, it provides a much less expensive alternative.
State of the art technologies for taxonomy generation suffer from several deficiencies, such as:
the problem of navigational balance: the taxonomy must be well-balanced for navigation by an end-user. In particular, the fan-out at each level of the hierarchy must be limited, the depth must be limited, and there must not be empty nodes.
the problem of orientation: nodes in the taxonomy should reflect "concepts" and give sufficient orientation for a user traversing the taxonomy.
the problem of coherence and selectivity: the leaf nodes in the taxonomy should be maximally coherent with all assigned documents having the same thematic content. Related documents from different nodes should appear within short distance in the taxonomy structure.
The most important problems of the current state of the art technologies for taxonomy generation are
the problem of scalability: any document of the collection must be assigned to some leaf node in the taxonomy, and the whole taxonomy generation process must be applicable to a significantly larger number of documents while still generating a taxonomy within a reasonable amount of time.
the problem of domain-independence: no hand-coded knowledge of the domain to be analyzed, derived from an analysis of the given document collection, should be needed to steer and to speed up the taxonomy generation process.
The invention is based on the objective to improve the scalability of a taxonomy generation process, allowing a taxonomy generation method to cope with increasing numbers of documents to be analyzed in a reasonable amount of time. It is a further objective of the current invention to improve said scalability and at the same time to guarantee domain-independence of the taxonomy generation method.
The current invention teaches a method of generating a content taxonomy of a multitude of documents (210) stored on a computer system, said method being executable by a computer system. The fundamental approach of the current invention comprises a subset-selection-step (201), wherein a subset of said multitude of documents is selected. In a taxonomy-generation-step (202 to 205) a taxonomy is generated for that selected subset of documents, said taxonomy being a tree-structured taxonomy-hierarchy. Said subset is divided into a set of clusters with largest intra-similarity, and each of said clusters of largest intra-similarity is assigned to a leaf-node of said taxonomy-hierarchy as an outer-cluster. The inner-nodes of said taxonomy-hierarchy order said subset, starting with said outer-clusters, into inner-clusters with increasing cluster-size and decreasing similarity. Moreover said method comprises a routing-selection-step (206), wherein for each unprocessed document of said multitude of documents not belonging to said subset its similarities with said outer-clusters are computed and said document is assigned to the leaf-node of said taxonomy-hierarchy comprising the outer-cluster with largest similarity.
The technique proposed by the current invention is able to improve at the same time the scalability and the coherence and selectivity of taxonomy generation. Scalability is provided as the taxonomy generation step, being the most time consuming part of the overall process, operates on the selected subset of documents only. This approach alone would not be sufficient for solving the overall problem. It is due to the remaining features of claim 1 that the taxonomy of a reasonably selected and reasonably sized subset of documents is already a stable taxonomy with respect to the complete multitude of documents. The introduction of a separate routing selection step allows the mass of the documents to be assigned very efficiently to an already computed taxonomy. By exploiting a hierarchical taxonomy approach, the leaf nodes in the taxonomy are coherent, with all assigned documents having the same thematic content, and related documents from different nodes appear within short distance in the taxonomy structure. The taxonomy generated according to the current invention is very stable, i.e. increasing the size of a reasonably sized subset of documents will not change the taxonomy in any essential manner. Moreover the proposed method is completely domain independent, i.e. no hand-coded knowledge of the domain to be analyzed, derived from an analysis of the given document collection, is required to steer and to speed up the taxonomy generation process. As a result the complete taxonomy generation process is fully automatic and does not require any human intervention or adaptation.
Additional advantages are accomplished by the aspect that said taxonomy-generation-step comprises a first-feature-extraction-step (202) extracting for each document of said subset its features and computing its feature statistics in a feature vector (212) as a representation of said document.
Introducing a distinct feature extraction step increases the flexibility of the proposed method, as it becomes possible to exploit different feature extraction technologies depending on the intended purpose of the taxonomy, on the document domain, and on the characteristics of the various feature extraction technologies. Storing the results of the time-consuming computation of the feature-vectors speeds up processing, as the feature-vectors can be used again in later processing steps.
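As an illustration only, the idea of computing a feature-vector once per document and storing it for later steps might be sketched as follows. The tokenizer, the use of plain word counts as "feature statistics", and the names `feature_vector` and `vector_cache` are assumptions made for this sketch and are not taken from the described method.

```python
import re
from collections import Counter

def feature_vector(text: str) -> Counter:
    """Count lower-cased word tokens.

    Assumption: plain word counts stand in for whichever feature
    extraction technology is actually chosen.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Hypothetical two-document subset, keyed by document id.
documents = {
    "d1": "Taxonomy generation for document collections",
    "d2": "Clustering of document collections",
}

# Compute each vector once and keep it, so that clustering and
# categorization training can reuse it without re-extraction.
vector_cache = {doc_id: feature_vector(text) for doc_id, text in documents.items()}
```

Keeping the vectors in a dictionary keyed by document id is one simple way to make them available to later steps; a persistent store would serve the same purpose for large subsets.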
Additional advantages are accomplished by the aspect that said taxonomy-generation-step comprises a clustering-step (203) using a hierarchical clustering algorithm to generate said taxonomy-hierarchy and using said feature-vectors for determining similarity.
Using a hierarchical clustering algorithm that works bottom-up, i.e. one that starts with clusters comprising a single document each and then works upwards by merging clusters until the root cluster has been generated, guarantees good coherence and selectivity of the taxonomy. Moreover the hierarchical clustering algorithm provides good orientation for a user traversing the taxonomy from "higher" nodes, i.e. nodes representing more abstract concepts, to "lower" nodes, i.e. nodes representing more concrete document information, and vice versa.
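A minimal sketch of such a bottom-up procedure is given below. It assumes cosine similarity between summed feature counts as the similarity measure and a naive exhaustive search for the closest pair; neither choice is prescribed by the text, and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def agglomerate(vectors: dict):
    """Merge the two most similar clusters until a single root remains.

    Returns the tree as nested 2-tuples whose leaves are document ids.
    Assumes `vectors` is non-empty. A cluster is represented by the sum
    of its members' vectors (an assumption of this sketch).
    """
    clusters = [(doc_id, Counter(vec)) for doc_id, vec in vectors.items()]
    while len(clusters) > 1:
        best = None  # (similarity, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical feature-vectors for three documents.
tree = agglomerate({
    "a": Counter(hardware=3, computer=1),
    "b": Counter(hardware=2, computer=1),
    "c": Counter(football=3),
})
```

In this example the two hardware documents are merged first and the unrelated document joins only at the root, which mirrors the ordering from concrete to abstract nodes described above. The exhaustive pair search is quadratic per merge and is meant for illustration, not for large subsets.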
Additional advantages are accomplished by the aspect that said taxonomy-generation-step comprises a categorization-training-step (205) computing for each of said clusters a category-scheme (215) as a representation of said cluster, wherein said category-scheme comprises the feature-statistics of said cluster, calculated from said feature-vectors of each document of said cluster.
By combining the feature-vectors of the documents forming a certain cluster into a single quantity, the category-scheme, reflecting the comprised feature statistics, a cluster can be treated in certain ways "like a single document", which speeds up similarity comparison of an unprocessed document in the routing-selection-step significantly.
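One simple way to combine the members' feature-vectors into a single quantity is to add up their counts, as sketched below. Summation is an assumption of this sketch; the text does not fix the exact statistic, and the name `category_scheme` is used here only as a function name.

```python
from collections import Counter

def category_scheme(member_vectors) -> Counter:
    """Sum the members' feature counts into one vector for the cluster.

    Assumption: plain summation; averaging or weighting would be
    equally plausible choices.
    """
    scheme = Counter()
    for vec in member_vectors:
        scheme.update(vec)  # adds counts feature by feature
    return scheme

# Hypothetical cluster of two documents.
scheme = category_scheme([Counter(computer=1, hardware=2), Counter(hardware=1, printer=4)])
```

Because the result has the same shape as a document's feature-vector, any similarity function defined on two documents can be applied unchanged to a document and a cluster.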
Additional advantages are accomplished by the aspect that said routing-step comprises a parallel feature-extraction step extracting for each of said unprocessed documents its features and computing its feature statistics in a feature-vector as a representation of said unprocessed document. In said routing-step said similarities between said unprocessed document and each of said outer-clusters are computed by comparing said feature-vector of said unprocessed document with said category-scheme of said cluster.
Based on the above approach, the similarity calculation of an unprocessed document with respect to an outer-cluster is very efficient, as only two feature-vectors have to be compared: the feature-vector of the unprocessed document and the category-scheme.
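The routing of one unprocessed document can be sketched as a single comparison per outer-cluster. Cosine similarity and the names `cosine` and `route` are assumptions of this sketch; the text only requires that the feature-vector be compared with each category-scheme.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def route(doc_vector: Counter, schemes: dict) -> str:
    """Return the id of the outer-cluster whose category-scheme is most similar."""
    return max(schemes, key=lambda cluster_id: cosine(doc_vector, schemes[cluster_id]))

# Hypothetical category-schemes of two leaf nodes.
schemes = {
    "hw": Counter(computer=5, hardware=7),
    "sw": Counter(computer=4, software=9),
}
target = route(Counter(software=2, computer=1), schemes)
```

The cost per document grows with the number of outer-clusters rather than with the number of documents already in the taxonomy, which is the source of the efficiency described above.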
Additional advantages are accomplished by the aspect that said first-feature-extraction-step and/or said second-feature-extraction-step extract features based on lexical affinities within said documents.
Exploiting lexical affinity technology allows the proposed method to determine (in a domain independent manner) multi-word phrases, which carry much more semantic meaning than single terms. Thus orientation for the users is improved, as the taxonomy is able to reflect "concepts".
Additional advantages are accomplished by the aspect that said first-feature-extraction-step and/or said second-feature-extraction-step extract features based on linguistic features within said documents.
Exploiting linguistic features technology allows the proposed method to determine (in a domain independent manner) names of people, organizations, locations, domain terms (multi-word terms) and other significant phrases from the text. Different variants are associated with a single canonical form. Thus in cases where "names" are of importance the proposed feature improves orientation and selectivity for the users.
Additional advantages are accomplished by the aspect that said lexical affinities are extracted with a window of M words to identify co-occurring words.
Allowing the window size used to determine lexical affinities to be adjusted gives the freedom to trade processing time against the complexity of the extracted features. M being a natural number with 1 < M ≤ 5 represents a reasonable trade-off.
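A sketch of window-based extraction of co-occurring word pairs follows. It assumes that a window of M words means M consecutive tokens, so that two words form a pair when their positions differ by less than M; it also omits stop-word removal and sentence boundaries, which a production extractor would likely handle. The function name and tokenizer are hypothetical.

```python
import re
from collections import Counter

def lexical_affinities(text: str, window: int = 5) -> Counter:
    """Count unordered word pairs that co-occur within `window` consecutive tokens.

    Assumption: `window` plays the role of M, so window=1 yields no pairs,
    which is consistent with requiring M > 1.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Partners are the next (window - 1) tokens after position i.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs

narrow = lexical_affinities("computer hardware and computer software", window=2)
wide = lexical_affinities("computer hardware and computer software", window=3)
```

With the narrow window only adjacent words are paired; widening it to three tokens also pairs words one position apart, which yields more and richer features at the price of more pairs to count.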
Additional advantages are accomplished by the aspect that extracted features, which occurred with a high statistical frequency and/or with a low statistical frequency, are ignored.
Using this approach the proposed method is able to concentrate on features which are highly selective. Ignoring high-frequency features excludes features which occur in many documents and thus are almost common to the documents. Filtering low-frequency features avoids overfitting at the lower end of the taxonomy.
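The filtering can be sketched with document frequency as the measure of "statistical frequency". Both that choice and the thresholds `min_df` and `max_df_ratio` are assumptions made for illustration; the text does not state concrete cut-off values.

```python
from collections import Counter

def filter_features(vectors: dict, min_df: int = 2, max_df_ratio: float = 0.5) -> dict:
    """Keep only features whose document frequency lies between the two cut-offs.

    A feature is dropped if it occurs in fewer than `min_df` documents or in
    more than `max_df_ratio` of all documents (hypothetical thresholds).
    """
    n_docs = len(vectors)
    doc_freq = Counter()
    for vec in vectors.values():
        doc_freq.update(vec.keys())  # each document counts once per feature
    keep = {f for f, df in doc_freq.items() if df >= min_df and df / n_docs <= max_df_ratio}
    return {
        doc_id: Counter({f: c for f, c in vec.items() if f in keep})
        for doc_id, vec in vectors.items()
    }

# Hypothetical feature-vectors of four documents.
filtered = filter_features({
    "d1": Counter(the=3, cluster=1, rare=1),
    "d2": Counter(the=2, cluster=2),
    "d3": Counter(the=1, tree=1),
    "d4": Counter(the=4, tree=2),
})
```

Here "the" is removed because it occurs in every document and "rare" because it occurs in only one, leaving the features that actually separate the documents.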
Additional advantages are accomplished by the aspect that the depth of the taxonomy-hierarchy is limited to L levels by the slicing technique, merging the most similar clusters into common clusters until said taxonomy-hierarchy comprises L levels.
This slicing is of importance to provide good orientation to a user. By merging clusters, higher levels of abstraction, forming more general "concepts", are introduced. Moreover navigational balance in the taxonomy hierarchy is provided. An adaptable parameter L introduces freedom to tune the methodology. The parameter L being a natural number from the range 1 ≤ L ≤ 12 represents a reasonable trade-off.
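The slicing procedure is not spelled out at this point, so the following is only one possible reading, offered as a sketch. It assumes a tree of dictionaries with a merge similarity `sim` per inner node, counts a node whose children are all leaves as one level, and repeatedly absorbs into its parent the inner node whose similarity is closest to that of the parent until at most L levels remain. The representation and the names `depth` and `slice_tree` are hypothetical.

```python
def depth(node) -> int:
    """Levels of inner nodes on the longest path; a leaf (a str) counts as 0."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(child) for child in node["children"])

def slice_tree(root, max_levels: int):
    """Flatten the tree in place until depth(root) <= max_levels.

    Assumed rule: absorb the non-root inner node whose similarity is
    closest to its parent's, i.e. merge the two most similar nested clusters.
    """
    if max_levels < 1:
        raise ValueError("max_levels must be at least 1")
    while depth(root) > max_levels:
        best = None  # (similarity gap, parent node, index of child in parent)
        stack = [root]
        while stack:
            parent = stack.pop()
            for idx, child in enumerate(parent["children"]):
                if isinstance(child, str):
                    continue
                gap = abs(child["sim"] - parent["sim"])
                if best is None or gap < best[0]:
                    best = (gap, parent, idx)
                stack.append(child)
        _, parent, idx = best
        # Replace the absorbed node by its own children.
        parent["children"][idx : idx + 1] = parent["children"][idx]["children"]
    return root

# Hypothetical dendrogram with three levels of inner nodes.
dendrogram = {"sim": 0.1, "children": [
    {"sim": 0.4, "children": [{"sim": 0.9, "children": ["a", "b"]}, "c"]},
    {"sim": 0.15, "children": ["d", "e"]},
]}
sliced = slice_tree(dendrogram, max_levels=2)
```

In this example the loosely bound inner nodes are dissolved into the root while the tight pair of documents survives as its own cluster, so the remaining nodes correspond to the most distinct groupings.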
Additional advantages are accomplished by the aspect that said taxonomy-generation-step comprises a labeling-step (204) labeling each node in the taxonomy-hierarchy.
Labeling the generated taxonomy is of central advantage for usability of the generated taxonomy.
Additional advantages are accomplished by the aspect that the N most frequent features of a cluster of a node in the taxonomy-hierarchy are used as labels.
For disambiguation of overlapping semantic concepts, the first N distinguishing high-frequency features are displayed in addition to the common features. Thus related concepts like "computer hardware" and "computer software" are displayed as "computer, hardware" and "computer, software", if they happen to occur together in a cluster structure. This approach turned out to be a good approximation of the "concepts" enclosed within the documents of a cluster.
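A label built from the N most frequent features of a cluster can be sketched as below. The alphabetical tie-break, the comma-separated output format, and the name `label_cluster` are assumptions of this sketch; the separate handling of common versus distinguishing features is not modelled.

```python
from collections import Counter

def label_cluster(scheme: Counter, n: int = 2) -> str:
    """Join the n most frequent features of a cluster into a label.

    Ties are broken alphabetically so that labels are reproducible
    (an assumption of this sketch).
    """
    ranked = sorted(scheme.items(), key=lambda item: (-item[1], item[0]))
    return ", ".join(feature for feature, _ in ranked[:n])

# Hypothetical feature statistics of two sibling clusters.
hardware_label = label_cluster(Counter(computer=9, hardware=6, printer=2))
software_label = label_cluster(Counter(computer=8, software=7, license=1))
```

Both clusters share the dominant feature, and the second feature in each label is what tells the sibling nodes apart for a user browsing the taxonomy.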
Additional advantages are accomplished by the aspect that said subset of said multitude of documents is determined by random selection.
Random selection is a very good approach for making sure that the subset, being the base for generated taxonomy, includes documents comprising the statistically most relevant features.
Additional advantages are accomplished by the aspect that the range of the document dates is divided in equally sized sub-ranges and said random selection is performed separately for documents with document dates from said sub-ranges.
Using the proposed approach guarantees that changes of the terms used within the documents ("evolution of terminology"), which occur over time, are fully reflected in the generated taxonomy and contribute to the establishment of concepts.
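Random selection per date sub-range can be sketched as follows. The number of sub-ranges, the sampling fraction, the rule of taking at least one document from every non-empty sub-range, and the fixed seed are all illustrative assumptions, as is the name `sample_by_date`.

```python
import random
from datetime import date, timedelta

def sample_by_date(doc_dates: dict, fraction: float = 0.1, n_ranges: int = 4, seed: int = 0) -> list:
    """Split the date range into equally sized sub-ranges and sample each one.

    `doc_dates` maps document id to date and must be non-empty;
    `fraction` is expected to lie in (0, 1].
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    ordinals = {doc_id: d.toordinal() for doc_id, d in doc_dates.items()}
    lo, hi = min(ordinals.values()), max(ordinals.values())
    width = (hi - lo + 1) / n_ranges
    buckets = [[] for _ in range(n_ranges)]
    for doc_id, day in ordinals.items():
        buckets[min(int((day - lo) / width), n_ranges - 1)].append(doc_id)
    sample = []
    for bucket in buckets:
        if bucket:
            # At least one document per non-empty sub-range (assumption).
            k = max(1, round(len(bucket) * fraction))
            sample.extend(rng.sample(bucket, k))
    return sample

# Hypothetical collection: 40 documents dated on 40 consecutive days.
doc_dates = {f"d{i}": date(2000, 1, 1) + timedelta(days=i) for i in range(40)}
subset = sample_by_date(doc_dates)
```

Sampling each sub-range separately prevents a burst of documents from one period from crowding out older or newer terminology in the subset.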
Additional advantages are accomplished by the aspect that said subset comprises up to 10% of said multitude of said documents.
According to the proposed methodology the taxonomy generated based upon said subset becomes stable, i.e. increasing the size of the subset will not change the generated taxonomy in an essential manner. The computing effort for taxonomy generation is thus reduced significantly.
FIG. 1 shows a graphical representation of a dendrogram generated by the hierarchical clustering algorithm before and after applying the slicing technology.
FIG. 2 gives an overview of the process architecture of the proposed method of generating a content taxonomy.
FIG. 3 shows an example of a node of the generated taxonomy for the current example taken from the lexical-affinity-based taxonomy.
FIG. 4 shows an example of a node of the generated taxonomy for the current example taken from the linguistic-feature-based taxonomy.