The present invention relates generally to extraction of information from databases, and specifically to text mining in unstructured databases.
Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large quantities of data, and on the discovery of interesting patterns within them. While most work on KDD has been concerned with analyzing structured databases, there has been relatively little development of methods for analyzing the large quantity of information that is currently available only in unstructured, text-based form. An example of work in this latter category is described in "Mining Text Using Keyword Distributions," by Ronen Feldman, Ido Dagan, and Haym Hirsh, Proceedings of the 1995 Workshop on Knowledge Discovery in Databases, which is incorporated herein by reference. Other work is described in "Finding Associations in Collections of Text," by Ronen Feldman and Haym Hirsh, Machine Learning and Data Mining: Methods and Applications, edited by R. S. Michalski, I. Bratko, and M. Kubat, John Wiley and Sons, Ltd., 1997, which is also incorporated herein by reference.
A paper entitled "Technology Text Mining, Turning Information Into Knowledge: A White Paper from IBM," edited by Daniel Tkach, Feb. 17, 1998, which is incorporated herein by reference, describes a program called IBM Intelligent Miner for Text, which extracts terms from unstructured text. "Terms," in the context of the present application, are single words, or short strings of highly-related, linked words, such as "Biotechnology," "New York Stock Exchange," "Free market," or "Health programs." "Term extraction," in the context of the present application, refers to the process of finding terms in a document that have relevance to the content of the document.
InQuery 5.0, produced by Sovereign Hill Software, uses term extraction to identify names of companies and people in one or more documents. The extracted terms are used to enable a search engine to find desired documents responsive to a user's query.
A paper entitled "Text Mining at the Term Level," by Feldman et al., Proceedings of the 1998 Workshop on Knowledge Discovery in Databases, August, 1998, which is incorporated herein by reference, the authors of which are the inventors of the present invention, describes a method for extracting terms from a document in a database, filtering out unimportant terms, and subsequently performing text mining in the database. "Text mining," in the context of the present application, refers to a substantially automated process of extracting useful information from a collection of textual data.
Standard text mining systems typically process documents which have been "categorized," i.e., manually or automatically assigned keywords ("tags") in order to identify their content. Automatic tagging is generally performed by matching words in a document with words from a predetermined list.
It is an object of some aspects of the present invention to provide improved methods for text mining.
It is a further object of some aspects of the present invention to provide improved methods for comparing multiple documents in a database.
It is yet a further object of some aspects of the present invention to provide improved methods for extracting information from multiple documents in a database.
In preferred embodiments of the present invention, a system for mining text in a database comprises a memory, which stores a hierarchical taxonomy of terms, and a processor, which uses the taxonomy to perform effective mining of the database. Preferably, the system enables quantitative, content-based, textual analysis of a large number of documents in the database, in order to present relationships between two or more entries in the taxonomy.
Preferably, a user provides an input indicating terms of interest (some or all of which may be in the taxonomy), and the processor subsequently discovers relationships between terms in the user's input and terms in the taxonomy. Typically, relationships discovered during text mining comprise co-occurrences of two terms in a single document. Preferably, if the user "selects" one of the relationships generated by the text analysis, the system displays relevant portions of original documents in the database which are associated with the discovered relationship.
In some preferred embodiments of the present invention, terms in the term taxonomy ("taxonomy terms") can be edited by the user prior to text mining, and the taxonomy can be modified automatically by the processor and/or interactively with the user, responsive to results of the text mining. Typically, interactive editing of the term taxonomy responsive to results of the text mining yields improved results from a subsequent iteration of text mining, and these improved results may themselves be used to modify the taxonomy again. In this manner, the user may derive information of increased value from each iteration of text mining and term taxonomy modification.
In some preferred embodiments of the present invention, the taxonomy generally has a Directed Acyclic Graph (DAG) structure or a tree structure, and comprises groups of related terms (siblings) stored in the hierarchy one level below respective parent entries. For example, under a parent entry, "Countries," the taxonomy may contain as daughter entries the list of member nations of the United Nations. (The parent entry "Countries" may itself also be a member of a set of siblings in the taxonomy, under a "grandparent" entry, "Political entities.") Prior to text mining, in this example, the user may add the name of a new member nation, or delete the name of a country whose name has changed. Following text mining of the database, and utilizing results derived therefrom, the user may choose to further edit the term taxonomy (for instance, by adding a new country name or variation thereof).
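By way of illustration only, a taxonomy of this kind might be held in memory as a pair of parent-to-daughter and daughter-to-parent maps. The following Python sketch is one possible arrangement under that assumption; the class name `Taxonomy` and its methods are hypothetical and are not taken from any described embodiment.

```python
from collections import defaultdict


class Taxonomy:
    """Hypothetical in-memory term taxonomy kept as a directed acyclic graph."""

    def __init__(self):
        self.children = defaultdict(set)  # parent entry -> daughter entries
        self.parents = defaultdict(set)   # daughter entry -> parent entries

    def descendants(self, node):
        """Return every entry reachable below `node`."""
        seen, stack = set(), [node]
        while stack:
            for child in self.children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    def add(self, parent, child):
        """Add a daughter entry, refusing edges that would create a cycle."""
        if parent == child or parent in self.descendants(child):
            raise ValueError(f"adding {child!r} under {parent!r} would create a cycle")
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def remove(self, parent, child):
        """Delete a daughter entry from under a parent entry."""
        self.children[parent].discard(child)
        self.parents[child].discard(parent)

    def siblings(self, term):
        """Return the entries that share at least one parent with `term`."""
        result = set()
        for parent in self.parents.get(term, ()):
            result |= self.children[parent]
        result.discard(term)
        return result


tax = Taxonomy()
tax.add("Political entities", "Countries")
tax.add("Countries", "France")
tax.add("Countries", "Germany")
```

Because an entry may have more than one parent, the same maps serve both the tree and the DAG cases; user edits before or after mining reduce to calls to `add` and `remove`.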
In some preferred embodiments, the taxonomy has multiple levels, and a broad range of terms in each level, so that the user can narrow or broaden a query prior to an iteration of text mining, in order to optimize the results generated by the processor. For example, if the user would like to investigate President Clinton's foreign policy, she might enter an initial query specifying "Clinton" and all daughter entries of the node "Countries." To broaden the query, "Countries" could be replaced by "Political entities," so that a news article, containing the words "Berlin" and "Paris," but not "Germany" and "France," would also generate a positive response to the query. Alternatively, to narrow the query, the user could specify a taxonomy node "G7 countries," instead of "Countries." In general, a rich, multilevel taxonomy enables the user to enter queries with a desired level of specificity, and to thereby obtain information most relevant to her needs.
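The broadening and narrowing described above may be sketched, purely as an illustration, by expanding a taxonomy node into all entries beneath it and testing each document against the expanded set. The dictionary `TAXONOMY`, its contents, and the functions `expand` and `matches` are assumptions made for this example.

```python
# Hypothetical taxonomy fragment: each node maps to its daughter entries.
TAXONOMY = {
    "Political entities": ["Countries", "Cities"],
    "Countries": ["Germany", "France", "Brazil"],
    "Cities": ["Berlin", "Paris"],
    "G7 countries": ["Germany", "France"],
}


def expand(node, taxonomy=TAXONOMY):
    """Return all entries found below `node`, at any depth."""
    terms = set()
    for child in taxonomy.get(node, []):
        terms.add(child)
        terms |= expand(child, taxonomy)
    return terms


def matches(doc_terms, required_term, node, taxonomy=TAXONOMY):
    """True if the document has `required_term` and any entry below `node`."""
    doc_terms = set(doc_terms)
    return required_term in doc_terms and bool(doc_terms & expand(node, taxonomy))


article = {"Clinton", "Berlin", "Paris"}
narrow = matches(article, "Clinton", "Countries")          # city names only
broad = matches(article, "Clinton", "Political entities")  # cities now count
```

In this sketch the article mentioning only "Berlin" and "Paris" fails the "Countries" query but satisfies the broader "Political entities" query, mirroring the example in the text.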
In a preferred embodiment, the processor prompts the user to refine the query prior to mining of the database's text, in order to optimize the results generated by the processor. For example, if the user enters a query including the words "Colombia" and "Venezuela," the processor preferably examines the taxonomy, determines that the two terms are daughter entries of a parent entry, "South American countries," and asks the user whether the two specified terms should be replaced by the names of all of the countries in South America listed in the taxonomy. Alternatively or additionally, the processor examines daughter entries of "Colombia" and "Venezuela," and asks the user whether some or all of the daughter entries (for instance, names of cities or politicians) should be added to the query.
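One way the processor might arrive at such a prompt is to look for parent entries shared by every term in the query. The sketch below assumes a simple node-to-daughters dictionary; the function names and the second parent entry, "OPEC members," are hypothetical and serve only to show that unshared parents are ignored.

```python
def common_parents(terms, taxonomy):
    """Return the parent entries under which every query term appears."""
    parent_sets = [
        {parent for parent, daughters in taxonomy.items() if term in daughters}
        for term in terms
    ]
    return set.intersection(*parent_sets) if parent_sets else set()


def suggest_generalization(terms, taxonomy):
    """Map each shared parent to the full list of daughters that could replace the terms."""
    return {p: sorted(taxonomy[p]) for p in common_parents(terms, taxonomy)}


taxonomy = {
    "South American countries": ["Colombia", "Venezuela", "Brazil"],
    "OPEC members": ["Venezuela", "Iran"],  # assumed entry, for illustration
}
suggestion = suggest_generalization(["Colombia", "Venezuela"], taxonomy)
```

The result would be offered to the user as a question rather than applied automatically, consistent with the interactive refinement described above.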
In preferred embodiments of the present invention, text mining typically includes determining relationships among terms found in the database which relate to the user's query. Preferably, according to some preferred embodiments of the present invention, the processor subsequently uses these discovered relationships in order to suggest modifications to the taxonomy. For example, if the user's query includes the word "Venezuela" and a taxonomy node "Natural resources," then text mining of the database may determine that the terms "Crude oil," "Coffee," "Sugarcane," and "Bananas" occur with high frequency in documents in the database having the word "Venezuela" and at least one daughter entry of "Natural resources". If, from this list, only "Sugarcane" is not a daughter entry of "Natural resources," then the processor preferably prompts the user to indicate whether "Sugarcane" should be added to the taxonomy as a daughter entry of "Natural resources". Should the user agree, then in processing a subsequent query including, for example, a taxonomy node "Ecological issues" and the taxonomy node "Natural resources", the processor will already "know" that sugarcane is a natural resource. In this manner, useful information derived by the text mining process is reported to the user, and is additionally used to improve the taxonomy in order to enhance the effectiveness of subsequent mining of the same or a different database.
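A simplified sketch of how such a suggestion might be generated follows. It assumes each document has already been reduced to a set of extracted terms, and it uses a plain count threshold (`min_count`) as the measure of "high frequency"; both the threshold and the function name are assumptions, since the text does not fix a particular statistic.

```python
from collections import Counter


def suggest_additions(docs, query_term, node_terms, min_count=2):
    """Propose frequent co-occurring terms that are not yet daughters of the node."""
    counts = Counter()
    for terms in docs:
        terms = set(terms)
        # Only documents with the query term and at least one daughter entry qualify.
        if query_term in terms and terms & node_terms:
            counts.update(terms - {query_term})
    return sorted(
        term for term, n in counts.items()
        if n >= min_count and term not in node_terms
    )


natural_resources = {"Crude oil", "Coffee", "Bananas"}
docs = [
    {"Venezuela", "Crude oil", "Sugarcane"},
    {"Venezuela", "Coffee", "Sugarcane", "Bananas"},
    {"Venezuela", "Crude oil", "Coffee"},
    {"Brazil", "Sugarcane"},
    {"Venezuela", "Baseball"},
]
candidates = suggest_additions(docs, "Venezuela", natural_resources)
```

Each candidate would then be presented to the user for confirmation before being added as a daughter entry.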
Alternatively or additionally, the results of text mining may indicate to the user that a new node should be added to the taxonomy, or that an existing node should be supplemented in light of the generated results. For example, during text mining of a news database, the inventors entered a query including the term, "Ford Motor Corp.," and the taxonomy node, "Companies," so that the processor would generate a list of companies ranked by their frequency of co-occurrence with Ford. Most of the top 10 companies listed were car companies, and this might suggest to the user to create a new node, "Car companies," and to copy the appropriate companies into the new node.
In some preferred embodiments of the present invention, relationships in the database found during text mining yield knowledge about general informational content, or about specific facts or specific events inherent in the text of the documents. For example, a query including the above-mentioned parent entry, "Countries," and an additional term, "Crude oil," was used by the inventors to find that Iran, Saudi Arabia, and the United States regularly appeared in news stories with one or more of at least five other countries, whereas Japan, for instance, only appeared a significant number of times in news stories mentioning Iran. Notably, this potentially useful knowledge is produced by a program implementing the principles of the present invention without requiring the user to ask specific questions. Rather, correlations and relationships discovered during text mining are preferably output automatically, in an appropriate form (e.g., text, table, or graph), in order to yield information relevant to the user's query.
Additionally, a specific fact, for example, that "Bill Clinton" is the "President" of "the United States of America," can be deduced in these embodiments from a sentence in the middle of a document in the database, "US President Bill Clinton addressed a trade meeting on Thursday." Typically, "President" is in a list of relationship terms which are known by a program implementing these embodiments to link a person's name and a country or company. Additionally, a list of synonyms ("US" = "the United States of America") is preferably already known to the processor.
In a similar manner, specific events can be extracted from a document's text. For example, "merger" is a relationship term known to link the names of two companies. If the word "merger" were found in the text of a document, the program would scan the "Companies" node in the taxonomy and report if two known company names are found in the vicinity of "merger," and are grammatically linked to "merger" according to predetermined rules.
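The proximity part of this event extraction can be illustrated with a deliberately simplified sketch. It treats "vicinity" as "same sentence," matches names by case-insensitive substring, and does not model the grammatical linking rules at all; those simplifications, the function name `find_events`, and the company names used are assumptions for the example.

```python
import re
from itertools import combinations


def find_events(text, relation_term, names):
    """Report pairs of known names that share a sentence with the relationship term."""
    events = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        if relation_term.lower() not in lowered:
            continue
        found = sorted(name for name in names if name.lower() in lowered)
        events.extend((a, relation_term, b) for a, b in combinations(found, 2))
    return events


companies = {"Acme Corp", "Globex", "Initech"}  # hypothetical "Companies" node
text = (
    "Acme Corp announced a merger with Globex on Monday. "
    "Initech reported quarterly earnings."
)
events = find_events(text, "merger", companies)
```

A fuller implementation would replace the same-sentence test with the predetermined grammatical rules mentioned above.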
In a further example, text mining of a news database comparing all documents mentioning "Microsoft," and the subset of those documents which also contain "Justice Department," may reveal that the term "Explorer" is correlated more strongly with documents in the latter set than with documents in the former set. Preferably, this information is automatically revealed by the program, and thus may reveal a useful fact which the user might not have known. Text mining, according to the present invention, allows the user to "mine" a database for potentially useful, unknown, and perhaps unsuspected, information, by enabling her to: discover significant correlations between a term in a query and one or more other terms in the database; find time-based trends of a given term or of its correlation with a second term; and compare two or more terms with respect to another term.
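The set-versus-subset comparison in this example may be sketched as follows, assuming each document is represented by its set of extracted terms and that "correlation" is approximated by the share of documents containing the probe term. That approximation and the function names are assumptions; other measures could equally be used.

```python
def term_share(docs, term):
    """Fraction of the given documents that contain `term`."""
    docs = list(docs)
    return sum(term in d for d in docs) / len(docs) if docs else 0.0


def compare_subset(docs, base_term, extra_term, probe):
    """Share of `probe` among documents with `base_term`, and among those also having `extra_term`."""
    base = [d for d in docs if base_term in d]
    subset = [d for d in base if extra_term in d]
    return term_share(base, probe), term_share(subset, probe)


docs = [
    {"Microsoft", "Explorer", "Justice Department"},
    {"Microsoft", "Justice Department", "Explorer"},
    {"Microsoft", "Windows"},
    {"Microsoft", "Office"},
    {"IBM"},
]
in_base, in_subset = compare_subset(docs, "Microsoft", "Justice Department", "Explorer")
```

A markedly higher share in the subset than in the base set is the kind of difference the program would surface to the user.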
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for mining in a database including documents that include text. The method comprises providing a taxonomy of taxonomy terms, mining the documents responsive to the taxonomy to discover a relationship between two or more of the taxonomy terms, analyzing occurrences of the relationship over a plurality of the documents to extract information, not specified by the taxonomy, relating to the two or more taxonomy terms, and presenting the information relating to the two or more taxonomy terms to a user.
In a preferred embodiment, analyzing occurrences of the relationship comprises identifying, responsive to the taxonomy, one of: a fact and an event, inherent in the text of one of the documents. The fact or event may be identified by the proximity, in one of the documents, of the two or more taxonomy terms to a predetermined relationship term.
Preferably, the taxonomy comprises nodes and one of the nodes is a parent entry of the two or more taxonomy terms. Alternatively or additionally, the taxonomy comprises a hierarchy of nodes, wherein a first node is a parent entry of at least one of the two or more taxonomy terms, wherein a second node is also a parent entry of at least one of the two or more taxonomy terms, and wherein the relationship comprises a relationship between the second node and the first node.
Preferably, analyzing comprises analyzing, over a plurality of the documents, co-occurrences of substantially every one of the two or more taxonomy terms with substantially every other one of the two or more taxonomy terms, to determine relationships among the two or more taxonomy terms; and presenting the information comprises displaying at least some of the two or more taxonomy terms and displaying an output indicative of the number of the co-occurrences of substantially each of the at least some terms with substantially every other one of the at least some terms.
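A pairwise co-occurrence count of this kind can be sketched briefly, assuming each document is available as a set of extracted terms. The function name `cooccurrence_counts` and the sample documents are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs, terms):
    """Count, for every pair of the given terms, the documents containing both."""
    terms = set(terms)
    counts = Counter()
    for doc in docs:
        present = sorted(set(doc) & terms)
        # Sorting gives each unordered pair a single canonical key.
        counts.update(combinations(present, 2))
    return counts


docs = [
    {"Iran", "Japan"},
    {"Iran", "Saudi Arabia", "United States"},
    {"Iran", "Japan", "France"},
]
counts = cooccurrence_counts(docs, {"Iran", "Japan", "Saudi Arabia", "United States"})
```

The resulting counts could be shown as a table or used as line weights in a graphical display.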
In one embodiment, the method further comprises displaying a portion of the text of at least one of the documents responsive to the discovered relationship.
Preferably, mining includes extracting a set of one or more document-labeling terms from one of the documents, wherein the set includes the two or more taxonomy terms. Extracting the set of document-labeling terms from the document may comprise determining the grammatical structure of a sentence in the document's text and identifying a group of one or more words in the sentence as a document-labeling term responsive to the grammatical structure. Preferably, extracting the set of document-labeling terms from the document comprises: examining the document, identifying a candidate term in the document, comparing a frequency of occurrence of the candidate term in the document with frequencies of occurrence of the candidate term in other documents in the database to determine differences in the respective frequencies of occurrence, and inserting the candidate term into the set of document-labeling terms corresponding to the document responsive to the comparison.
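The frequency comparison may be illustrated by the following sketch, which keeps a candidate term when its relative frequency in the document is several times its relative frequency in the other documents. The ratio test, the add-one smoothing, the default `min_ratio`, and the function name are all assumptions; the text does not prescribe a particular comparison.

```python
from collections import Counter


def label_terms(doc_tokens, other_docs_tokens, min_ratio=3.0):
    """Select candidates that are unusually frequent here compared with other documents."""
    doc_counts = Counter(doc_tokens)
    other_counts = Counter(t for doc in other_docs_tokens for t in doc)
    other_total = sum(other_counts.values())
    labels = []
    for term, n in doc_counts.items():
        rate_here = n / len(doc_tokens)
        # Add-one smoothing keeps the ratio defined for terms unseen elsewhere.
        rate_elsewhere = (other_counts[term] + 1) / (other_total + 1)
        if rate_here / rate_elsewhere >= min_ratio:
            labels.append(term)
    return sorted(labels)


doc = ["oil", "oil", "oil", "the", "the", "price"]
others = [["the", "the", "market", "the"], ["the", "price", "the", "the"]]
labels = label_terms(doc, others)
```

Here "oil" is retained because it is common in the document and absent elsewhere, while "the" and "price" are rejected.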
In one preferred embodiment, discovering the relationship comprises finding in at least one of the documents a co-occurrence of at least some portion of the two or more taxonomy terms. The two or more taxonomy terms may comprise a cluster of taxonomy terms, wherein the cluster is characterized by the property that terms in the cluster generally have a higher frequency of co-occurrence with respect to each other than their frequency of co-occurrence with respect to terms not in the cluster. Also, discovering the relationship may comprise assigning a weight to the co-occurrence responsive to a distance between a first term and a second term of the two or more taxonomy terms, and wherein analyzing occurrences of the relationship comprises analyzing the relationship responsive to the weight. For example, a first term and a second term of the two or more taxonomy terms may co-occur in a document when a distance between the first and second terms is less than a predetermined distance. The predetermined distance may be, for instance, approximately one paragraph or approximately one sentence.
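A distance-dependent weight of this kind might, for example, be computed as sketched below. Distance is measured in tokens, the cutoff `max_distance` stands in for the predetermined distance, and the weighting function 1/(1 + distance) is an arbitrary illustrative choice rather than anything prescribed here.

```python
def cooccurrence_weight(tokens, first, second, max_distance=20):
    """Weight a co-occurrence by the closest token distance between the two terms."""
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    if not pos_first or not pos_second:
        return 0.0
    distance = min(abs(i - j) for i in pos_first for j in pos_second)
    if distance >= max_distance:
        return 0.0  # too far apart to count as a co-occurrence
    return 1.0 / (1 + distance)


tokens = "venezuela exports crude oil and coffee".split()
near = cooccurrence_weight(tokens, "venezuela", "crude")
far = cooccurrence_weight(tokens, "venezuela", "coffee", max_distance=3)
```

Setting the cutoff to roughly the length of a sentence or a paragraph would correspond to the example distances given above.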
In another embodiment, discovering the relationship comprises discovering a plurality of relationships in two or more of the documents, and analyzing comprises: analyzing the relationships in a first set of the two or more documents, in order to determine a first relationship between the two or more taxonomy terms; analyzing the relationships in a second set of the two or more documents, in order to determine a second relationship between the two or more taxonomy terms; and comparing the first and second relationships. The first set of documents may comprise documents from a first time period, and the second set of documents may comprise documents from a second time period.
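As an illustration of comparing two time periods, the sketch below computes the share of documents in each period in which two terms co-occur. The grouping of documents by a period label, the function names, and the sample data are assumptions made for the example.

```python
def cooccurrence_rate(docs, first, second):
    """Fraction of the documents that contain both terms."""
    docs = list(docs)
    if not docs:
        return 0.0
    return sum(first in d and second in d for d in docs) / len(docs)


def trend(docs_by_period, first, second):
    """Co-occurrence rate of the two terms for each period, keyed by period label."""
    return {
        period: cooccurrence_rate(docs, first, second)
        for period, docs in docs_by_period.items()
    }


docs_by_period = {
    "1997": [{"Microsoft", "Justice Department"}, {"Microsoft"}],
    "1998": [
        {"Microsoft", "Justice Department"},
        {"Microsoft", "Justice Department"},
        {"Justice Department"},
        {"Microsoft", "Justice Department"},
    ],
}
rates = trend(docs_by_period, "Microsoft", "Justice Department")
```

Comparing the per-period values gives the first and second relationships referred to above.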
In a preferred embodiment, at least one of the two or more taxonomy terms is selected by the user, and, in one embodiment, each of the two or more taxonomy terms is selected by the user. Also, presenting the information may comprise displaying a graph comprising a plurality of points, each point representing one of the two or more taxonomy terms, and one or more lines, each line connecting two of the points and indicating a quantitative relationship between the terms represented by said two points. For example, the thickness of each line in the graph, a number displayed near each line in the graph, or the color of each line in the graph may indicate the quantitative relationship. The quantitative relationship indicated by a line in the graph is preferably a co-occurrence frequency of the terms represented by the two points connected by that line.
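One way to produce such a display is to emit a graph description for an external layout tool. The sketch below writes Graphviz DOT text, using the `label` attribute for the displayed number and `penwidth` for line thickness; the choice of DOT, the function name `to_dot`, and the linear scaling of thickness are assumptions for illustration.

```python
def to_dot(edge_counts, max_width=5.0):
    """Render co-occurrence counts as an undirected Graphviz DOT graph."""
    top = max(edge_counts.values(), default=1) or 1  # avoid dividing by zero
    lines = ["graph cooccurrence {"]
    for (a, b), n in sorted(edge_counts.items()):
        width = max_width * n / top  # thicker line for a higher count
        lines.append(f'  "{a}" -- "{b}" [label="{n}", penwidth={width:.1f}];')
    lines.append("}")
    return "\n".join(lines)


edges = {("Iran", "Japan"): 2, ("Iran", "Saudi Arabia"): 4}
dot_text = to_dot(edges)
```

Line color could be encoded in the same way through an additional edge attribute.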
There is further provided, in accordance with a preferred embodiment of the present invention, a method for mining in a database including documents, the documents including text. The method comprises providing a taxonomy of taxonomy terms, mining the documents responsive to the taxonomy to discover a relationship between a set of one or more selected words and at least one of the taxonomy terms, and modifying the taxonomy responsive to the discovered relationship. The taxonomy may comprise a hierarchy of nodes, wherein the at least one taxonomy term comprises two or more related taxonomy terms, one of the nodes is a parent entry of the two or more taxonomy terms, and wherein modifying comprises assigning one of the selected words to be a sibling of the two or more taxonomy terms responsive to the discovered relationship.
The present invention additionally provides a method for mining in a database including documents, the documents including text, where the method comprises providing a taxonomy of taxonomy terms, mining the documents at a taxonomy term level to provide mining results indicative of a relationship between a plurality of terms including at least one taxonomy term, wherein at least one of the terms is specified by a user, and prompting the user to modify the taxonomy based on the mining results. The method preferably further comprises performing a statistical analysis on the mining results to determine a potential modification of the taxonomy and prompting the user on whether or not to carry out the potential modification of the taxonomy. At least one of the plurality of terms may be received in a query entered by the user. In one embodiment, at least one of the plurality of terms is a taxonomy term specified by the user.
The present invention yet further provides a method for mining in a database including documents, the documents including text, with the method comprising providing a taxonomy of taxonomy terms; receiving a query from a user, the query specifying at least one term of interest; mining the documents at a taxonomy term level to provide a first set of mining results indicative of a relationship between the at least one term of interest and at least one of the taxonomy terms; modifying the taxonomy based, at least in part, on the first set of mining results; and mining the documents at a modified taxonomy term level to provide a second set of mining results indicative of the relationship. The method may further comprise, in response to the query, displaying to the user a portion of the taxonomy relevant to the at least one term of interest, and enabling the user to revise the query by including in the query at least one of the taxonomy terms in the displayed portion.
The present invention still further provides a method for mining in a database including documents, the documents including text, the method comprising providing a taxonomy of taxonomy terms; receiving an initial query from a user, the query specifying at least one term of interest; displaying to the user a portion of the taxonomy relevant to the at least one term of interest; receiving an indication from the user to revise the query by including in the query at least one taxonomy term in the displayed portion; and mining the documents at a taxonomy term level based on the revised query to provide mining results indicative of a relationship between the at least one taxonomy term and one or more other terms.
The present invention also provides a method for mining in a database including documents, the documents including text, the method comprising providing a taxonomy of taxonomy terms; mining the documents at a taxonomy term level to provide mining results indicative of a relationship between a plurality of terms including at least one taxonomy term, wherein at least one of the terms is specified by a user; and presenting the mining results to the user by displaying a graph comprising a plurality of points, each point representing one of the plurality of terms, and one or more lines, each line connecting two of the points and indicating a quantitative relationship between the terms represented by said two points.
The present invention additionally provides apparatus and computer program products for carrying out the above-described methods.