The present invention relates generally to knowledge discovery in collections of data, and specifically to text mining.
In recent years, the volume of text documents available on computers and computer networks has been growing rapidly. It is virtually impossible to read all the available documents containing information of importance on a given subject. In order to find desired information, search engines have been developed which provide a user with documents which mention selected words or terms. The user may use Boolean patterns with “and,” “or” and “not” terms to define the scope of the desired documents more precisely. However, the user cannot always define precisely which documents or keyword combinations are desired. In addition, search engines do not provide an integrated picture of the distribution and impact of given terms in an entire corpus of documents.
Text mining is used to find hidden patterns in large textual collections. Text mining tools provide a human-tangible description of the information included in the textual collection. Because the amount of information is so large, a crucial feature of text mining tools is the way the information is organized and/or displayed. To limit the amount of information that a user must digest, it is common to define a context group which defines the information of interest to the particular user. Normally, the context group includes those documents which include one or more terms from a user-defined set.
A central tool in text mining is visualization of the complex patterns that are discovered. One such visualization approach is described, for example, in an article by Feldman R., Klosgen W., and Zilberstien A., entitled “Visualization Techniques to Explore Data Mining Results for Document Collections,” in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (1997), pp. 16-23, which is incorporated herein by reference. This article describes a concept relationship analysis in which a set of concepts (or terms) are searched for in a corpus of textual data formed of a plurality of documents. The concept relationship analysis searches for groups of concepts which appear together in relatively large numbers of documents, and these concepts are displayed together.
One method of representing concept relationships is by displaying context graphs. In context graphs, the concepts (or terms) which appear together in large numbers of documents are designated by nodes. Each two nodes are connected by an edge which has a weight which is equal to the number of documents in which the terms of both nodes appear together. In order to limit the amount of data displayed, only edges which have a weight above a predetermined threshold are displayed. In some context graphs, the concepts which appear in nodes are chosen from a list of interesting terms defined by the user.
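By way of illustration only, the construction of such a context graph may be sketched as follows. The sketch assumes that each document has already been reduced to a set of terms; the function name build_context_graph and its parameters are hypothetical and are not taken from the cited articles.

```python
from collections import Counter
from itertools import combinations


def build_context_graph(documents, threshold=1, interesting_terms=None):
    """Return {(term_a, term_b): weight} for pairs co-occurring in documents.

    documents: iterable of iterables of terms (one per document).
    threshold: only edges whose weight is above this value are kept.
    interesting_terms: optional user-defined set restricting the nodes.
    """
    weights = Counter()
    for doc in documents:
        terms = set(doc)
        if interesting_terms is not None:
            terms &= set(interesting_terms)
        # Sorting gives each unordered pair one canonical key.
        for pair in combinations(sorted(terms), 2):
            weights[pair] += 1
    # Keep only the edges whose weight exceeds the threshold.
    return {pair: w for pair, w in weights.items() if w > threshold}


docs = [
    {"merger", "bank", "oil"},
    {"merger", "bank"},
    {"bank", "oil"},
    {"merger", "bank", "gold"},
]
graph = build_context_graph(docs, threshold=1)
restricted = build_context_graph(docs, threshold=1, interesting_terms={"bank", "oil"})
```

In this example, the edge between "bank" and "merger" receives a weight of 3 and the edge between "bank" and "oil" a weight of 2, while pairs appearing together in only one document fall below the threshold and are omitted.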
In many cases, the corpus of documents is formed of several groups of documents, for example, documents from different dates, and it is desired to apprehend concept relationships as they develop in time. An article by Lent B., Agrawal R., and Srikant R., entitled “Discovering Trends in Text Databases,” in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (1997), pp. 227-230, which is incorporated herein by reference, describes a method of detecting trends in textual collections formed of documents with timestamps, which are partitioned into time groups according to a selected granularity. The textual collection is mined for a group of combinations of words (referred to as phrases) which appear in the documents of the collection. Each combination is given frequency-of-occurrence values for each time group. A user requests to view the frequencies of occurrence of those combinations for which the occurrences follow a desired pattern. However, this method does not give the user any feel for the development of trends in the textual documents as a whole.
In an article entitled “Trend graphs: Visualizing the evolution of concept relationships in large document collections,” by Feldman R., Aumann Y., Zilberstien A., and Ben-Yehuda Y., in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (1998), which is incorporated herein by reference, a graphical tool is described for analyzing and visualizing dynamic changes in concept relationships over time.
It is an object of the present invention to provide methods and apparatus for displaying trends that are discovered in large collections of information.
In some aspects of the present invention, the trends relate to appearances of terms found by text mining in groups of documents.
It is another object of some aspects of the present invention to provide methods and apparatus for displaying the evolution of concept relationships in groups of documents.
It is another object of some aspects of the present invention to provide methods and apparatus for displaying differences between patterns of term appearances in different groups of documents.
It is still another object of some aspects of the present invention to provide methods and apparatus for determining major changes in patterns of term appearances in groups of documents.
In preferred embodiments of the present invention, a corpus of documents is divided into sub-groups defined by a differentiating parameter, such as the dates of the documents, or their origin. For each sub-group of documents, a separate context graph is prepared, and the relationship between the graphs is calculated.
In some preferred embodiments of the present invention, the differentiating parameter defines an order of the context graphs. The context graphs are preferably displayed sequentially, either one after another or one above the other. Each graph is preferably displayed with indications which show the differences between the present graph and the previous graph. Preferably, each edge in the graph is marked to indicate a difference between its weight in the present graph and its weight in the previous graph. Alternatively or additionally, each edge is marked to indicate the difference between its weight in the present graph and its average weight in a predetermined number of previous graphs.
Preferably, the edges are marked graphically, for example, using different colors, widths, and/or lengths to indicate the weight differences. In a preferred embodiment of the present invention, four indications are used for the following groups of edges: new edges, edges with increased weights, edges with decreased weights, and edges with substantially stable weights.
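By way of illustration only, the assignment of edges to the four groups may be sketched as follows. The relative tolerance used here to decide that a weight is "substantially stable" is an assumption made for the example, as are the function name classify_edges and the label strings; the choice of colors or widths for each label is left to the display.

```python
def classify_edges(previous, current, tolerance=0.1):
    """Label each edge of `current` relative to `previous`.

    previous, current: {(term_a, term_b): weight}
    tolerance: relative change up to which a weight counts as stable
               (an assumed definition of "substantially stable").
    """
    labels = {}
    for edge, weight in current.items():
        old = previous.get(edge)
        if old is None:
            labels[edge] = "new"
        elif abs(weight - old) <= tolerance * old:
            labels[edge] = "stable"
        elif weight > old:
            labels[edge] = "increased"
        else:
            labels[edge] = "decreased"
    return labels


previous = {("bank", "merger"): 10, ("bank", "oil"): 20, ("gold", "oil"): 5}
current = {
    ("bank", "merger"): 15,
    ("bank", "oil"): 21,
    ("gold", "oil"): 2,
    ("bank", "gold"): 4,
}
labels = classify_edges(previous, current)
```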
In some preferred embodiments of the present invention, the differentiating parameter is the date of the documents. Preferably, all the documents from a single period are considered to belong to a single sub-group. The periods may be of substantially any length, e.g., from minutes to years, according to a user selection. Alternatively or additionally, the differentiating parameter comprises the origins of the documents, such as the authors, editors, countries of origin or the original languages of the documents. Further alternatively or additionally, substantially any other parameter may be used, such as the length of a document, or the average salary or number of employees of the company mentioned most frequently in a document.
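By way of illustration only, the division of a corpus into sub-groups according to a differentiating parameter may be sketched as follows. The document records, the field names "date" and "origin", and the functions period_key and partition are hypothetical; only three period lengths are handled in this sketch.

```python
from collections import defaultdict
from datetime import date


def period_key(day, granularity):
    """Map a date to a sortable key for the chosen period length."""
    if granularity == "year":
        return (day.year,)
    if granularity == "month":
        return (day.year, day.month)
    if granularity == "day":
        return (day.year, day.month, day.day)
    raise ValueError(f"unsupported granularity: {granularity}")


def partition(documents, key):
    """Group documents into sub-groups, ordered by the value of key(document)."""
    groups = defaultdict(list)
    for doc in documents:
        groups[key(doc)].append(doc)
    return dict(sorted(groups.items()))


corpus = [
    {"id": 1, "date": date(1998, 1, 5), "origin": "US"},
    {"id": 2, "date": date(1998, 1, 20), "origin": "UK"},
    {"id": 3, "date": date(1998, 2, 3), "origin": "US"},
]
# The same routine serves a date-based and an origin-based parameter.
by_month = partition(corpus, lambda d: period_key(d["date"], "month"))
by_origin = partition(corpus, lambda d: d["origin"])
```

Passing the differentiating parameter as a key function keeps the partitioning step independent of whether the parameter is a date, an origin, or any other attribute of the document.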
In a preferred embodiment of the present invention, the context graphs are displayed such that all nodes that are common to two or more of the graphs appear in substantially the same relative locations in the graphs. Therefore, the layout of the displayed form of the context graphs is prepared after all the nodes of all the graphs are known. Alternatively, the locations of the nodes and/or the distances between the nodes are used to indicate the importance of the terms of the nodes. In such cases, animation techniques are preferably used to help the user follow the changes in the positions of the nodes.
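By way of illustration only, a layout in which common nodes keep the same location in every graph may be sketched as follows. The positions are computed once from the union of the nodes of all the graphs; the circular arrangement and the function name shared_layout are assumptions made for the example, and any other layout computed over the union of nodes would serve equally.

```python
import math


def shared_layout(graphs, radius=1.0):
    """Assign one fixed position per term across all graphs.

    graphs: list of {(term_a, term_b): weight}.
    Terms are placed evenly on a circle in alphabetical order; the circular
    arrangement is an illustrative choice only.
    """
    # Collect every term that appears as an endpoint of any edge in any graph.
    terms = sorted({term for graph in graphs for edge in graph for term in edge})
    step = 2 * math.pi / len(terms) if terms else 0.0
    return {
        term: (radius * math.cos(i * step), radius * math.sin(i * step))
        for i, term in enumerate(terms)
    }


graphs = [{("bank", "merger"): 3}, {("bank", "oil"): 2}]
positions = shared_layout(graphs)
```

Each graph is then drawn using only the entries of positions for its own nodes, so that a term such as "bank" does not move when the display passes from one graph to the next.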
In some preferred embodiments of the present invention, an animation sequence is used to display the changes between the context graphs. Alternatively or additionally, the context graphs are listed, for example, in a list box, and the user can choose which context graph should be displayed relative to which other graphs. Further alternatively or additionally, a plurality of context graphs are superimposed one over the other, and each graph is displayed using a different color.
In some preferred embodiments of the present invention, the corpus of documents includes a set of documents selected by a search engine, a clustering program, or by any other method of filtering and/or gathering of documents. Furthermore, the trend graphs produced in accordance with preferred embodiments of the present invention may be used to select groups of documents on which additional filtering and/or other processing is to be performed.
Although preferred embodiments are described herein with reference to mining and analysis of text documents, those skilled in the art will appreciate that the principles of the present invention may also be applied to visualization of trends and other variations in collections of information of other types. For example, trends occurring among the records in a large database may be analyzed and visualized in similar fashion.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for visualizing variations in a corpus of information, including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, including:
for each of the entries, extracting characteristics of information contained therein;
finding pairs of different characteristics that appear together in at least one of the entries;
determining an occurrence value for each of the pairs of characteristics in each sub-group in which both of the characteristics appear;
comparing the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and
providing an indication of the comparative occurrence values of the pairs.
Preferably, the entries include text documents, and the characteristics include terms appearing in the documents.
Further preferably, determining the occurrence value includes counting the number of entries in which the pair appears.
Still further preferably, finding the pairs of characteristics includes finding pairs of characteristics which appear together in at least a predetermined number of the entries.
In a preferred embodiment, finding the pairs of characteristics includes finding pairs of characteristics which appear together in at least two of the sub-groups.
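By way of illustration only, the determination of occurrence values per sub-group, together with the two optional filters described above, may be sketched as follows. The sketch assumes that each entry has already been reduced to a set of characteristics and that the occurrence value is the number of entries containing the pair; the names pair_counts, occurrence_table, min_entries and min_subgroups are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(entries):
    """Count the entries in which each unordered pair of terms appears."""
    counts = Counter()
    for entry in entries:
        counts.update(combinations(sorted(set(entry)), 2))
    return counts


def occurrence_table(subgroups, min_entries=1, min_subgroups=1):
    """Return {pair: {label: count}} for pairs passing both filters.

    min_entries: minimum total number of entries containing the pair.
    min_subgroups: minimum number of sub-groups in which the pair appears.
    """
    table = {}
    for label, entries in subgroups.items():
        for pair, n in pair_counts(entries).items():
            table.setdefault(pair, {})[label] = n
    return {
        pair: per_group
        for pair, per_group in table.items()
        if sum(per_group.values()) >= min_entries and len(per_group) >= min_subgroups
    }


subgroups = {
    "Q1": [{"bank", "merger"}, {"bank", "merger", "oil"}],
    "Q2": [{"bank", "merger"}, {"gold", "oil"}],
}
table = occurrence_table(subgroups, min_subgroups=2)
```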
Preferably, extracting the characteristics includes automatically mining the corpus to extract characteristics therefrom.
In a preferred embodiment, the differentiating parameter defines an order, and comparing the occurrence values includes comparing the occurrence values in a first sub-group with the occurrence values in one or more previous sub-groups in the order. Preferably, comparing the occurrence values includes comparing the occurrence values in the first sub-group with the occurrence values in a closest previous sub-group. Alternatively or additionally, comparing the occurrence values includes comparing the occurrence values in the first sub-group with an average of the occurrence values in the one or more previous sub-groups. Further alternatively or additionally, providing the indication includes displaying a symbol which indicates a measure of evolution in the occurrence value in the first sub-group relative to the occurrence values in the one or more previous sub-groups in the order.
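By way of illustration only, the comparison of an occurrence value with the closest previous sub-group, or with an average over several previous sub-groups, may be sketched as follows. The function names, the window parameter and the particular symbols are assumptions made for the example.

```python
def change_vs_history(series, index, window=1):
    """Difference between an occurrence value and the mean of earlier ones.

    series: occurrence values of one pair, in sub-group order.
    index: position of the sub-group being examined.
    window: number of preceding sub-groups to average (1 = closest previous).
    Returns None when there is no preceding sub-group.
    """
    history = series[max(0, index - window):index]
    if not history:
        return None
    return series[index] - sum(history) / len(history)


def trend_symbol(delta):
    """Map a change to a display symbol (the symbols are an arbitrary choice)."""
    if delta is None:
        return "*"  # no history available
    if delta > 0:
        return "+"
    if delta < 0:
        return "-"
    return "="


series = [4, 6, 8, 3]
vs_previous = change_vs_history(series, 3, window=1)
vs_average = change_vs_history(series, 3, window=3)
```

With a window of 1 the last value is compared with 8, its closest predecessor; with a window of 3 it is compared with 6, the average of the three preceding values.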
In a preferred embodiment, providing the indication includes displaying a table or graph. Preferably, displaying the graph includes displaying a graph in which each term is represented by a node, the pairs of characteristics that are found are represented by edges, and substantially each edge is associated with the indication of the comparative appearance of the respective pair. Typically, displaying the graph includes displaying with substantially each edge a weight of the edge, which equals the occurrence value of the respective pair in a first sub-group. Alternatively or additionally, displaying the graph includes displaying the graph such that the lengths of the edges represent the occurrence value of the respective pair in a first sub-group.
In a preferred embodiment, displaying the graph includes displaying for each two sub-groups a graph which compares the occurrence values in the two sub-groups. Preferably, displaying the graph for each two sub-groups includes displaying the graphs such that nodes which represent the same term are displayed in substantially the same relative location. Further preferably, the graphs of each two sub-groups are displayed as an animation sequence.
Preferably, displaying the graph includes displaying a plurality of superimposed graphs, each of which represents the appearances of the pairs in a different sub-group. Further preferably, displaying the plurality of superimposed graphs includes displaying each of the graphs in a different color.
In a preferred embodiment, providing the indication of the comparative values of the pairs includes providing an indication in which pairs having a characteristic in common are grouped together.
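By way of illustration only, one possible way of grouping pairs that have a characteristic in common, for example as a preliminary to a tabular display, may be sketched as follows. The function name group_by_shared_term and the form of the returned structure are assumptions made for the example.

```python
from collections import defaultdict


def group_by_shared_term(pair_values):
    """Index pairs under each term they contain.

    pair_values: {(term_a, term_b): value}
    Returns {term: [(other_term, value), ...]}, so that all pairs sharing
    a term are listed together under that term.
    """
    groups = defaultdict(list)
    for (a, b), value in pair_values.items():
        groups[a].append((b, value))
        groups[b].append((a, value))
    return {term: sorted(members) for term, members in sorted(groups.items())}


pair_values = {("bank", "merger"): 3, ("bank", "oil"): 2}
grouped = group_by_shared_term(pair_values)
```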
There is also provided, in accordance with a preferred embodiment of the present invention, apparatus for visualizing variations in a corpus of information including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, including:
a processor which finds pairs of characteristics which appear together in at least one of the documents, determines an occurrence value for each of the pairs of characteristics in each sub-group in which both of the characteristics appear, and compares the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and
a display which displays an indication of the comparative occurrence values of the pairs.
In a preferred embodiment, the processor finds characteristics selected from a group of automatically determined characteristics.
There is further provided, in accordance with a preferred embodiment of the present invention, a method for selecting a range of values of a variable, including:
providing a graphic user interface on a display, including a slide-piece that has an initial dimension and is translatable along an axis representing the variable such that each position of the slide-piece along the axis corresponds to a given value of the variable;
positioning the slide-piece at a first position on the axis, so as to indicate a first value of the variable; and
changing the dimension of the slide-piece so as to indicate a second value of the variable, whereby the first and second values of the variable define the selected range.
Preferably, changing the dimension of the slide-piece includes changing a length of the slide-piece along the axis. Further preferably, the first and second values of the variable include the extrema of the range.
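By way of illustration only, the state of such a slide-piece may be modeled as follows. The sketch represents only the position and length of the slide-piece along the axis and performs no rendering or event handling; the class name SlidePiece and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SlidePiece:
    """Model of a slide-piece on an axis; no rendering is attempted here."""

    position: float      # axis value at the fixed end of the piece
    length: float = 0.0  # extent along the axis (the variable dimension)

    def move_to(self, value):
        """Translate the piece so that its fixed end indicates `value`."""
        self.position = value

    def stretch_to(self, value):
        """Resize the piece so that its free end indicates `value`."""
        self.length = value - self.position

    def selected_range(self):
        """Return the (low, high) extrema covered by the piece."""
        end = self.position + self.length
        return (min(self.position, end), max(self.position, end))


slider = SlidePiece(position=0)
slider.move_to(1995)      # first value of the variable
slider.stretch_to(1998)   # second value of the variable
selected = slider.selected_range()
```

Here the first value is set by positioning the slide-piece and the second by changing its length, so that the two ends of the slide-piece correspond to the extrema of the selected range.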
There is still further provided, in accordance with a preferred embodiment of the present invention, a computer program product for visualizing variations in a corpus of information, including a plurality of information entries which are divided into a plurality of sub-groups according to a differentiating parameter of the entries, the documents including text, the program having computer-readable program instructions embodied therein, which instructions cause a computer to:
for each of the entries, extract characteristics of information contained therein;
find pairs of different characteristics that appear together in at least one of the entries;
determine an occurrence value for each of the pairs of characteristics in each sub-group in which both of the characteristics appear;
compare the occurrence values of at least some of the pairs of characteristics for at least two of the sub-groups; and
provide an indication of the comparative occurrence values of the pairs.
There is also provided, in accordance with a preferred embodiment of the present invention, a computer program product for selecting a range of values of a variable, the program having computer-readable program instructions embodied therein, which instructions cause a computer to:
provide a graphic user interface on a display, including a slide-piece that has an initial dimension and is translatable along an axis representing the variable such that each position of the slide-piece along the axis corresponds to a given value of the variable;
position the slide-piece at a first position on the axis, so as to indicate a first value of the variable; and
change the dimension of the slide-piece so as to indicate a second value of the variable, whereby the first and second values of the variable define the selected range.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a method for extracting data from a corpus of information, including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, including:
for a first one of the entries in a first one of the sub-groups, extracting a characteristic of information contained therein;
for a second one of the entries in a second one of the sub-groups, extracting the same characteristic of information;
automatically determining respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and
providing an indication of the occurrence values.
Preferably, providing the indication includes providing a visual indication of the occurrence values. Further preferably, the differentiating parameter includes a sequence, most preferably a time sequence.
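By way of illustration only, the determination of occurrence values of a single characteristic in a sequence of sub-groups, and a rudimentary visual indication thereof, may be sketched as follows. The sketch assumes that each entry has been reduced to a set of terms and that the sub-group labels sort in sequence order; the function names occurrence_by_subgroup and text_bars are hypothetical.

```python
def occurrence_by_subgroup(subgroups, term):
    """Count, per sub-group, the entries that contain `term`.

    subgroups: {label: list of entries}, each entry an iterable of terms;
               labels are assumed to sort in sequence order.
    """
    return {
        label: sum(1 for entry in entries if term in entry)
        for label, entries in sorted(subgroups.items())
    }


def text_bars(counts):
    """Render counts as rows of '#' characters, a crude visual indication."""
    return [f"{label}: {'#' * n} {n}" for label, n in counts.items()]


subgroups = {
    1998: [{"oil"}, {"oil", "gold"}, {"bank"}],
    1997: [{"oil", "bank"}, {"gold"}],
}
counts = occurrence_by_subgroup(subgroups, "oil")
bars = text_bars(counts)
```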
There is still additionally provided, in accordance with a preferred embodiment of the present invention, apparatus for extracting data from a corpus of information including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, including:
a processor, which (a) for a first one of the entries in a first one of the sub-groups, extracts a characteristic of information contained therein, (b) for a second one of the entries in a second one of the sub-groups, extracts the same characteristic of information, and (c) automatically determines respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and
a display, which provides an indication of the occurrence values.
There is yet additionally provided, in accordance with a preferred embodiment of the present invention, a computer program product for extracting data from a corpus of information, including a plurality of information entries, each entry being assigned to one or more sub-groups according to a differentiating parameter of the entries, the program having computer-readable program instructions embodied therein, which instructions, when read by a computer, cause the computer to:
for a first one of the entries in a first one of the sub-groups, extract a characteristic of information contained therein;
for a second one of the entries in a second one of the sub-groups, extract the same characteristic of information;
automatically determine respective first and second occurrence values corresponding to the characteristic in the first and second sub-groups; and
provide an indication of the occurrence values.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which: