1. Field of the Invention
The present invention relates to data analysis utilizing an electronic computer and to display of a result of analysis. In particular, the invention can be applied to display of results of document research while the results are classified through use of keywords and analysis and to display of a relationship between clients and commodity products in connection with analysis of a market.
2. Description of the Related Art
A relational matrix defined by words and documents is often employed in classification and analysis of documents. This corresponds to a matrix which is defined by assigning words to rows and documents to columns and recording the number of times words appear in corresponding documents (see FIG. 3). A vector expression of a word can be extracted by picking up the rows of the matrix one by one, or a vector expression of a document can be extracted by picking up the columns one by one. Hence, a distance between two words A and B can be defined by the distance between vectors, a cosine of vectors, or an inner product of vectors. Similarly, a distance between documents can also be defined by a distance between vectors, a cosine of vectors, or an inner product of vectors. Specifically, a distance between words can be expressed by comparison between vectors which employ documents as components, and a distance between documents can be expressed by means of comparison between vectors which take words as components. Further, a distance between a document and a word can be defined by reference to the number of times the word appears in the document.
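The word-by-document matrix and the three comparison measures described above can be illustrated with a minimal sketch. The words, documents, and counts below are hypothetical values chosen only for illustration; they are not taken from FIG. 3.

```python
import math

# Hypothetical word-by-document count matrix (assumed data):
# rows are words, columns are documents.
words = ["baseball", "pitcher", "election"]
docs = ["doc1", "doc2", "doc3"]
counts = [
    [3, 2, 0],  # occurrences of "baseball" in doc1..doc3
    [1, 2, 0],  # occurrences of "pitcher"
    [0, 0, 4],  # occurrences of "election"
]

def word_vector(i):
    """Row i: a word expressed with documents as components."""
    return list(counts[i])

def doc_vector(j):
    """Column j: a document expressed with words as components."""
    return [row[j] for row in counts]

def inner(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    nu, nv = math.sqrt(inner(u, u)), math.sqrt(inner(v, v))
    if nu == 0 or nv == 0:
        return 0.0
    return inner(u, v) / (nu * nv)

# Word-to-word comparison uses rows; document-to-document uses columns.
print(cosine(word_vector(0), word_vector(1)))
print(euclidean(doc_vector(0), doc_vector(1)))
```

The same functions apply unchanged to any matrix of this shape, for example one in which commodity products are assigned to rows and clients to columns.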
A relationship between clients and purchased commodity products in the field of marketing is another example of data which can be expressed in the form of a matrix. In a matrix in which commodity products are assigned to rows and clients are assigned to columns, data pertaining to the specific commodity products and quantities purchased by each client are recorded, thereby enabling recording of a relationship between clients and commodity products (see FIG. 9). Even in such a case, vector expressions of respective clients or those of respective commodity products can be extracted. A vector of a certain client shows the client's preferences for commodity products. Clients having similar vectors can be said to have similar preferences. Even in this case, a distance between clients can be expressed by vectors which take commodity products as components. A distance between commodity products can be expressed by vectors which take clients as components.
In this example, documents and words are related to each other in the form of rows and columns of a matrix. Clients and commodity products are also related to each other in the same fashion. A large number of combinations of data are defined in the form of such a relationship. In subsequent descriptions, a matrix is described by taking a relationship between words and documents as an example.
As a result of the proliferation of information technology and the Internet, the number of documents produced in electronic form is increasing explosively. For instance, electronic versions of existing newspaper articles and existing patent publications, which have already been issued, have reached an enormous volume, and their volume is certain to increase continuously in the future. Effective utilization of such documents inevitably requires search, classification, and analysis means which enable accurate selection of a target document.
Means for classifying results of a document search can be broadly grouped into the following methods.
(1) A first method is to establish classification criteria beforehand and classify documents according to the criteria. FIG. 17 is a flowchart showing the outline of operation and processing pertaining to the method. At the outset, criteria are manually prepared as a preparatory stage (1701). It is a common practice that, once the criteria have been established, they are reused over several occasions. Next, documents are searched (1702), and a cluster of search results is automatically classified in accordance with the criteria (1703). The results are displayed on a per-category basis (1704). This method is suitable for use with newspaper articles, for which categories can be prepared beforehand.
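The flow of FIG. 17 can be pictured with the following minimal sketch. It assumes, purely for illustration, that the criteria take the form of a set of indicative words per category; the category names, word lists, and function names are hypothetical and are not specified in the description above.

```python
# Hypothetical criteria prepared in advance (step 1701):
# each category is associated with a set of indicative words.
CRITERIA = {
    "economics": {"market", "stocks", "inflation"},
    "sports": {"baseball", "league", "olympic"},
}

def classify(document_text, criteria=CRITERIA):
    """Assign a document to the category whose words occur most often (step 1703).

    Returns "unclassified" when no criterion word occurs in the document.
    """
    tokens = document_text.lower().split()
    scores = {
        category: sum(tokens.count(word) for word in indicative_words)
        for category, indicative_words in criteria.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"

def group_by_category(documents, criteria=CRITERIA):
    """Group search results per category for display (step 1704)."""
    groups = {}
    for doc in documents:
        groups.setdefault(classify(doc, criteria), []).append(doc)
    return groups

# Search results (step 1702) are represented here by a hypothetical list.
results = ["Stocks fell as inflation rose", "The baseball league opened"]
print(group_by_category(results))
```

In this sketch a document can only ever land in a category that was listed in CRITERIA beforehand; anything else falls into "unclassified".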
(2) A second method is to locate an aggregation of documents in a space through use of only distances among the documents. Computation is performed repeatedly until location of the aggregation is completed, whereby self-organizing classification becomes feasible. Well-known means for realizing the second method include an SOM (self-organizing map) [a reference document: T. Kohonen, “Self-organizing Map,” Springer-Verlag Tokyo, ISBN 4-431-70700-X (1996)] and a layout based on a spring model [a reference document: Peter Eades, “A Heuristic for Graph Drawing,” Congressus Numerantium, Vol. 42 (1984)], [a reference document pertaining to an example applied to analysis of documents: Isamu WATANABE, “Visual Text Mining,” Journal of JSAI (Japanese Society for Artificial Intelligence), Vol. 16, No. 2 (2001)].
The spring model is a layout method specific to an undirected graph (a graph involving no directions) and can be applied to classification and arrangement of documents and words. For instance, when documents are arranged, documents are deemed as nodes of a graph. The nodes are deemed to be connected together by springs in accordance with a distance between the documents (or the degree of similarity). FIG. 28 shows an example of an initial state.
As shown in FIG. 28, nodes schematically represent documents, and serrated lines schematically represent springs. A system formed from the nodes and the springs is brought to a stable state; that is, a state in which the respective springs are settled with lengths close to their original lengths or without involvement of expansion or contraction. Consequently, similar documents are located adjacent to each other, and non-similar documents are located so as to become distant from each other. FIG. 29 shows such an example.
From the example shown in FIG. 29, it can be visually ascertained that documents A, B, and C are analogous to each other but a document D is not analogous to any of the documents A, B, and C.
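The relaxation of such a spring system can be sketched as follows. This is a simplified illustration, not the heuristic of the cited reference: it assumes linear springs whose natural lengths equal the desired document distances, a two-dimensional layout, and plain fixed-step iteration. The distance values, the function name spring_layout, and its parameters are hypothetical.

```python
import math
import random

def spring_layout(dist, iterations=2000, step=0.05, seed=0):
    """Place nodes in 2-D so that pairwise distances approach dist[i][j].

    Every pair of nodes is joined by a linear spring whose natural length is
    dist[i][j]; nodes are moved repeatedly along the net spring force.
    """
    rng = random.Random(seed)
    n = len(dist)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(iterations):
        forces = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                length = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                stretch = length - dist[i][j]  # > 0: spring pulls nodes together
                fx = stretch * dx / length
                fy = stretch * dy / length
                forces[i][0] += fx
                forces[i][1] += fy
                forces[j][0] -= fx
                forces[j][1] -= fy
        for i in range(n):
            pos[i][0] += step * forces[i][0]
            pos[i][1] += step * forces[i][1]
    return pos

# Hypothetical distances: documents A, B, C are mutually close; D is far from all.
distances = [
    [0, 1, 1, 5],
    [1, 0, 1, 5],
    [1, 1, 0, 5],
    [5, 5, 5, 0],
]
for name, (x, y) in zip("ABCD", spring_layout(distances)):
    print(name, round(x, 2), round(y, 2))
```

Note that the inner double loop visits every pair of nodes on every iteration, so the work per iteration grows with the number of pairs.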
The methods, such as the SOM and the spring model, enable realization of an arrangement suitable for an aggregation of documents obtained as a result of search and hence enable flexible classification of documents on a per-search basis. Under these methods, self-organizing classification is performed. Therefore, a result of classification does not necessarily follow criteria which are readily understandable to a person. Hence, a cluster of results is subjected to labeling.
A flowchart shown in FIG. 18 shows the labeling operation. Specifically, documents are first searched (1801), and self-organization and arrangement of the thus-searched documents are performed (1802). On the basis of a result of arrangement, the documents are divided into clusters (1803). The respective clusters are labeled (1804). Finally, the result of arrangement and the labels are displayed (1805). JP-A-8-263514 describes an example to which an SOM is applied as the previously-described self-organizing method. In many occasions, a result of adoption of the SOM is displayed in the form of a cluster of cells, such as that shown in FIG. 22. A result of use of the spring model is often displayed as an arrangement of data in a space, such as that shown in FIG. 20.
(3) A third method is to classify documents in accordance with the degree of proximity to a keyword. FIG. 19 is a flowchart showing the outline of operation and processing of this method. First, documents are searched (1901). The documents obtained as a result of search are afforded keywords by a person, or keywords are automatically extracted for the documents (1902). The keywords are arranged at fixed points in a space (1903). The individual documents obtained as a result of search are arranged in the same space in accordance with the degree of proximity to the keywords (1904). Finally, a result of arrangement is displayed (1905). JP-A-2000-76279 describes an example of this method.
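Steps 1903 and 1904 can be pictured with the following minimal sketch. It assumes, for illustration only, that keywords are fixed at equally spaced points on a circle and that each document is placed at the average of the keyword positions weighted by how often each keyword occurs in the document. The cited publication may use a different placement rule; the function names and data are hypothetical.

```python
import math

def keyword_anchors(keywords, radius=1.0):
    """Fix keywords at equally spaced points on a circle (step 1903)."""
    n = len(keywords)
    return {
        kw: (radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n))
        for k, kw in enumerate(keywords)
    }

def place_document(keyword_counts, anchors):
    """Place a document at the count-weighted average of keyword positions (step 1904).

    A document containing none of the keywords is placed at the origin.
    """
    total = sum(keyword_counts.get(kw, 0) for kw in anchors)
    if total == 0:
        return (0.0, 0.0)
    x = sum(keyword_counts.get(kw, 0) * anchors[kw][0] for kw in anchors) / total
    y = sum(keyword_counts.get(kw, 0) * anchors[kw][1] for kw in anchors) / total
    return (x, y)

# Hypothetical keywords and per-document keyword counts.
anchors = keyword_anchors(["baseball", "soccer", "tennis", "golf"])
print(place_document({"baseball": 3, "soccer": 1}, anchors))
```

Under this assumed rule, a document mentioning only one keyword lands exactly on that keyword, while a document mentioning two opposite keywords equally often lands at the center, where it is indistinguishable from a document mentioning no keyword at all.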
(4) JP-A-10-171823 describes a technique for arranging documents in a given dimensional space in accordance with proximity and non-proximity in terms of semantic contents, by means of clustering documents represented in the form of vectors into an appropriate number of groups and applying mapping means only to typical centers of the clusters. According to this technique, documents to be analyzed are first transformed into vectors by vector transform means 3503. The documents, which have been transformed into vectors, are classified into clusters by clustering means 3504. Then, typical vectors of the respective clusters are extracted by cluster center extraction means 3505. The cluster centers are arranged in a low-dimensional space while distances between the cluster centers are kept as intact as possible. The documents included in the respective clusters are arranged on the basis of the thus-determined arrangement and positions and the result of classification of the vectors determined by the clustering means 3504. At the time of arrangement of documents, the documents are compared with the center of the cluster located adjacent to the cluster to which the vectors of the documents belong.
However, the first classification means enables classification of documents in accordance with only predetermined criteria. This method may be suitable for classifying newspaper articles into categories, such as economics and sports. However, a situation in which search results must be classified in accordance with new criteria is constantly encountered in search of documents. Even when sports are classified into professional sports and amateur sports, the Olympic Games, which have been changed so as to permit participation of professional athletes, may require another criterion. Classification changes according to the circumstances. Hence, a limit is imposed on the method for establishing criteria in advance.
According to the second classification method, computation of distances among all the documents must be performed repeatedly until the documents are settled at appropriate positions (or location of the documents is completed) in order to effect self-organizing classification. When the number of documents to be classified has become enormous, continuation of computation until location of the documents is completed incurs a very high computational cost. Therefore, the method can be said to be less practical.
In relation to the spring model, a model pertaining to four nodes is shown in FIG. 28. FIG. 30 is a schematic representation of a model pertaining to eight nodes, in which springs are depicted as lines. As can be seen from this drawing, when the number of nodes is doubled from four to eight, the number of springs increases from 6 to 28; that is, the number of springs is more than quadrupled. When N documents are interconnected with springs, the number of springs is determined as {N×(N−1)}/2. Consequently, the number of springs is on the order of the square of N.
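The growth of the number of springs can be checked directly from the formula {N×(N−1)}/2. The short sketch below tabulates the count for a few values of N; the function name spring_count and the chosen values of N are illustrative.

```python
def spring_count(n):
    """Number of springs when every pair of n nodes is connected: n(n-1)/2."""
    return n * (n - 1) // 2

# Tabulate the spring count as the number of documents grows.
for n in (4, 8, 16, 1000):
    print(n, spring_count(n))
```

Because each doubling of N multiplies the count by roughly four, an aggregation of a thousand documents already involves several hundred thousand springs.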
Even provided that documents have been arranged in a space through use of the spring model, as shown in FIG. 20, determination of the nature of clusters is a delicate problem. Even when the documents have been clustered in such a manner as shown in FIG. 21, the clusters are not always given appropriate labels (character strings) signifying the clusters. Since the clusters are determined through computation of multi-dimensional vectors, there are no guarantees that classification is easily understandable for a person. Even if an attempt is made to extract classified labels from titles of documents and display the thus-extracted labels in a manner as described in JP-A-2000-82068, appropriate labels will not always be extracted when labels of the clustered documents differ from each other or when a large number of documents of the same title are present in another cluster. Hence, the labeling problem cannot be solved unless expensive computation is performed after the documents have been classified and arranged. The same also applies to the SOM.
The third classification method is based on the premise that keywords are fixedly displayed and spaced uniformly from each other. When a person designates keywords, the person is not allowed to set a desired number of desired classification words. Specifically, when, for example, six keywords have been selected, the six keywords will not always be words which are optimal for classification of an aggregation of documents and which represent opposites. For example, when an attempt is made to classify newspaper articles pertaining to sports, words which are not uniform in terms of conceptual or abstract level, such as “Baseball,” “Ball Game,” “High-school Baseball Games,” or “J-League,” or which are not suitable for classifying an aggregation of documents may be designated. In a case where keywords are extracted by a computer, even if appropriate keywords are extracted, the keywords are uniformly spaced apart from each other regardless of the aggregation of given documents, and hence the aggregation of documents may be classified into clusters which do not reflect the original characteristics of the documents. More specifically, on the premise that six keywords are arranged in a hexagonal pattern in a manner as described in JP-A-2000-76279, when only one of the six keywords has a meaning unique to the aggregation of documents, appropriate classification and arrangement of the documents is likely to fail.
Under the fourth technique, at the time of arrangement of each document, the document is compared with a center of a cluster located in proximity to a cluster to which the document belongs. However, the document is not compared with centers of all clusters. Therefore, even when a document vector classified into a certain cluster actually has a characteristic similar to a center of a cluster located outside the neighborhood of the cluster to which the document belongs, the influence of the center of the cluster located outside the neighborhood is disregarded. Hence, a mapping result accurately reflecting the characteristic of a document can hardly be attained. Moreover, when documents are arranged, the documents are not labeled. Therefore, a displayed result of arrangement may be visually less discernible for a user. In order to realize display of labeled documents, expensive computation, such as computing operation for determining labels on the basis of a result of arrangement of data or a correspondence between labels and data, is required.