The present invention relates to a document processor for displaying and printing multiple input document data in a predetermined format, a document processing method, and a computer-readable recording medium for recording a program to execute the method on a computer. Furthermore, this invention relates to a document classification device and a document classification method for classifying multiple input document data based on the contents thereof, and particularly for refining classification categories calculated during document classification, and to a computer-readable recording medium for recording a program to execute the method on a computer.
Various document classification devices and document retrieval devices have been developed in recent years. The proliferation of network technology, such as the Internet, has made it possible to access a huge amount of electronic documents, domestically and overseas, and there has been a proportionate rapid expansion in the amount of data which is stored electronically. Accordingly, there is an increasing need for intellectual operations such as classifying large collections of document data into meaningful categories.
The benefits of classifying large amounts of document data according to their meaning are as follows. Firstly, it makes it easier to retrieve data. Retrieval becomes relatively easy since vast groups of documents can be retrieved using category names as clues.
Secondly, entire groups of data can be grasped. That is, it is possible to grasp the contents (individual classifications) of an entire cluster of documents. However, when a large amount of document data is classified by an operator, although accurate classification can be achieved, classification requires enormous manpower and time. Consequently, in view of the huge amount of documents stored in recent years, devices for automatically classifying document data have been proposed.
As an example of a conventional device for automatically classifying documents, Japanese Patent Application Laid-open (JP-A) No. 7-36897 discloses a device which defines a document as a document vector characterized by a word, uses clustering to group these document vectors, and automatically classifies the documents based on the grouped document vectors.
Furthermore, in "Projections for Efficient Document Clustering" (Hinrich Schütze and Craig Silverstein, Proceedings of SIGIR, ACM, pages 78-81, 1997), documents are classified in a latent semantic space. Other conceivable methods include probabilistic approaches.
Furthermore, in recent years, the proliferation of the Internet and the like has made it possible to access large numbers of document clusters, and as a result, there is an increasing need to be able to use these document clusters effectively, in accordance with the intentions of a variety of users. To accomplish this, an intellectual operation is starting to be used in which a large number of document clusters are classified into meaningful categories, and the structure of the document clusters is grasped. However, when this type of classification is performed manually, enormous manpower and time are required. Further, since only the classifier knows how the document data was classified, the classification standards change when the person responsible for classification is replaced.
Consequently, there is a demand for a document classification device capable of automatically classifying groups of documents according to the same type of classification standards used by humans. For example, Japanese Patent Application Laid-open (JP-A) No. 7-114572 discloses a document classification device that automatically extracts a word characteristic vector from a document and classifies the document based on the characteristic vector, thereby making it possible to automatically classify documents using meaningful differences.
However, since the conventional document classification device described above uses a method for statistically classifying documents arranged in multi-dimensional space essentially comprising words, the result of the classification is nothing more than the statistically determined behaviour of the words. Consequently, clusters (partial groups of individual classified documents) calculated after classification are sometimes incomprehensible to the operator (user).
A further problem is that the question of what kind of classification is appropriate depends on the characteristics of the document clusters to be classified and the intentions of the user, making it difficult to define an appropriate classification. In particular, when grasping entire data groups as mentioned above, the type of classification required will differ depending on the widely varying intentions of the operators, and it will be difficult to obtain the result desired by the operator in a single classification.
Thus, the problem can be interpreted by saying that a document classification result includes a great amount of noise, only one part of which is of use to the operator.
Furthermore, the conventional technology does not consider the constitutional units of the document, and in a case where the structure of a document is partitioned by one or multiple period symbols, titles, and the like, multiple topics and meanings are contained in a single document. This results in problems that it is difficult for a user to understand the classification categories, the category may be limited to a specific topic or specific meaning, or the document may be classified under a category different to that intended by the user.
A context-dependent automatic classification device is disclosed in Japanese Patent Application Laid-open (JP-A) No. 6-176064, and aims to increase classification precision by automatically classifying documents in consideration of the conclusive data therein, but essentially does not solve the problems mentioned above.
Furthermore, conventional document processors, such as the document classification device and document retrieval device described above, merely classify or retrieve documents, and give no consideration to further analysis of information hidden in the document clusters. Consequently, they have the disadvantage that a separate analyzing device must be used to analyze information hidden in the document clusters.
Furthermore, the operator who wishes to analyze the information does not perform classification and retrieval as an end in itself, but simply as an intermediate step during his analysis of the information. After classification and retrieval, in order to grasp the result more easily it is usually necessary to derive a meaningful result from the information analysis by repeating a variety of other processes, such as maximizing the practical usefulness of the information included in the original document, rearranging the result, carrying out totalization and statistical processing, and drawing up charts and graphs based on the results.
Furthermore, table-calculating software is sometimes needed when analyzing information about numerical data. However, table-calculating software was originally developed to handle numerical data, and is not sufficiently effective for analyzing textual data, particularly when the analysis concerns the meaning of documents.
This invention has been achieved in order to solve the problems of the conventional examples described above. It is a first object of the present invention to provide a document processor, a document processing method, and a computer-readable recording medium storing programs for executing the method on a computer, for carrying out analysis concerning the meaning of documents, not simply by outputting the results of fixed functions such as classification and retrieval, but by supporting a complete range of information analysis.
To solve the problems of the conventional example described above, it is a second object of the present invention to provide a document classification device and a document classification method capable of momentarily determining what type of contents are contained in a given document cluster, and a computer-readable recording medium for storing programs for executing the method on a computer.
Furthermore, to solve the problems of the conventional example described above, it is a third object of the present invention to provide a document classification device and a document classification method wherein, when one document contains multiple topics and meanings, these can be classified into categories according to specific topics and meanings, so that the classifications do not differ from categories desired by a user, thereby enabling the user to easily comprehend the classification categories, and a computer-readable recording medium for storing programs for executing the method on a computer.
In order to solve the problems mentioned above, the document processor according to one aspect of the present invention, for displaying and printing multiple input document data in a predetermined format, comprises a document memory unit for storing input document data; a selection unit for selecting all or part of the document data stored in the document memory unit; a characteristics extraction unit for extracting data relating to characteristics of letter rows from all or part of the document data selected by the selection unit; a work processing unit for work-processing all or part of the document data based on the data relating to characteristics of letter rows extracted by the characteristics extraction unit; and an output unit for outputting all or part of the document data work-processed by the work processing unit.
According to the above aspect of this invention, when analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
Further, the output unit of the document processor comprises an item value set unit for setting a plurality of item values based on the contents of all or part of the document data work-processed by the work-processing unit; and a totalization unit for totalizing all or part of the document data for each item value set by the item value set unit. Furthermore, the output unit outputs all or part of the document data in the format of a table having an item value as at least one axis.
Hence the result of the work-processing can easily be expressed in a cross table, and the contents of the information can easily be grasped. Therefore, when analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
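The totalization and cross-table output described above can be illustrated with a minimal sketch. It assumes that each work-processed document has already been reduced to a record carrying two item values; the field names `category` and `year`, the sample records, and the function `cross_table` are hypothetical and are not taken from the specification.

```python
from collections import Counter

def cross_table(records, row_key, col_key):
    """Count records for each (row item value, column item value) pair."""
    counts = Counter((r[row_key], r[col_key]) for r in records)
    rows = sorted({r[row_key] for r in records})
    cols = sorted({r[col_key] for r in records})
    # One list per row, columns in sorted order; pairs that never occur count as 0.
    table = [[counts[(row, col)] for col in cols] for row in rows]
    return rows, cols, table

# Hypothetical work-processed documents, each tagged with two item values.
records = [
    {"category": "battery", "year": 1997},
    {"category": "battery", "year": 1998},
    {"category": "display", "year": 1998},
    {"category": "battery", "year": 1998},
]

rows, cols, table = cross_table(records, "category", "year")
print("columns:", cols)
for row, line in zip(rows, table):
    print(row, line)
```

Each cell holds the number of documents sharing the row and column item values, which corresponds to a table having an item value on each axis.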
Further, the output unit outputs all or part of the document data work-processed by the work processing unit together with all or part of the document data in its state prior to work-processing by the work processing unit.
Hence data to be work-processed and other data can be displayed simultaneously and identified, whereby the range of the work-processing to be carried out can be accurately and easily determined. Therefore, when analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
Further, the document memory unit also stores all or part of the document data work-processed by the work processing unit.
Since other data can be handled simultaneously, when thereafter analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
Further, the selection unit further selects all or part of the document data output by the output unit.
Since all or part of the document data output by the output unit can be selected for analysis, a wide variety of information can be analyzed with high precision. Therefore, when analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
Further, the document memory unit further stores data relating to contents of the work processing.
Hence not only can loss of data relating to the contents of work-processing be prevented and the data managed easily, but also the relationship between settings used in the work-processing and the processed result can be determined. Therefore, when analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
A document classification device for classifying documents based on contents thereof according to another aspect of the present invention comprises an input unit for inputting document data; a language analyzer unit for analyzing document data input by the input unit and obtaining language analysis information; a vector creation unit for creating document characteristic vectors for the document data based on the language analysis information obtained by the language analyzer unit; a classification unit for classifying documents based on the degree of similarity between document characteristic vectors created by the vector creation unit, and creating clusters of documents; a cluster characteristics calculation unit for calculating cluster characteristics, which are characteristics of clusters of documents created by the classification unit; and a classification category memory unit for storing cluster characteristics, calculated by the cluster characteristics calculation unit, as constituent elements of classification categories.
According to the above aspect of this invention, it is possible to obtain clusters, and to structure and categorize the clusters based on their contents using their degree of similarity to the cluster center, and the like.
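The chain of units described above (language analysis, vector creation, similarity-based classification, cluster characteristics calculation) can be sketched as follows. This is only an illustration under stated assumptions: word tokenization stands in for language analysis, term-frequency counts stand in for document characteristic vectors, and a simple single-pass clustering with a cosine-similarity threshold stands in for the classification unit. None of these choices, nor the names and the threshold value, come from the specification.

```python
import math
import re
from collections import Counter

def analyze(text):
    """Stand-in for language analysis: lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def to_vector(text):
    """Document characteristic vector as a sparse term-frequency map."""
    return Counter(analyze(text))

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def classify(vectors, threshold=0.3):
    """Single-pass clustering: join the most similar cluster or start a new one."""
    clusters = []  # each cluster: {"members": [document indices], "centroid": Counter}
    for i, vec in enumerate(vectors):
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)  # summed vector; same direction as the mean
        else:
            clusters.append({"members": [i], "centroid": Counter(vec)})
    return clusters

def cluster_characteristics(cluster, top_n=3):
    """Cluster characteristics: the highest-weighted terms of the centroid."""
    return [term for term, _ in cluster["centroid"].most_common(top_n)]

# Hypothetical input documents.
docs = [
    "battery charge battery cell",
    "battery cell voltage",
    "screen pixel display",
    "display pixel color",
]
clusters = classify([to_vector(d) for d in docs])
# The stored characteristics would serve as constituent elements of classification categories.
categories = [cluster_characteristics(c, top_n=2) for c in clusters]
print([c["members"] for c in clusters], categories)
```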
A document classification device for classifying documents based on contents thereof according to still another aspect of the present invention comprises an input unit for inputting document data; a language analyzer unit for analyzing document data input by the input unit and obtaining language analysis information; a vector creation unit for creating document characteristic vectors for the document data based on the language analysis information obtained by the language analyzer unit; a classification unit for classifying documents based on the degree of similarity between document characteristic vectors created by the vector creation unit, and creating clusters of documents; a cluster characteristics calculation unit for calculating cluster characteristics, which are characteristics of clusters of documents created by the classification unit; a display unit for displaying the cluster characteristics calculated by the cluster characteristics calculation unit; a cluster selection specification unit for selecting predetermined clusters from the clusters of documents created by the classification unit; and a classification category memory unit for storing cluster characteristics, calculated by the cluster characteristics calculation unit, as constituent elements of classification categories.
According to the above aspect of this invention, only selected clusters are used, making it possible to structure and categorize the clusters in a manner closer to that desired by the operator.
Further, the arrangement of the present invention described above further comprises a document characteristic vector memory unit for storing document characteristic vectors created by the vector creation unit; and a vector correction unit for correcting document characteristic vectors stored in the document characteristic vector memory unit, so that document characteristic vectors of documents belonging to clusters selected by the cluster selection unit are deleted. Furthermore, the classification unit classifies documents based on the document characteristic vectors corrected by the vector correction unit.
Hence the effects of clusters which are already known can be eliminated, and new clusters can be created.
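A minimal sketch of this vector correction follows, assuming the stored vectors are held in a mapping from document identifier to vector and that each cluster is a list of document identifiers. The function name `remove_selected` and the sample data are hypothetical.

```python
def remove_selected(vectors, clusters, selected):
    """Delete the vectors of documents that belong to the selected clusters."""
    known = {doc_id for cluster_id in selected for doc_id in clusters[cluster_id]}
    return {doc_id: vec for doc_id, vec in vectors.items() if doc_id not in known}

# Hypothetical stored document characteristic vectors and a previous classification result.
vectors = {"d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 3, 1], "d4": [0, 1, 2]}
clusters = {"c1": ["d1", "d2"], "c2": ["d3"], "c3": ["d4"]}

# The operator selected cluster c1 as already known; only the rest is classified again.
remaining = remove_selected(vectors, clusters, selected=["c1"])
print(sorted(remaining))
```

Running the classification again on `remaining` means the already-known cluster no longer dominates the result.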
Further, the document classification device of the present invention further comprises a document characteristic vector memory unit for storing document characteristic vectors created by the vector creation unit; and a document expression space correction unit for correcting the document expression space used when determining the degree of similarity between document characteristic vectors stored in the document characteristic vector memory unit, based on a characteristics amount calculated from clusters selected by the cluster selection unit. Furthermore, the classification unit classifies documents based on the degree of similarity between document characteristic vectors created by the vector creation unit, using the document expression space corrected by the document expression space correction unit.
Hence, cluster characteristics selected by the operator in the previous classification can be eliminated from the next classification, enabling new clusters to be created.
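The specification does not state here how the document expression space is corrected. One possible reading, shown purely as an assumption, is to take the centroid of the selected cluster as the characteristics amount and to remove that direction from every vector by orthogonal projection before similarity is measured. The function names and sample vectors below are hypothetical.

```python
import math

def centroid(vectors):
    """Mean vector of a cluster, used here as its characteristics amount."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def project_out(vec, direction):
    """Remove the component of vec that lies along direction."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(vec)
    scale = sum(v * d for v, d in zip(vec, direction)) / norm_sq
    return [v - scale * d for v, d in zip(vec, direction)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical vectors over three terms; the first term dominates the selected cluster.
selected_cluster = [[1, 0, 0], [1, 0, 0]]
a, b = [2, 0, 1], [2, 1, 0]

direction = centroid(selected_cluster)
before = cosine(a, b)
after = cosine(project_out(a, direction), project_out(b, direction))
print(before, after)
```

In this example the two documents look similar only because of the term that characterizes the selected cluster; once that direction is removed, their similarity drops to zero, so the next classification can surface other groupings.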
Further, the document classification device according to the present invention further comprises a document characteristic vector memory unit for storing document characteristic vectors created by the vector creation unit; and a document expression space correction unit for correcting the document expression space used when determining the degree of similarity between document characteristic vectors stored in the document characteristic vector memory unit, based on a characteristics amount calculated from clusters selected by the cluster selection unit. Furthermore, the classification unit classifies documents based on the degree of similarity between document characteristic vectors created by the vector creation unit, using the document expression space corrected by the document expression space correction unit.
Hence influences of the known cluster can be eliminated and cluster characteristics selected by the operator in the previous classification can be eliminated from the next classification, enabling new clusters to be created.
Further, the document classification device of the present invention further comprises a selection information appending unit for appending selection information showing the fact of selection when all or part of the documents belonging to a cluster of documents created by the classification unit have been selected. Furthermore, the display unit displays the cluster characteristics, and also displays the selection information appended by the selection information appending unit.
Hence it is possible to improve the ability to identify documents used on multiple occasions, and the ability to identify documents which have not been selected at all.
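As a minimal sketch, the selection information can be modeled as a per-document count of how often the document belonged to a selected cluster. The count-based representation and the names `append_selection_info` and `display_rows` are assumptions made for illustration only.

```python
from collections import Counter

def append_selection_info(selection_counts, selected_documents):
    """Record that each of the given documents has been selected once more."""
    selection_counts.update(selected_documents)
    return selection_counts

def display_rows(all_documents, selection_counts):
    """Pair every document with its selection count; unselected documents show 0."""
    return [(doc_id, selection_counts[doc_id]) for doc_id in all_documents]

# Hypothetical documents and two rounds of cluster selection by the operator.
all_documents = ["d1", "d2", "d3", "d4"]
counts = Counter()
append_selection_info(counts, ["d1", "d2"])
append_selection_info(counts, ["d2", "d3"])

for doc_id, times in display_rows(all_documents, counts):
    print(doc_id, times)
```

Documents with a count above one were used on multiple occasions, and documents with a count of zero have not been selected at all.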
Further, the classification category memory unit stores cluster characteristics and/or information created by an operator, in addition to all or part of the documents belonging to a cluster of documents selected by the selection specification unit, as constituent elements of classification categories.
Hence the contents of clusters can be easily recognized, and in addition, the operator can easily create his own classification categories, thereby improving the usefulness of the classification categories.
A document classification device for classifying document clusters in accordance with contents thereof according to still another aspect of the present invention comprises a document input unit for inputting document data groups; a document dividing unit for dividing document data into one or multiple divided document data based on a predetermined reference; a document-divided document map creation unit for creating a map showing the correspondence between the document data and the divided document data; a divided document classification unit for classifying the divided document data; a divided document classification result creation unit for creating divided document classification result information based on a classification result of the divided document classification unit; and a document classification result creation unit for creating classification result information of the above document data using the document-divided document map and the divided document classification result information.
According to the above aspect of this invention, when one document contains multiple topics and meanings, these can be classified into categories according to specific topics and meanings, so that the classifications do not differ from categories desired by a user, thereby enabling the user to easily comprehend the classification categories. Furthermore, since the positions of the divided documents in documents prior to division (documents belonging to the clusters) are displayed, the user is able to efficiently read the parts of the document clusters he or she wishes to read.
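The division, mapping, and result-creation steps can be sketched as follows. The sketch assumes that the predetermined reference is a blank line between paragraphs, and it uses a toy keyword rule in place of the divided document classification unit; both are placeholders, as are all names and sample texts.

```python
def divide(documents):
    """Split each document on blank lines and record where each part came from."""
    divided, doc_map = {}, {}
    for doc_id, text in documents.items():
        parts = [p.strip() for p in text.split("\n\n") if p.strip()]
        for position, part in enumerate(parts):
            part_id = f"{doc_id}#{position}"
            divided[part_id] = part
            doc_map[part_id] = (doc_id, position)  # document-divided document map
    return divided, doc_map

def classify_part(text):
    """Toy keyword rule standing in for the divided document classification unit."""
    return "power" if "battery" in text.lower() else "other"

def document_results(divided, doc_map):
    """Fold divided-document results back onto the original documents."""
    results = {}
    for part_id, text in divided.items():
        doc_id, position = doc_map[part_id]
        results.setdefault(doc_id, []).append((classify_part(text), position))
    return results

# Hypothetical input: the first document covers two topics.
documents = {
    "d1": "The battery lasts long.\n\nThe screen is bright.",
    "d2": "Shipping was slow.",
}
divided, doc_map = divide(documents)
results = document_results(divided, doc_map)
print(results)
```

Because each result keeps the position of the part within its source document, a reader can jump straight to the passage that placed the document in a given category.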
Further, the document classification device further comprises a document save unit for saving the document data; a divided document save unit for saving the divided document data; and a document-divided document map save unit for saving a document-divided document map created by the document-divided document map creation unit.
Hence for a single document data, it is possible to efficiently determine classification results having different parameters such as the number of classifications, the classification method, and the settings used in the classifications, without recreating the divided document data and the document-divided document map. Furthermore, by classifying the document data and saving the data needed to create the classification result, the user is free to take more time over the classification, and to re-analyze previously classified documents within a given period of time.
Further, the document classification device in the specific arrangement described above further comprises a divided document classification result save unit for saving divided document classification result information created by the divided document classification result creation unit.
Hence, as an additional effect, after one classification has been carried out, the result of that classification can be expressed in a variety of formats such as text, charts, graphs, and the like. Furthermore, by saving the divided document classification result information, the user is free to take more time over classifications and analysis of classification results, and to re-analyze previously classified documents in a variety of formats within a given period of time.
Further, the multiple divided document data created by the document dividing unit contains the document data in its state prior to being divided.
Hence in addition to a classification structure of detailed document data, obtained by classifying the divided document data, the user can obtain a classification structure fusing schematic macro classifications as a result of classifying the document data itself prior to division.
Further, the document dividing unit divides document data based on information relating to the structure of the document data.
Hence division and the like of different topics can be carried out, whereby documents can be classified in such a manner that the detailed classification structures of their document data can be known.
Further, the document classification device further comprises a document element extraction unit for extracting elements in the document data; and an element-accompanying information extraction unit for extracting element-accompanying information accompanying the elements extracted by the document element extraction unit. Furthermore, the document dividing unit divides the document data using elements extracted by the document element extraction unit, or the elements and element-accompanying information extracted by the element-accompanying information extraction unit.
Hence documents can be classified so that the detailed classification structure of the document data can be known.
Further, the document dividing unit divides document data in compliance with a specified specification range.
Hence documents can be classified in accordance with the wishes of the user, and so that the detailed classification structure of the document data can be known.
Further, the document dividing unit divides document data based on the number of letters, the number of sentences, or both the number of letters and the number of sentences.
Hence there is an increased capability to classify different documents having contents of different topics and the like. Therefore, as above, documents can be classified so that the detailed classification structure of the document data can be known.
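Division by the number of letters, the number of sentences, or both can be sketched as below. The sketch assumes that sentences end with a period, exclamation mark, or question mark followed by whitespace, and that a chunk is closed as soon as adding the next sentence would exceed either limit; the function names and limits are hypothetical.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def divide_by_size(text, max_sentences=None, max_chars=None):
    """Group sentences into chunks limited by sentence count, character count, or both."""
    chunks, current = [], []
    for sentence in split_sentences(text):
        candidate = current + [sentence]
        too_many = max_sentences is not None and len(candidate) > max_sentences
        too_long = max_chars is not None and len(" ".join(candidate)) > max_chars
        if current and (too_many or too_long):
            chunks.append(" ".join(current))  # close the chunk before the limit is exceeded
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One. Two. Three. Four. Five."
print(divide_by_size(text, max_sentences=2))
print(divide_by_size(text, max_chars=10))
```

A single sentence longer than `max_chars` is kept whole in its own chunk rather than being cut mid-sentence, which is a design choice of this sketch.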
Further, the document classification result creation unit extracts and presents information showing document data, and representative information accompanying the document data, as classification result information.
Hence the user is able to determine a detailed schematic structure or overall structure of the document data.
Further, the document classification result creation unit extracts and presents information showing divided-document data, and representative information accompanying the divided document data, as classification result information.
Hence the user is able to determine a detailed schematic structure or overall structure of the document data. In addition, the user can easily determine which divided document has been classified in a given category.
A document processing method according to still another aspect of the present invention outputs multiple input document data in order to display or print the document data in a predetermined format, and comprises the steps of storing input document data; selecting all or part of the document data stored in the document memory unit; extracting data relating to characteristics of letter rows from all or part of the document data selected by the selection unit; work-processing all or part of the document data based on the data relating to characteristics of letter rows extracted by the characteristics extraction unit; and outputting all or part of the document data work-processed by the work processing unit.
According to the above aspect of this invention, when analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
Further, the step of outputting comprises the steps of setting a plurality of item values based on the contents of all or part of the document data work-processed by the work-processing unit; and totalizing all or part of the document data for each item value set by the item value set unit; and outputs all or part of the document data in the format of a table having an item value as at least one axis.
Hence the result of the work-processing can easily be expressed in a cross table, and the contents of the information can easily be grasped. Therefore, when analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
Further, the step of outputting further comprises outputting all or part of the document data work-processed by the work processing unit together with all or part of the document data in its state prior to work-processing by the work processing unit.
Hence the data to be work-processed and other data can be displayed simultaneously and identified, whereby the range of the work-processing to be carried out can be accurately and easily determined. Therefore, when analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
Further, the step of storing further comprises storing all or part of the document data work-processed by the work processing unit.
Since other data can be handled simultaneously, when thereafter analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
Further, the step of selecting further comprises selecting all or part of the document data output by the output unit.
Since all or part of the document data output by the output unit can be selected for analysis, a wide variety of information can be analyzed with high precision. Therefore, when analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
Further, the step of storing a document further comprises storing data relating to contents of the work processing.
Hence not only can loss of data relating to the contents of work-processing be prevented and the data managed easily, but also the relationship between settings used in the work-processing and the processed result can be determined. Therefore, when analyzing documents according to their meanings, rather than merely outputting the result of the analysis, the entire information analysis operation can be supported.
A document classification method for classifying documents based on contents thereof according to still another aspect of the present invention comprises the steps of inputting document data; language-analyzing document data input in the step of inputting and obtaining language analysis information; creating document characteristic vectors for the document data based on the language analysis information obtained in the step of language-analyzing; classifying documents based on the degree of similarity between document characteristic vectors created in the step of creating vectors, and creating clusters of documents; calculating cluster characteristics, being characteristics of clusters of documents created in the step of classifying; and storing cluster characteristics, calculated in the step of calculating cluster characteristics, as constituent elements of classification categories.
According to the above aspect of this invention, it is possible to obtain clusters, and to structure and categorize the clusters based on their contents using their degree of similarity to the cluster center, and the like.
A document classification method for classifying documents based on contents thereof according to still another aspect of the present invention comprises the steps of inputting document data; language-analyzing document data input in the step of inputting and obtaining language analysis information; creating document characteristic vectors for the document data based on the language analysis information obtained in the step of language-analyzing; classifying documents based on the degree of similarity between document characteristic vectors created in the step of creating vectors, and creating clusters of documents; calculating cluster characteristics, which are characteristics of clusters of documents created in the step of classifying; displaying the cluster characteristics calculated in the step of calculating cluster characteristics; selecting predetermined clusters from the clusters of documents created in the step of classifying; and storing cluster characteristics, calculated in the step of calculating cluster characteristics, as constituent elements of classification categories.
According to the above aspect of this invention, only selected clusters are used, making it possible to structure and categorize the clusters in a manner closer to that desired by the operator.
Further, the document classification method further comprises a step of correcting document characteristic vectors stored in the step of storing document characteristic vectors, so that document characteristic vectors of documents belonging to clusters selected by the step of selecting clusters are deleted. Furthermore, the step of classifying comprises classifying documents based on the document characteristic vectors corrected by the step of correcting vectors.
Hence the effects of clusters which are already known can be eliminated, and new clusters can be created.
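The vector correction can be sketched as follows, assuming the stored vectors are held in a mapping from document identifier to vector and the clusters in a mapping from cluster identifier to member documents. The function name `correct_vectors` and the sample data are hypothetical.

```python
def correct_vectors(stored_vectors, clusters, selected_cluster_ids):
    """Return a copy of the stored vectors with every document that belongs
    to a selected cluster deleted; the original mapping is left untouched."""
    removed = {
        doc_id
        for cluster_id in selected_cluster_ids
        for doc_id in clusters[cluster_id]
    }
    return {
        doc_id: vec
        for doc_id, vec in stored_vectors.items()
        if doc_id not in removed
    }

# Hypothetical stored vectors and a previous classification result.
stored_vectors = {
    "d1": {"stock": 1, "market": 1},
    "d2": {"stock": 1, "prices": 1},
    "d3": {"football": 1, "match": 1},
    "d4": {"recipe": 1, "oven": 1},
}
clusters = {0: ["d1", "d2"], 1: ["d3"], 2: ["d4"]}

# The operator selected cluster 0, so its documents are left out of
# the next classification pass.
remaining = correct_vectors(stored_vectors, clusters, [0])
```

The next classification pass would then run on `remaining` only, so the already known cluster no longer attracts documents or shapes the new clusters.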
Further, the document classification method comprises a step of correcting the document expression space used when determining the degree of similarity between document characteristic vectors stored in the step of storing document characteristic vectors, based on a characteristics amount calculated from clusters selected in the step of selecting clusters, and the step of classifying comprises classifying documents based on the degree of similarity between document characteristic vectors created in the step of creating vectors, using the document expression space corrected in the step of correcting the document expression space.
Hence cluster characteristics selected by the operator in the previous classification can be eliminated from the next classification, enabling new clusters to be created.
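One possible reading of correcting the document expression space is to re-weight the dimensions that characterize the selected clusters before computing similarity. The sketch below makes that assumption explicit: it zeroes the weight of the heaviest terms of a selected cluster and uses a weighted cosine similarity. The weighting scheme and the names `corrected_space` and `weighted_cosine` are illustrative assumptions, not taken from the source.

```python
import math

def corrected_space(selected_cluster_vectors, top_n=1):
    """Per-term weights that zero out the top_n heaviest terms of the
    selected cluster (assumed form of the 'characteristics amount')."""
    totals = {}
    for vec in selected_cluster_vectors:
        for term, value in vec.items():
            totals[term] = totals.get(term, 0.0) + value
    top_terms = sorted(totals, key=lambda t: (-totals[t], t))[:top_n]
    return {term: 0.0 for term in top_terms}

def weighted_cosine(a, b, weights):
    """Cosine similarity in the corrected space; unlisted terms weigh 1.0."""
    def w(term):
        return weights.get(term, 1.0)
    dot = sum(w(t) * a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(w(t) * v * v for t, v in a.items()))
    nb = math.sqrt(sum(w(t) * v * v for t, v in b.items()))
    return dot / (na * nb) if na and nb else 0.0

# Two hypothetical documents that look alike only because of "stock".
doc_a = {"stock": 2, "merger": 1}
doc_b = {"stock": 2, "football": 1}

before = weighted_cosine(doc_a, doc_b, {})  # uncorrected space
weights = corrected_space([{"stock": 3, "market": 1}])
after = weighted_cosine(doc_a, doc_b, weights)  # "stock" no longer counts
```

In this example the similarity of the two documents drops from about 0.8 to 0.0 once the dominant term of the previously selected cluster is removed from the space, so the next classification is free to group them by their remaining content.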
Further, the document classification method comprises a step of correcting the document expression space used when determining the degree of similarity between document characteristic vectors stored in the step of storing document characteristic vectors, based on a characteristics amount calculated from clusters selected in the step of selecting clusters. Furthermore, the step of classifying comprises classifying documents based on the degree of similarity between document characteristic vectors created in the step of creating vectors, using the document expression space corrected in the step of correcting the document expression space.
Hence influences of the known cluster can be eliminated and cluster characteristics selected by the operator in the previous classification can be eliminated from the next classification, enabling new clusters to be created.
Further, the document classification method comprises a step of appending selection information indicating that a selection has been made when all or part of the documents belonging to a cluster of documents created in the step of classifying have been selected. Furthermore, the step of displaying comprises displaying the cluster characteristics together with the selection information appended in the step of appending selection information.
Hence it is possible to improve the ability to identify documents used on multiple occasions, and the ability to identify documents which have not been selected at all.
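A minimal sketch of appending and displaying selection information follows. It assumes the selection information is a per-document count of how many times the document has been selected; the function names and the display format are hypothetical.

```python
from collections import Counter

def append_selection(selection_counts, selected_docs):
    """Record that each of the given documents has been selected once more."""
    selection_counts.update(selected_docs)

def display_lines(clusters, characteristics, selection_counts):
    """Build display text: cluster characteristics followed by each member
    document with its selection information."""
    lines = []
    for cluster_id, docs in clusters.items():
        lines.append(f"cluster {cluster_id}: {', '.join(characteristics[cluster_id])}")
        for doc_id in docs:
            count = selection_counts[doc_id]
            mark = f"selected x{count}" if count else "not selected"
            lines.append(f"  {doc_id} [{mark}]")
    return lines

# Hypothetical classification result and cluster characteristics.
clusters = {0: ["d1", "d2"], 1: ["d3"]}
characteristics = {0: ["stock", "market"], 1: ["football"]}

selection_counts = Counter()
append_selection(selection_counts, ["d1", "d2"])  # whole cluster 0 selected
append_selection(selection_counts, ["d1"])        # d1 selected a second time

lines = display_lines(clusters, characteristics, selection_counts)
```

Keeping a count rather than a single flag lets the display distinguish documents used on multiple occasions from documents that have never been selected.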
Further, the step of creating classification categories comprises including cluster characteristics and/or information created by an operator, in addition to all or part of the documents belonging to a cluster of documents selected in the step of specifying selection, as constituent elements of classification categories.
Hence the contents of clusters can be easily recognized, and in addition, the operator can easily create his own classification categories, thereby improving the usefulness of the classification categories.
A document classification method for classifying document clusters in accordance with contents thereof according to still another aspect of the present invention comprises the steps of inputting document data groups; dividing document data into one or multiple divided document data based on a predetermined reference; creating a map showing the correspondence between the document data and the divided document data; classifying the divided document data; creating divided document classification result information based on the classification result of classifying the divided documents; and creating classification result information of the document data using the document-divided document map and the divided document classification result information.
According to the above aspect of this invention, when one document contains multiple topics and meanings, these can be classified into categories according to specific topics and meanings, so that the classifications do not differ from categories desired by a user, thereby enabling the user to easily comprehend the classification categories. Furthermore, since the positions of the divided documents in the documents prior to division (documents belonging to the clusters) are displayed, the user is able to efficiently read the parts of the document clusters he or she wishes to read.
A computer-readable recording medium of still another aspect of the present invention stores a program for executing the above-described document classification method on a computer, thereby making the program machine-readable and enabling the operation of the document classification method to be executed by a computer.
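The division-based method can be sketched as follows. The sketch assumes that the predetermined reference is a blank line between paragraphs and substitutes a hypothetical keyword lookup for the classification of divided documents; neither choice comes from the source, and all names are illustrative.

```python
def divide(documents):
    """Divide each document into paragraphs and build the map from each
    divided document back to (source document, position in that document)."""
    pieces, doc_map = [], []
    for doc_id, text in documents.items():
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for position, paragraph in enumerate(paragraphs):
            pieces.append(paragraph)
            doc_map.append((doc_id, position))
    return pieces, doc_map

def classify_piece(piece, keywords):
    """Hypothetical classifier: first category sharing a word with the piece."""
    words = set(piece.lower().split())
    for category, terms in keywords.items():
        if words & terms:
            return category
    return "other"

def document_results(pieces, doc_map, keywords):
    """Combine the divided-document classification results with the map to
    obtain, per category, the source documents and positions involved."""
    results = {}
    for piece, (doc_id, position) in zip(pieces, doc_map):
        category = classify_piece(piece, keywords)
        results.setdefault(category, []).append((doc_id, position))
    return results

documents = {
    "d1": "Stock prices fell today.\n\nThe football final is tonight.",
    "d2": "Football transfers continue.",
}
keywords = {"finance": {"stock", "prices"}, "sports": {"football"}}

pieces, doc_map = divide(documents)
results = document_results(pieces, doc_map, keywords)
```

Here document `d1` contributes its first paragraph to the finance category and its second to the sports category, and the stored positions indicate which part of the original document a reader would need to open.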
Other objects and features of this invention will become understood from the following description with reference to the accompanying drawings.