1. Field of the Invention
The present invention relates to a key word deriving device, a key word deriving method, and a storage medium containing a key word deriving program. More particularly, it relates to a key word deriving device and a key word deriving method for deriving, from a large amount of data, a key word that characterizes the data by performing a statistical process on a partial region of the data, and to a storage medium containing a program for deriving the key word.
2. Description of the Related Art
Japanese Unexamined Patent Publication No. HEI 08(1996)-202737 discloses a method for deriving a key word from a large amount of data by performing a statistical process on a partial region of the data. In this publication, a patent specification is used as an example. The whole data are divided into individual paragraphs by referring to preliminarily prepared index words such as “Title of the Invention” or “What is claimed is”. The method then determines the number of concurrences of each word with another word in the same sentence, the number of concurrences of each word with another word in the same paragraph, and the number of appearances of each word in the whole data. These numbers are multiplied by suitable coefficients and summed to determine an importance of each word, from which a key word is derived.
In other words, the key word is not determined simply by using a frequency of appearances of each word, but words that appear concurrently with each other in the same sentence or in the same paragraph are regarded as having a greater importance (more relevance as a key word).
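The weighted sum described above can be illustrated with a minimal Python sketch. The function name `prior_art_importance`, the coefficient values `a`, `b`, and `c`, and the exact way concurrences are counted (the number of other distinct words sharing the sentence or paragraph) are assumptions made for illustration only; the cited publication is not quoted here and may define these quantities differently.

```python
from collections import Counter

def prior_art_importance(paragraphs, a=1.0, b=0.5, c=0.1):
    """Score words as a weighted sum of three counts (illustrative only).

    paragraphs: list of paragraphs, each a list of sentences,
                each sentence a list of words.
    a, b, c:    assumed coefficients for sentence concurrence,
                paragraph concurrence, and overall frequency.
    """
    sentence_co = Counter()   # concurrences within the same sentence
    paragraph_co = Counter()  # concurrences within the same paragraph
    frequency = Counter()     # appearances in the whole data

    for paragraph in paragraphs:
        paragraph_words = set()
        for sentence in paragraph:
            frequency.update(sentence)
            distinct = set(sentence)
            paragraph_words |= distinct
            # Assumption: a word concurs once with every other distinct
            # word found in the same sentence.
            for word in distinct:
                sentence_co[word] += len(distinct) - 1
        # Same counting rule, applied at paragraph level.
        for word in paragraph_words:
            paragraph_co[word] += len(paragraph_words) - 1

    return {
        word: a * sentence_co[word] + b * paragraph_co[word] + c * frequency[word]
        for word in frequency
    }

paragraphs = [
    [["key", "word", "device"], ["key", "word"]],
    [["storage", "medium"]],
]
scores = prior_art_importance(paragraphs)
```

In this toy input, "key" and "word" appear together in two sentences and therefore score higher than "device", which in turn scores higher than "storage", whose paragraph contains only one other word.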
However, according to the key word deriving method disclosed in the above-mentioned Japanese Unexamined Patent Publication, the paragraphs are divided by using preliminarily prepared index words (such as “Title of the Invention”) on the basis of the special characteristics of the target data, so that the method of dividing the paragraphs is fixed. Also, the derived key word is a key word for the whole target data, so that the key word for each paragraph is not derived.
Therefore, no great problem arises if the target data are such that each paragraph has its own fixed meaning, as in a patent specification, and the contents are complete within a single document. However, the key word deriving method disclosed in this publication cannot be applied when the target data are a set of data divisible by various parameters such as the sender/receiver or the time of occurrence (date and time), for example a set of (electronic) mail messages sent or received by an individual person or a set of news articles from a day or a month, because it is difficult to grasp the contents of the whole target data.
The present invention provides a key word deriving device comprising: a document data acquiring section for acquiring document data each having a parameter previously added thereto; a document data dividing section for dividing the acquired document data for each type of the parameter by distinguishing the types of parameters of the document data; a document table registering section for assigning the type of the parameter to the divided document data as divided data and for registering, in a document table, words contained in the divided data and their statistical amounts; a word table registering section for calculating and registering, in a word table, the statistical amounts of the words in the divided data having the same type of the parameter added thereto by referring to the document table; an importance table registering section for calculating an importance of each word in accordance with a preliminarily prepared importance calculation formula by referring to the word table and for registering the importance of each word in an importance table; and a key word deriving section for deriving a word having a higher importance as a key word by referring to the importance table.
According to the present invention, various and numerous document data can be divided appropriately by using a parameter added to each document data. An importance of each word is calculated from the words contained in the divided data and their statistical amounts, and a word having a high importance is derived as a key word. This enables derivation of key words that show more accurately the characteristics of each of the divided data in various and numerous document data.
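The flow from document table to word table to importance table to key words can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name `derive_key_words`, the use of the sender as the parameter, whitespace tokenization, and in particular the importance formula (a word's count within a division multiplied by a logarithmic rarity factor across divisions) are all hypothetical. The text above refers only to a "preliminarily prepared importance calculation formula" without specifying it.

```python
import math
from collections import Counter, defaultdict

def derive_key_words(documents, top_n=2):
    """documents: list of (parameter_value, text) pairs."""
    # Document table: one row per document, holding its parameter
    # value and the word counts of that document.
    document_table = [
        (param, Counter(text.lower().split())) for param, text in documents
    ]

    # Word table: word counts aggregated over all divided data that
    # carry the same parameter value.
    word_table = defaultdict(Counter)
    for param, counts in document_table:
        word_table[param].update(counts)

    # Importance table. Assumed formula (not taken from the source):
    # count within the division, weighted by how few divisions
    # contain the word at all.
    n_divisions = len(word_table)
    divisions_containing = Counter()
    for counts in word_table.values():
        divisions_containing.update(counts.keys())
    importance_table = {
        param: {
            word: count * math.log(1 + n_divisions / divisions_containing[word])
            for word, count in counts.items()
        }
        for param, counts in word_table.items()
    }

    # Key word derivation: the top_n most important words per division
    # (ties broken alphabetically).
    return {
        param: sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
        for param, scores in importance_table.items()
    }

documents = [
    ("alice", "meeting budget budget report"),
    ("alice", "budget meeting"),
    ("bob", "meeting server outage server"),
]
key_words = derive_key_words(documents)
```

With this toy input, "budget" ranks first for the division labeled "alice" and "server" ranks first for "bob", while "meeting", which occurs in both divisions, is weighted down in each.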