1. Field of the Invention
The present invention relates to a technology for extracting feature information from an electronic document.
2. Description of the Related Art
With large-capacity, inexpensive storage media and the rapid spread of intranets and the Internet, a large number of electronic documents can easily be gathered and stored using a computer. Because such a vast amount of information is available, a user who intends to acquire certain information from the electronic documents needs an analysis tool. Such a tool must be able to output, according to the user's need, for example, relationships between character strings that represent a feature of a document, such as keywords (words and compound words) and phrases (hereinafter, “feature information”), as well as classification results based on the frequency of occurrence of the feature information.
However, the viewpoint from which information is analyzed varies depending on the purpose, and the feature information also varies depending on the viewpoint. For example, when one tries to create the table shown in FIG. 9 by classifying or grouping a large number of patent publications in order to analyze recent technical trends in the field of hybrid electric vehicles, the keyword serving as the reference for associating documents differs depending on the viewpoint. From the viewpoint of the subject matter of the invention (subject matter shown in FIG. 9), keywords such as “CONTROLLING APPARATUS” and “DRIVING APPARATUS” may become the reference. From the viewpoint of what kind of problem is to be solved by the invention (object shown in FIG. 9), keywords such as “FUEL COST” and “FUEL CONSUMPTION” may become the reference.
In this regard, technologies for extracting, as the feature information, important character strings in a document and character strings that serve as a key from a specific viewpoint have already been disclosed in, for example, Japanese Patent Application Laid-Open Publication No. H11-250097 and Japanese Patent Application Laid-Open Publication No. 2001-101199.
However, in the conventional technologies, if the extraction rules are made stricter to improve the accuracy of extraction of the feature information, the extraction rate declines (i.e., more information fails to be extracted), and if the extraction rules are relaxed to improve the extraction rate, the accuracy of extraction declines (i.e., useless information increases).
For example, suppose that, in a patent publication, the part “ . . . ” in “RELATED TO . . . ” is extracted as the feature information from the viewpoint of the subject matter of the invention, and the part “ . . . ” in “TO IMPROVE . . . ” is extracted as the feature information from the viewpoint of the object. In this case, the keyword “ENGINE” may be extracted as the feature information from both viewpoints. If, so that “ENGINE” is not extracted from the target viewpoint, the part “ . . . ” in “TO IMPROVE . . . PROPERTY” is extracted instead of the part “ . . . ” in “TO IMPROVE . . . ”, the word “EFFICIENCY” cannot be picked up as the feature information from the target viewpoint from the phrase “TO IMPROVE EFFICIENCY”.
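The trade-off in the above example can be illustrated by the following minimal sketch. It is not the implementation of any of the cited publications. The regular expressions, the helper `extract`, and the sample sentences are hypothetical and are chosen only to mirror the phrases quoted above, with the stricter rule applied to the viewpoint of the object.

```python
import re

# Hypothetical extraction rules (illustrative only, not from the cited art).
SUBJECT_RULE = re.compile(r"RELATED TO (\w+)")                    # subject matter viewpoint
OBJECT_RULE_RELAXED = re.compile(r"TO IMPROVE (\w+)")             # object viewpoint, relaxed
OBJECT_RULE_STRICT = re.compile(r"TO IMPROVE (\w+) PROPERTY")     # object viewpoint, strict


def extract(rule, sentences):
    """Return the set of keywords captured by `rule` in any of the sentences."""
    found = set()
    for sentence in sentences:
        found.update(rule.findall(sentence))
    return found


# Hypothetical sample sentences, assumed for illustration.
sentences = [
    "RELATED TO ENGINE",
    "TO IMPROVE ENGINE",
    "TO IMPROVE EFFICIENCY",
    "TO IMPROVE COMBUSTION PROPERTY",
]

subject = extract(SUBJECT_RULE, sentences)
relaxed = extract(OBJECT_RULE_RELAXED, sentences)
strict = extract(OBJECT_RULE_STRICT, sentences)

# Relaxed rule: "ENGINE" is shared by both viewpoints (independence is lost).
shared_relaxed = subject & relaxed
# Strict rule: no keyword is shared, but "EFFICIENCY" is no longer extracted.
shared_strict = subject & strict
missed_by_strict = relaxed - strict - subject

print("shared (relaxed):", sorted(shared_relaxed))
print("shared (strict):", sorted(shared_strict))
print("missed by strict:", sorted(missed_by_strict))
```

With the relaxed rule, the two viewpoints share a keyword. With the strict rule, the viewpoints are independent, but a keyword that belongs to the viewpoint of the object is missed.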
The conventional technologies do not take this trade-off into consideration. Therefore, if the independence of each viewpoint is guaranteed (i.e., a plurality of viewpoints are not allowed to have the same feature information), the extraction rate is sacrificed.