1. Field of the Invention
This invention relates to a structured document classification device, a structured document search system, and a computer-readable memory causing a computer to function as a structured document classification device and a structured document search system.
This application is based on Japanese Patent Application No. Hei 10200171, the contents of which are incorporated herein by reference.
2. Description of Related Art including information disclosed under 37 CFR 1.97 and 1.98.
One conventional process for searching for a desired document in a database of structured documents, principally the WWW (World Wide Web), collects documents published on the WWW by means of a robot called a “crawler” and converts the documents into a database to allow full-text retrieval. “Goo” (http://www.goo.ne.jp) is an example of a service providing this kind of retrieval.
Such a database includes more than one million documents, and the number of documents will increase as the WWW expands further. Consequently, when a user initiates a retrieval by inputting a small number of keywords, a great number of results are returned. The user must find the target document among these results, which takes much time and labor. The conventional retrieval processes are therefore of little practical use.
A conventional process for performing retrieval using the structural features in structured documents, for example, of SGML (Standard Generalized Markup Language) is disclosed in Japanese Unexamined Patent Application, First Publication No. Hei 7-225771. This system prepares a retrieval expression which includes the structural features of the structured documents, and enables a precise retrieval when the type of a retrieval target document (for example, a patent document, a study, or a specification) is clear.
The conventional system of Japanese Unexamined Patent Application, First Publication No. Hei 7-225771, can perform an accurate retrieval by specifying a target document in the SGML document database by keyword and by type of the target document, but is not applicable to the structured documents on the WWW (HTML: Hypertext Markup Language), whose structure is less clearly defined than that of SGML documents.
Further, because the process of Japanese Unexamined Patent Application, First Publication No. Hei 7-225771 requires examples of the structured documents, the conventional process is not applicable to the WWW.
Further, Japanese Unexamined Patent Application, First Publication No. Hei 9-311869 discloses a search server which, in response to an input of search parameters, searches for target information from a number of URLs. Japanese Unexamined Patent Application, First Publication No. Hei 10-124519 discloses an information display device which automatically arranges keywords in a hierarchical structure.
It is therefore an object of the present invention to provide a structured document classification device which enhances the accuracy of a search and reduces the labor of a searcher searching for a target document by classifying the target HTML documents according to types beforehand.
In one aspect of the present invention, the structured document classification device for classifying structured documents by type comprises: a structural feature extracting section for extracting a structural feature or an incidental feature from each structured document; a structural feature rule base for storing a rule dedicated to the extracted structural feature or incidental feature; and a verifier for verifying each feature extracted by the structural feature extracting section against the rule stored in the structural feature rule base, and calculating the relevance of the document to each type.
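The verification step described above can be illustrated with a minimal sketch. It assumes that a rule pairs one extracted feature with a document type and a numeric weight, and that relevance is the sum of the weights of the matching rules. The names `RULE_BASE` and `verify`, the type names, and the weights are hypothetical and are not taken from the specification.

```python
from collections import defaultdict

# Hypothetical rule base: each rule ties one extracted feature to a document
# type and a weight. Features, types, and weights are illustrative only.
RULE_BASE = [
    {"feature": ("title", "catalog"), "type": "product_catalog", "weight": 0.6},
    {"feature": ("h1", "price"), "type": "product_catalog", "weight": 0.3},
    {"feature": ("title", "diary"), "type": "personal_page", "weight": 0.7},
]


def verify(features, rule_base=RULE_BASE):
    """Check extracted features against the rules and sum weights per type."""
    feature_set = set(features)
    relevance = defaultdict(float)
    for rule in rule_base:
        if rule["feature"] in feature_set:
            relevance[rule["type"]] += rule["weight"]
    return dict(relevance)


if __name__ == "__main__":
    extracted = [("title", "catalog"), ("h1", "price"), ("p", "welcome")]
    print(verify(extracted))
```

Summing weights is only one possible way to combine rules; the specification does not state how the relevance is computed from the matched rules.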
The structural feature extracting section includes a keyword feature extractor for extracting a tag and keyword pair from each structured document. The structural feature extracting section may include an image file feature extractor for extracting a feature of an image file from each structured document. The structural feature extracting section may include a link feature extractor for extracting a feature of a link from each structured document. The structural feature extracting section may include a tag structural feature extractor for extracting a feature of a tag structure from each structured document. The structural feature extracting section may include a URL feature extractor for extracting a feature of URL information from each structured document. The structural feature extracting section may include a plugin feature extractor for extracting a feature of a plugin from each structured document. The structural feature extracting section may include an upper-lower level feature extractor for extracting structural features of an upper level document and of a lower level document from each structured document. Further, the structural feature extracting section may extract any combination of features of a tag and keyword pair, an image file, a link, a tag structure, URL information, and a plugin.
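Three of the extractors listed above (tag and keyword pairs, links, and image files) can be sketched with the `html.parser` module of the Python standard library. This is an illustration under stated assumptions, not the extractor of the invention: the class `FeatureExtractor`, the function `extract_features`, and the choice to record every whitespace-separated word under its innermost enclosing tag are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so must not go on the tag stack.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class FeatureExtractor(HTMLParser):
    """Collects (tag, keyword) pairs, link targets, and image sources."""

    def __init__(self):
        super().__init__()
        self.open_tags = []     # stack of currently open tags
        self.tag_keywords = []  # (tag, keyword) pairs
        self.links = []         # href values of <a> tags
        self.images = []        # src values of <img> tags

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        if tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.open_tags:
            return
        for word in data.lower().split():
            self.tag_keywords.append((self.open_tags[-1], word))


def extract_features(html):
    """Parse one HTML document and return its extracted features."""
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    return {
        "tag_keywords": parser.tag_keywords,
        "links": parser.links,
        "images": parser.images,
    }
```

The remaining extractors (tag structure, URL information, plugins, and upper and lower level documents) would need further inputs, such as the document's own URL and the documents it links to, and are omitted here.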
The structured document classification device of the present invention further comprises: a score controller for controlling the relevance of each structured document according to a control rule which finely controls the relevance in consideration of relationships between the types and of the context as a whole.
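One way to picture such a control rule is a rule that treats two types as mutually exclusive and damps the weaker of the two scores. The sketch below assumes exactly that; the function `control_scores`, the `exclusive_pairs` argument, and the damping factor are hypothetical, and the specification does not say that its control rules take this form.

```python
def control_scores(relevance, exclusive_pairs, damping=0.5):
    """Damp the weaker score of each pair of mutually exclusive types.

    relevance: mapping from type name to relevance score for one document.
    exclusive_pairs: hypothetical control rules, each a pair of type names
        that a single document is unlikely to belong to at the same time.
    """
    adjusted = dict(relevance)
    for first, second in exclusive_pairs:
        if first in adjusted and second in adjusted:
            weaker = first if adjusted[first] < adjusted[second] else second
            adjusted[weaker] *= damping
    return adjusted


if __name__ == "__main__":
    scores = {"product_catalog": 0.8, "personal_page": 0.4, "news": 0.2}
    rules = [("product_catalog", "personal_page")]
    print(control_scores(scores, rules))
```

The input mapping is copied rather than modified, so the uncontrolled relevance remains available for comparison.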
In another aspect of the present invention, the structured document search system using the structured document classification device comprises: an input/output device for inputting a search parameter and a type of a target structured document and for outputting search results; a search engine for performing a search, using the input search parameter, in a database storing structured documents; a type searcher for looking up the relevance, to the input type, of each structured document found by the search engine, the relevance being calculated by the structured document classification device; and a restrictor for receiving the search results from the search engine, receiving the relevance of each found structured document from the type searcher, narrowing the search results by consulting the relevance to the input type, and outputting the narrowed search results to the input/output device.
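The flow through the search engine, the type searcher, and the restrictor can be sketched as follows. The in-memory tables `DOCUMENTS` and `RELEVANCE`, the example URLs, the whole-word keyword match, and the threshold of 0.5 are assumptions made for illustration; the specification does not fix how the restrictor decides which results to keep.

```python
# Hypothetical stand-ins for the document database and for the relevance
# table computed beforehand by the classification device.
DOCUMENTS = {
    "http://example.com/shop": "camera catalog with price list",
    "http://example.com/blog": "my diary about a new camera",
    "http://example.com/news": "stock market news",
}
RELEVANCE = {
    "http://example.com/shop": {"product_catalog": 0.9},
    "http://example.com/blog": {"personal_page": 0.7, "product_catalog": 0.1},
    "http://example.com/news": {"news": 0.8},
}


def search_engine(query):
    """Return URLs of documents containing the query as a whole word."""
    word = query.lower()
    return [url for url, text in DOCUMENTS.items() if word in text.lower().split()]


def type_searcher(urls, doc_type):
    """Look up the precomputed relevance of each found document to doc_type."""
    return {url: RELEVANCE.get(url, {}).get(doc_type, 0.0) for url in urls}


def restrictor(urls, scores, threshold=0.5):
    """Keep only the results whose relevance reaches the threshold."""
    return [url for url in urls if scores[url] >= threshold]


def search(query, doc_type, threshold=0.5):
    """Run the keyword search, then narrow the results by document type."""
    found = search_engine(query)
    scores = type_searcher(found, doc_type)
    return restrictor(found, scores, threshold)


if __name__ == "__main__":
    print(search("camera", "product_catalog"))
```

Because the relevance is looked up rather than computed at query time, the narrowing step adds only a table lookup per result.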
Instead of the restrictor, the system may have a separator for receiving the search results from the search engine, receiving the relevance of the structured document found by the type searcher, grouping the found documents according to their relevance to the input type, and outputting the search results to the input/output device.
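A separator of this kind keeps every found document and only arranges the results into groups. The sketch below assumes two groups split at a relevance threshold, with each group ordered by descending relevance; the function name `separator`, the group labels, and the threshold are hypothetical.

```python
def separator(urls, scores, threshold=0.5):
    """Group found documents by their relevance to the requested type.

    urls: search results in the order returned by the search engine.
    scores: mapping from URL to relevance to the requested type.
    Nothing is discarded; documents are only arranged into groups.
    """
    groups = {"matching_type": [], "other": []}
    for url in urls:
        key = "matching_type" if scores.get(url, 0.0) >= threshold else "other"
        groups[key].append(url)
    for members in groups.values():
        members.sort(key=lambda url: scores.get(url, 0.0), reverse=True)
    return groups


if __name__ == "__main__":
    found = ["http://example.com/blog", "http://example.com/shop"]
    relevance = {"http://example.com/blog": 0.1, "http://example.com/shop": 0.9}
    print(separator(found, relevance))
```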
The first advantage of the present invention is that the classification of structured documents, for example, HTML documents, is made accurate because this invention extracts the features of a tag and keyword pair, an image file, link information, a tag structure, URL information, plugin information, any combination of these, or upper and lower level documents.
The second advantage of the present invention is that the classification is made consistent because this invention finely controls the search results in consideration of the relationships between the groups in the classification and the context as a whole.
The third advantage of the present invention is that a target HTML document can be found efficiently, because this invention calculates the relevance to the types accurately beforehand and narrows the search results based on the relevance, or because this invention calculates the relevance to the types accurately beforehand and displays the search results by groups.