1. Field of the Invention
The present invention relates to an apparatus for searching a document structure and document contents at high speed from a large number of structured documents, such as SGML documents, registered in a data base. More particularly, the present invention relates to a structured-document search apparatus which has means to convert a query of structure and contents to a Boolean expression which has been used in a conventional full-text search engine, to thereby enable utilization of the high-speed search performance of the full-text search engine.
2. Description of the Related Art
As a result of the recent popularization of word processors and the development of OCRs (Optical Character Readers), a huge volume of electronic documents has been created and accumulated. As the amount of accumulated documents grows, demand for finding a necessary document at high speed becomes stronger and stronger.
In order to satisfy such demand, full-text search engines have been developed, as described in, for example, Japanese Patent Application Laid-Open No. 10-27183 (Data Registration Method and Apparatus) and Japanese Patent Application Laid-Open No. 8-249354 (Word Index, Word-index Creation Apparatus, and Document Search Apparatus). Such a full-text search engine is designed to search the entirety of each document, and has indexes for referring at high speed to documents which include a designated search key. Each of the full-text search engines described in these publications eliminates the necessity of adding search keywords to each document and does not overlook matching documents during searching. However, since the entirety of each document is searched in a uniform or fixed manner, a search request which designates a document structure cannot be processed.
Meanwhile, with the explosive popularization of the Internet, a large volume of documents each having a structure (hereinafter referred to as “structured documents”), such as HTML (Hypertext Markup Language) documents and XML (Extensible Markup Language) documents, has been created. Further, in enterprises, SGML (Standard Generalized Markup Language) documents have been created and accumulated for document management and re-use of documents. In relation to the search of such structured documents, there has been increasing demand for a technique which not only searches the entirety of each structured document uniformly but also enables a user to designate search conditions for each part of each document. In order to satisfy such demand, various techniques have been developed; e.g., techniques disclosed in Japanese Patent Application Laid-Open No. 11-15843 (SGML Document Search Apparatus and SGML Document Search Method), Japanese Patent Application Laid-Open No. 11-53400 (Structured-Document Search apparatus and Machine-Readable Recording Medium Storing Program), and Japanese Patent Application Laid-Open No. 11-242676 (Structured-Document Registration Method, Search method, and Transportable Medium used Therefor).
Japanese Patent Application Laid-Open No. 11-15843 discloses a technique in which structured documents are registered into a relational data base, and a user is allowed to input a search request by use of SQL, which is a conventional data base query language. When such a technique is used, a schema must be defined in advance, and document parts which do not conform to the schema cannot be registered. Further, when a large volume of documents is registered in the data base, the search speed decreases. Therefore, in order to search the contents of documents at high speed, a full-text search engine must be provided separately from the data base.
Japanese Patent Application Laid-Open No. 11-53400 discloses a technique in which a certain region of each document is divided into a plurality of zones, and searching is performed by use of a Boolean expression on the basis of combinations of a zone and a keyword. Although this technique can search at high speed the contents of text data included in a certain document part, it does not allow a user to include in the search conditions a hierarchical relationship between document parts.
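The general idea of a Boolean search over zone and keyword combinations can be illustrated with a minimal Python sketch. This is not the method of the cited publication; the names `index`, `register`, and `search_and`, the zone names, and the sample documents are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical postings table: (zone, keyword) -> set of document ids.
index = defaultdict(set)

def register(doc_id, zones):
    """Index a document; `zones` maps a zone name to the text in that zone."""
    for zone, text in zones.items():
        for word in text.lower().split():
            index[(zone, word)].add(doc_id)

def search_and(*conditions):
    """AND together (zone, keyword) conditions and return matching document ids."""
    postings = [index.get((zone, word.lower()), set()) for zone, word in conditions]
    return set.intersection(*postings) if postings else set()

# Two illustrative documents with flat "title" and "body" zones (assumed names).
register(1, {"title": "structured document search", "body": "boolean expression"})
register(2, {"title": "boolean algebra", "body": "structured document"})

hits = search_and(("title", "structured"), ("body", "boolean"))
print(hits)
```

Because each zone in this sketch is only a flat label attached to a keyword, a condition can state which zone a keyword occurs in, but it has no way to state that one document part is nested inside another.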
Japanese Patent Application Laid-Open No. 11-242676 discloses a technique which utilizes a structure index obtained through superposition of document parts of the document registered in a data base and a character index in relation to contents of each document. Although this technique requires an index for holding the structures of documents in addition to an index of an ordinary full-text search engine, it can perform searching at high speed under search conditions which include a hierarchical relationship between document parts.
Japanese Patent Application Laid-Open No. 7-56908 (Document Processing Apparatus) and Japanese Patent Application Laid-Open No. 7-319918 (Apparatus for Designating Object to be Subjected to Document Searching) disclose techniques for searching structured documents. Although these publications disclose a method for searching a single structured document, the publications do not disclose a technique adapted to search a specific document from a large volume of structured documents.
The above-described Japanese Patent Application Laid-Open No. 11-242676 discloses a method for searching at high speed under search conditions which include the hierarchical relationship of document parts. However, a hierarchical relationship which can be included in search conditions is limited to a parent-child relationship and a child-grandchild relationship, and the patent publication does not disclose a method which enables a user to include a sibling relationship in search conditions.
A problem which would arise when a sibling relationship between document parts cannot be included in search conditions will be described below.
<Employee><Section> SYSTEM DEVELOPMENT DEPT. </Section><Name> YAMADA TARO </Name></Employee>
<Employee><Section> GENERAL AFFAIRS DEPT. </Section><Name> SUZUKI HANAKO </Name></Employee>
When the search condition “SUZUKI HANAKO in SYSTEM DEVELOPMENT DEPT.” is set for searching such structured documents, the condition is described more specifically as follows: within a single document part of an <Employee> element, the text data of the <Section> element represent “SYSTEM DEVELOPMENT DEPT.” and the text data of the <Name> element represent “SUZUKI HANAKO”. If a sibling relationship between the document parts cannot be included in the search conditions, a user has no choice but to set the search conditions such that, somewhere in the document, the text data of a <Section> element represent “SYSTEM DEVELOPMENT DEPT.” and the text data of a <Name> element represent “SUZUKI HANAKO”. The above document satisfies these weaker conditions, because its first <Employee> element supplies the section and its second <Employee> element supplies the name, even though SUZUKI HANAKO belongs to GENERAL AFFAIRS DEPT. Therefore, a search result different from the desired one may be obtained.
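The difference between the two sets of search conditions can be checked with a minimal Python sketch using the standard `xml.etree.ElementTree` module. The wrapping `<Staff>` root element and the function names are assumptions introduced only so that the fragment parses as one well-formed document; they do not appear in the example above.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, wrapped in a hypothetical <Staff> root (an assumption).
DOC = (
    "<Staff>"
    "<Employee><Section> SYSTEM DEVELOPMENT DEPT. </Section>"
    "<Name> YAMADA TARO </Name></Employee>"
    "<Employee><Section> GENERAL AFFAIRS DEPT. </Section>"
    "<Name> SUZUKI HANAKO </Name></Employee>"
    "</Staff>"
)

def texts(root, tag):
    """Collect the stripped text of every element named `tag` at or below `root`."""
    return {(el.text or "").strip() for el in root.iter(tag)}

def match_without_siblings(root, section, name):
    """Document-wide test: the two conditions may be met by different employees."""
    return section in texts(root, "Section") and name in texts(root, "Name")

def match_with_siblings(root, section, name):
    """Both conditions must hold inside the same <Employee> element."""
    return any(
        section in texts(emp, "Section") and name in texts(emp, "Name")
        for emp in root.iter("Employee")
    )

root = ET.fromstring(DOC)
loose = match_without_siblings(root, "SYSTEM DEVELOPMENT DEPT.", "SUZUKI HANAKO")
strict = match_with_siblings(root, "SYSTEM DEVELOPMENT DEPT.", "SUZUKI HANAKO")
print(loose, strict)  # prints: True False
```

The document-wide test reports a hit although no single employee satisfies both conditions, whereas the test scoped to one <Employee> element correctly reports no hit.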
The above-described Japanese Patent Application Laid-Open No. 11-242676 further discloses a technique for creating a structure index obtained through superposition of the structures of the structured documents which are to be subjected to searching. In the technique, when the structures of structured documents are superposed, two nodes are regarded as corresponding to each other if (i) the respective upper nodes of the two nodes correspond to each other, (ii) the two nodes have the same element name, and (iii) the two nodes have the same order of appearance among the sibling nodes of that element name, counted in the forward direction from the head of the row of sibling nodes. Therefore, the following Document 1 and Document 2 are treated as having completely the same structure and text data.
· Document 1
<Document><Part1> STRUCTURING </Part1><Part2> DOCUMENT </Part2><Part3> RETRIEVAL </Part3></Document>
· Document 2
<Document><Part2> DOCUMENT </Part2><Part1> STRUCTURING </Part1><Part3> RETRIEVAL </Part3></Document>
In other words, although the order of sibling nodes of the same element name is preserved, the order of sibling nodes of different element names is ignored.
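The effect of this correspondence rule can be reproduced with a minimal Python sketch. The sketch models only the rule as stated above, identifying each node by the path of (element name, ordinal among same-name siblings) pairs from the root; it is not the index structure of the cited publication, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DOC1 = ("<Document><Part1> STRUCTURING </Part1><Part2> DOCUMENT </Part2>"
        "<Part3> RETRIEVAL </Part3></Document>")
DOC2 = ("<Document><Part2> DOCUMENT </Part2><Part1> STRUCTURING </Part1>"
        "<Part3> RETRIEVAL </Part3></Document>")

def superposition_keys(xml_text):
    """Return the set of (path, text) pairs, where each path step is
    (element name, ordinal among siblings of the same element name)."""
    keys = set()

    def walk(node, path):
        keys.add((path, (node.text or "").strip()))
        seen = Counter()  # per-parent count of children seen so far, by element name
        for child in node:
            seen[child.tag] += 1
            walk(child, path + ((child.tag, seen[child.tag]),))

    root = ET.fromstring(xml_text)
    walk(root, ((root.tag, 1),))
    return keys

def ordered_children(xml_text):
    """Return the element names of the root's children in document order."""
    return [child.tag for child in ET.fromstring(xml_text)]

same_under_superposition = superposition_keys(DOC1) == superposition_keys(DOC2)
same_sibling_order = ordered_children(DOC1) == ordered_children(DOC2)
print(same_under_superposition, same_sibling_order)  # prints: True False
```

Under the modeled rule the two documents produce identical key sets, even though their actual sibling order differs.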
Further, in the technique described in Japanese Patent Application Laid-Open No. 11-242676, search conditions are always set to include search keys and structure designation in combination; and this patent publication does not disclose a method in which only structure designation is used as a search condition.
Moreover, in general, when a hierarchical relationship between document parts is retrieved from a large volume of accumulated structured documents, the time required for such retrieval increases with the degree of complexity of the structures of the registered documents.