The present invention relates to a structured document search method that searches a structured document database having a logical structure on the basis of a search request that includes a document logical structure, and to a structured document search apparatus and a structured document search system.
Keyword designation is one method of designating a search request to a document database. In this method, when a user submits a search request to the document database in the form of a keyword string, a group of documents containing the keyword string is returned.
Such a simple and primitive search request method is widely used in full-text search engines and the like; however, it suffers from (1) an accuracy problem, in that documents of no practical use are also retrieved, and (2) a granularity problem, in that the unit of retrieval is the whole document, which contains data other than the portion actually to be used.
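The two problems can be illustrated with a minimal sketch of keyword designation. The document identifiers, the sample texts and the function name `keyword_search` are hypothetical and serve only as an illustration; they are not taken from any of the systems discussed here.

```python
# Hypothetical sample database: document id -> full text.
documents = {
    "doc1": "Patent search systems. This chapter covers XML indexing.",
    "doc2": "A cooking guide. The word XML appears once in a footnote.",
    "doc3": "Relational databases and SQL.",
}

def keyword_search(keyword):
    """Return the ids of all documents whose text contains the keyword."""
    needle = keyword.lower()
    return sorted(doc_id for doc_id, text in documents.items()
                  if needle in text.lower())

hits = keyword_search("XML")
```

In this sketch the cooking guide is returned alongside the relevant document (the accuracy problem), and each hit is a whole document rather than the passage that mentions the keyword (the granularity problem).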
Recently, markup languages for structured documents, such as SGML (Standard Generalized Markup Language) and XML (Extensible Markup Language), have been proposed. By designating a search request based on the document structure, they make it possible to realize (1) a search more accurate than the conventional keyword search, and (2) a fine-grained search that obtains only the portion of data to be used. In this case, however, the document structure must be unified beforehand into a fixed one, and it is inconveniently impossible to change the document structure afterward or to vary it for each piece of data.
On the other hand, an RDB (Relational DataBase) allows a search request based on the table structure to be designated in SQL. SQL is an RDB query language standardized in ANSI X3, 1 and ISO/TC97/SC21/WG3 N117 (1987). However, it is difficult to convert a document structure directly into a table format, and an RDB cannot be used as it is as a document database.
Further, a method of applying the search languages used in an OODB (Object Oriented DataBase) to a structured document database such as SGML or XML may be devised. Because a structured document has a hierarchical structure, it is considered highly compatible with an OODB, which treats each component as an object. However, in an OODB the document structure must be decided beforehand by the schema, and some structures, such as arbitrary repetition of child elements, are difficult to model with an object model; an OODB therefore cannot be used as it is as a document database.
To resolve such inconveniences, it has been proposed to equip a document repository with a language processing section in which SQL is extended with functions suited to structured documents. The first such function is path specification, which specifies a component in a hierarchical structure. Further SQL-based extensions include ambiguous path specification, in which the path specifying a component contains ambiguities such as regular expressions; structure pattern specification, which specifies a pattern of the hierarchical structure; and other functions for absorbing the structural fluctuation inherent in structured documents.
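Path specification and ambiguous path specification can be sketched as follows. The notation is an assumption made for illustration only: `/` separates one level, and `//` stands for any number of intermediate levels, including none. The sample components and the names `path_pattern_to_regex` and `select` are hypothetical.

```python
import re

# Hypothetical sample: each component is listed by its full path from the
# root element, paired with its text content.
components = [
    ("/book/title", "Structured Documents"),
    ("/book/chapter/title", "Introduction"),
    ("/book/chapter/section/title", "Background"),
    ("/book/chapter/section/para", "Keyword search returns whole documents."),
]

def path_pattern_to_regex(pattern):
    """Translate a path pattern into a compiled regular expression.

    Assumed notation: '/' separates one level; '//' matches zero or more
    intermediate levels.
    """
    parts = [re.escape(part) for part in pattern.split("//")]
    # Between two parts, allow a single '/' or '/name/name/.../'.
    return re.compile("^" + "/(?:[^/]+/)*".join(parts) + "$")

def select(pattern):
    """Return the contents of all components whose path matches the pattern."""
    regex = path_pattern_to_regex(pattern)
    return [text for path, text in components if regex.match(path)]

exact = select("/book/chapter/title")   # path specification
ambiguous = select("/book//title")      # ambiguous path specification
```

The exact path selects a single component, whereas the ambiguous path selects every title under the book element regardless of depth, which is the kind of structural fluctuation such a function is meant to absorb.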
Jpn. Pat. Appln. KOKAI Publication No. 6-203078, Jpn. Pat. Appln. KOKAI Publication No. 6-301721 and Jpn. Pat. Appln. KOKAI Publication No. 11-15843 propose methods that allow a search request having these characteristics to be specified and the search to be processed.
Jpn. Pat. Appln. KOKAI Publication No. 6-203078 (information search method and apparatus thereof) proposes a method in which the set of paths obtained by fully expanding the hierarchical structure is stored in an RDB as a string table. To search for a structured document, a component in the hierarchical structure is specified by issuing SQL that compares the path strings in the string table with the ambiguous path in the search statement. A problem of this method is that the string table, which holds the fully expanded hierarchical structure, becomes huge as the number of registered documents increases.
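The general idea of a path string table can be sketched as follows, using an in-memory SQLite database. The table name `path_table`, its columns, the sample rows and the use of the SQL `LIKE` operator for the string comparison are all assumptions made for illustration; the cited publication's actual schema and comparison method are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per component: the fully expanded path is stored as a string.
conn.execute("CREATE TABLE path_table (doc_id INTEGER, path TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO path_table VALUES (?, ?, ?)",
    [
        (1, "/book/title", "Structured Documents"),
        (1, "/book/chapter/title", "Introduction"),
        (1, "/book/chapter/section/title", "Background"),
        (2, "/report/title", "Annual Report"),
        (2, "/report/section/title", "Summary"),
    ],
)

def search(like_pattern):
    """Compare every stored path string against an ambiguous path pattern."""
    cur = conn.execute(
        "SELECT doc_id, value FROM path_table WHERE path LIKE ? ORDER BY rowid",
        (like_pattern,),
    )
    return cur.fetchall()

nested_titles = search("/book/%/title")  # titles at least one level below book
all_titles = search("%/title")           # titles anywhere in any document
```

Because the table holds one row for every component of every document, its size grows with the total number of components registered, which is the growth problem noted above.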
Jpn. Pat. Appln. KOKAI Publication No. 6-301721 (full text database search method) proposes a method in which the component types are decided beforehand and, for each component, the parent-child relationships of the hierarchical structure and the links to the actual data are stored in an RDB as structural information. During a structured document search, the search request is converted into an SQL statement. A problem of this method is that the amount of computation required for search processing becomes huge as the number of registered documents increases and the hierarchy tree grows deeper and wider, because the search processing starts from the root element and expands from each parent element to its group of child elements in order to specify a component in the hierarchical structure. Because this expansion is performed by joining RDB tables, an impractically long response time is to be expected in an implemented system. This tendency becomes especially marked when an ambiguous path is specified.
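Why the cost grows with depth can be seen in a minimal sketch that stores parent-child relationships in a single table and converts a fixed path into an SQL statement with one self-join per level. The table `element`, its columns, the sample rows and the function names are hypothetical, and the conversion shown is only one plausible reading of the approach, not the cited publication's implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Structural information: each element records its parent (NULL for the root).
conn.execute(
    "CREATE TABLE element (id INTEGER PRIMARY KEY, parent_id INTEGER, tag TEXT)"
)
conn.executemany(
    "INSERT INTO element VALUES (?, ?, ?)",
    [
        (1, None, "book"),
        (2, 1, "title"),
        (3, 1, "chapter"),
        (4, 3, "title"),
        (5, 3, "section"),
        (6, 5, "title"),
    ],
)

def path_to_sql(path):
    """Convert a fixed path such as 'book/chapter/title' into SQL.

    The statement starts at the root element and adds one self-join for
    each further level of the path.
    """
    tags = path.split("/")
    last = len(tags) - 1
    joins = " ".join(
        f"JOIN element e{i} ON e{i}.parent_id = e{i - 1}.id"
        for i in range(1, len(tags))
    )
    conditions = " AND ".join(
        ["e0.parent_id IS NULL"] + [f"e{i}.tag = ?" for i in range(len(tags))]
    )
    sql = (
        f"SELECT e{last}.id FROM element e0 {joins} "
        f"WHERE {conditions} ORDER BY e{last}.id"
    )
    return sql, tags

def resolve(path):
    """Run the generated statement and return the matching element ids."""
    sql, params = path_to_sql(path)
    return [row[0] for row in conn.execute(sql, params)]

chapter_titles = resolve("book/chapter/title")
```

A path of n levels produces n - 1 joins in this sketch. An ambiguous path does not fix the number of levels in advance, so a single statement of this form cannot express it directly, which is consistent with the observation above that the cost becomes especially marked in that case.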
Jpn. Pat. Appln. KOKAI Publication No. 11-15843 (SGML document search apparatus and SGML document search method) also decides the component types beforehand, and builds a document table in which the data is concatenated as strings for each component type. During a structured document search, the search request is converted into an SQL statement. A problem of this method is that only a single-level path can be specified, because the data is simply concatenated as strings for each component type. Another inconvenience is that the document structure must be decided beforehand, so that a flexible search request corresponding to the hierarchical structure that a document possesses cannot be issued.
These methods do not limit the amount of computation required for search processing by suitably combining an index on the data with an index on the structure, which makes it difficult for such a mechanism to adopt optimization like that of an RDB.
As described above, in the prior art it has been difficult to satisfy at the same time two requirements that are in a trade-off relationship: (1) allowing a variety of searches to be specified over the hierarchical structure a document may possess (including ambiguous paths), and (2) restricting the amount of computation required for search processing.