1. Field of the Invention
The present invention relates to an apparatus and a method for searching data of a structured document such as an XML (Extensible Markup Language) document.
2. Description of the Related Art
An XML document is a document which is structured by describing each element of document data using a tag, and has a hierarchical structure. When an XML document is represented by a tree structure, each element of the tree is called a node. There are two conventional methods of searching data of an XML document as follows.
(a) All nodes of all documents to be searched are expanded into objects of a tree structure. The nodes are searched based on the search request condition and, if the condition is satisfied, the information about the node to be returned is located and extracted. This searching method is called an index system.
(b) All documents to be searched are temporarily expanded into a two-dimensional table when a search-return request is received. At this time, when a given node has a plurality of child nodes, additional table rows are assigned as necessary. The table is searched based on the search request condition and, if the condition is satisfied, the information about the cell (node) to be returned is extracted.
For example, when two documents as shown in FIG. 1A are to be searched, the document data of the tree structure as shown in FIG. 1B is generated in the method (a) above. It is assumed that the following search expression is input as a search request.
/doc/Grp{/A=‘X’ AND /B=‘1000’}  (1)
The search expression represents the condition that the keyword ‘X’ is contained in the node specified by the path /doc/Grp/A, and the keyword ‘1000’ is contained in the node specified by the path /doc/Grp/B. In this case, by tracing the nodes of the document data as shown in FIG. 1C, it is determined that the document 1 satisfies the search request condition.
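The tree-based search of the method (a) can be illustrated by the following minimal sketch. The sample documents are hypothetical stand-ins (the actual contents of FIG. 1A are not reproduced here), the function names are chosen for illustration only, and it is assumed that both keyword conditions of the search expression (1) must hold within the same "Grp" node.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents; not the actual contents of FIG. 1A.
DOCS = {
    1: "<doc><Grp><A>X</A><B>1000</B></Grp><Grp><A>Y</A><B>2000</B></Grp></doc>",
    2: "<doc><Grp><A>X</A><B>2000</B></Grp></doc>",
}


def grp_matches(grp, a_keyword, b_keyword):
    """Return True if one Grp node contains both keywords in its A and B children."""
    a_hit = any(a_keyword in (a.text or "") for a in grp.findall("A"))
    b_hit = any(b_keyword in (b.text or "") for b in grp.findall("B"))
    return a_hit and b_hit


def search_trees(docs, a_keyword, b_keyword):
    """Expand each document into a tree and trace its /doc/Grp nodes."""
    hits = []
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)  # every node becomes an object in memory
        if root.tag != "doc":
            continue
        if any(grp_matches(g, a_keyword, b_keyword) for g in root.findall("Grp")):
            hits.append(doc_id)
    return hits


print(search_trees(DOCS, "X", "1000"))  # [1]
```

In this sketch every document is fully parsed into node objects before any condition is evaluated, which corresponds to the expansion step described for the method (a).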
In the method (b) above, the document data having the table structure as shown in FIG. 1D is generated. In the document 1 shown in FIG. 1A, since there are two different “Grp” nodes as child nodes of the node “doc”, the data of the document 1 is stored in two rows in the table shown in FIG. 1D. In this case, as shown in FIG. 1E, the table is searched using the search expression (1), and it is determined that the document 1 satisfies the search request condition.
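The table-based search of the method (b) can be sketched in a similar way. Again, the sample documents, the column layout (one row per "Grp" node with columns "doc", "A", and "B"), and the function names are assumptions made for illustration and are not taken from FIG. 1D.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents; not the actual contents of FIG. 1A.
DOCS = {
    1: "<doc><Grp><A>X</A><B>1000</B></Grp><Grp><A>Y</A><B>2000</B></Grp></doc>",
    2: "<doc><Grp><A>X</A><B>2000</B></Grp></doc>",
}


def expand_to_rows(docs):
    """Flatten every document into table rows, one row per Grp child of doc."""
    rows = []
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for grp in root.findall("Grp"):
            rows.append({
                "doc": doc_id,
                "A": grp.findtext("A", default=""),
                "B": grp.findtext("B", default=""),
            })
    return rows


def search_table(rows, a_keyword, b_keyword):
    """Return the ids of documents having a row that satisfies both conditions."""
    return sorted({
        row["doc"]
        for row in rows
        if a_keyword in row["A"] and b_keyword in row["B"]
    })


rows = expand_to_rows(DOCS)
print(len(rows))                        # 3 (document 1 occupies two rows)
print(search_table(rows, "X", "1000"))  # [1]
```

Because the document 1 has two "Grp" nodes, it contributes two rows, which mirrors the row assignment described for the method (b).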
In the methods (a) and (b) above, in order to expand in advance all documents to be searched, the XML definition information such as a DTD (document type definition) or a schema, the information relating to the relationship between the XML definition information and the XML document, and the information relating to each tag and node in the XML document are stored in addition to each XML document.
In addition, when structured documents are searched, a hierarchical automaton can be generated using a search condition as input, and the search can be performed using the generated hierarchical automaton (for example, refer to Patent Literature 1).
Patent Literature 1: Japanese Patent Application Laid-open No. 2000-90091
However, there are the following problems with the above-mentioned conventional searching methods.
Before the searching process is performed, a document to be searched must first be analyzed. Therefore, when a document to be searched is stored, a very long processing time is required for processes such as analysis and expansion.
Since the document to be searched is divided into tags and nodes to optimize the search, a storage area several times the size of the original document is required when the document to be searched is stored.
When the document is searched and analyzed, all or a part of the stored document data must be temporarily expanded in memory to identify a node satisfying the search request condition. Therefore, depending on the amount of stored document data, the memory resources consumed by search and analysis increase considerably.
A group of XML documents to be searched must be unified into a standardized format specified by designated XML definition information, based on the logic of the storage system. Furthermore, when a search is performed, a search expression conforming to the standardized format must be used. Therefore, when a search is performed on a plurality of XML documents in different formats, the documents in each format must be searched separately, and the obtained search results must then be merged.