Documents processed on computers and similar devices are commonly stored as plain text, HTML (Hypertext Markup Language), SGML (Standard Generalized Markup Language), or XML (Extensible Markup Language), which is attracting attention as a next-generation HTML. Of these formats, HTML, SGML, and XML allow a document to carry a hierarchical structure by means of tags, which are delimited by element identifiers such as “<” and “/>”, so that the document can hold more information than plain text can. These formats have therefore come into wide use on computers. A generally known method for efficiently searching a document with such a hierarchical structure uses a query expression to find the documents or nodes that contain an element corresponding to the query expression. The XPath expression, used to search XML documents, is a particularly well-known query expression.
An XPath expression is a character string made up of conditions on elements and attributes (strictly speaking, location steps) separated by slashes “/”. For example, /html/body/p is an XPath expression consisting of three conditions: “html”, “body”, and “p”. Here each of “html”, “body”, and “p” is a condition on an element name (tag name). When the XPath expression /html/body/p is evaluated against an HTML document, it searches for every element “p” immediately under an element “body” that is itself immediately under an element “html”. In general, a tree contains a plurality of such elements “p”, so an XPath expression usually selects a set of nodes in an XML document.
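The step-by-step reading of /html/body/p can be illustrated with a minimal Python sketch. The sample document and the helper name select_child_path are assumptions made for illustration only; the sketch applies one child step at a time and is not a full XPath evaluator.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the text).
DOC = (
    "<html><head><title>t</title></head>"
    "<body><p>one</p><div><p>nested</p></div><p>two</p></body></html>"
)


def select_child_path(root, names):
    """Evaluate an absolute path made only of child steps, e.g. /html/body/p."""
    if not names or root.tag != names[0]:
        return []
    current = [root]
    for name in names[1:]:
        # One location step: keep only direct children with the required tag name.
        current = [child for node in current for child in node if child.tag == name]
    return current


root = ET.fromstring(DOC)
result = select_child_path(root, ["html", "body", "p"])
print([p.text for p in result])
```

The “p” nested inside “div” is not selected, because it is not immediately under “body”; the two remaining “p” elements form the resulting node set.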
In an XPath expression, an axis may be specified in addition to a condition on a tag name. If no axis is specified, as in /html, the “child” axis, which designates the elements immediately under a node of the tree structure (the children of the node), is assumed. An axis is specified with syntax such as /descendant::p, where “descendant” indicates descendant elements. With /descendant::p, every descendant “p” within the tree structure is searched for, rather than only the elements immediately below. For simplicity, a step on the “descendant” axis can be abbreviated in the form “//p”.
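The difference between the child axis and the descendant axis can be sketched as follows. The document and the function names child_axis and descendant_axis are hypothetical and serve only to contrast the two axes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the text).
DOC = "<html><body><p>one</p><div><p>nested</p></div></body></html>"


def child_axis(node, name):
    """Elements named `name` immediately under `node` (child axis)."""
    return [child for child in node if child.tag == name]


def descendant_axis(node, name):
    """Elements named `name` anywhere below `node` (descendant axis)."""
    # Element.iter() also yields the node itself, so it is skipped here.
    return [d for d in node.iter(name) if d is not node]


root = ET.fromstring(DOC)
body = root.find("body")
children = [p.text for p in child_axis(body, "p")]
descendants = [p.text for p in descendant_axis(body, "p")]
print(children, descendants)
```

With this input the child axis yields only the first “p”, while the descendant axis also reaches the “p” nested inside “div”.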
In an XPath expression, a predicate can also be specified in addition to a condition on a tag name and an axis. A predicate describes a condition that a node of the XML tree must satisfy in order to be selected during a search. Predicates can be logically combined with “and”, “or”, or “not”.
The W3C has published a specification for the above-mentioned XPath. Xalan and similar systems are known as XPath evaluating systems that comply with the W3C specification; such a system can evaluate every axis and predicate of XPath. The XPath evaluating system is implemented on a computer or the like by loading the entire XML document into memory using DOM or a similar data structure. DOM is known as an XML manipulation interface that holds the whole tree structure of an XML document in memory. A system and a method referred to as SAX are also known. SAX is an interface (API) that reads an XML document sequentially from the top of the document and delivers it in the form of events. In the present invention, application interfaces (APIs) such as SAX that read an XML document sequentially from the top of the document are collectively called stream-based APIs.
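The contrast between a tree held in memory and a sequence of events can be sketched with Python's standard xml.dom.minidom and xml.sax modules. These modules stand in for DOM and SAX in general; the class name EventRecorder and the sample document are assumptions for illustration.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document (not taken from the text).
DOC = "<a><b>x</b><c/></a>"


class EventRecorder(xml.sax.ContentHandler):
    """Records start and end events in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


# Stream-based reading: the document is delivered as events, from the top.
recorder = EventRecorder()
xml.sax.parseString(DOC.encode("utf-8"), recorder)
print(recorder.events)

# DOM-style reading: the whole tree is built in memory and can be revisited.
dom = minidom.parseString(DOC)
dom_children = [node.tagName for node in dom.documentElement.childNodes]
print(dom_children)
```

The event handler sees each tag exactly once and keeps nothing unless it stores it itself, whereas the DOM tree remains available for arbitrary navigation after parsing.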
In a document-searching system that reads a document sequentially from the top through the above-mentioned stream-based API, the nodes of the tree representation of the XML document are in most cases visited (1) depth first and (2) from left to right. XMLTK, which is disclosed at http://XML.coverpages.org/ni2002-08-26-d.html, and similar systems have been proposed as SAX-based XPath evaluating systems. Conventional stream-based evaluating systems are inconvenient in that they cannot handle the full range of expressions permitted by the XPath specification. Much effort has been made to reduce this inconvenience. For example, D. Olteanu et al. attempt to remove reverse axes from XPath expressions in “XPath: Looking Forward”, http://www.cis.uni-muenchen.de/people/Meuss/Pub/XMLDM02.pdf, but this approach is still inadequate for handling the full range of the XPath specification.
Another known technical concept is the “automaton”, which consists of a storage means for holding symbols, a device for reading the symbols written in the storage means, and a state-controller. An automaton reads a symbol written in the storage means, changes its internal state according to the previous state held in the state-controller and the symbol read, and finishes the process when the latest state held in the state-controller matches a transition state. When implemented on a computer, an automaton includes a table structure consisting of a plurality of state transitions. The basics of the automaton are described in detail in, for example, “Introduction to Automata Theory, Languages, and Computation I, II” by John E. Hopcroft and Jeffrey D. Ullman (translated by Nozaki, Takahashi, Machida, and Yamazaki), published by SAIENSU-SHA, 1986.
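A table-driven automaton of this kind can be sketched in a few lines of Python. The states, symbols, and the set of accepting states below are hypothetical; the table simply accepts the symbol “a” followed by one or more “b”.

```python
# Hypothetical transition table: (current state, symbol read) -> next state.
TRANSITIONS = {
    ("start", "a"): "seen_a",
    ("seen_a", "b"): "seen_ab",
    ("seen_ab", "b"): "seen_ab",
}
# Assumption: the run succeeds when it ends in one of these states.
ACCEPTING = {"seen_ab"}


def run(symbols, state="start"):
    """Read symbols one by one and follow the transition table."""
    for symbol in symbols:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            # No transition is defined for this state and symbol.
            return False
    return state in ACCEPTING


print(run("abb"), run("a"), run("ba"))
```

The state-controller corresponds to the variable state, and the table TRANSITIONS plays the role of the table structure of state transitions.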
In the present invention, the term “query automaton” refers to an automaton in which a certain state is distinguished from the other states as a search state and is subjected to state control different from that applied to the other states. Such search automata have been proposed by Neven (F. Neven, Design and Analysis of Query Languages for Structured Documents, PhD Thesis, Limburgs Universitair Centrum, 1999) and others. Neven's work studies the characteristics of search automata and the like, but makes no suggestion about the applicability of a query automaton to a stream-based document-searching system or about a specific configuration for it.
Among the above-mentioned conventional evaluating devices, an evaluating device that complies with the W3C specification, such as Xalan, is configured mainly using DOM. The reasons for this are as follows:
1. The XPath expression //a/ancestor::b points to an element “b” that is an ancestor of an element “a” located somewhere in the stored document (in this expression, “//a” is an abbreviation for “/descendant::a”). This type of XPath expression is generally evaluated by descending and then ascending the tree structure. Thus, it cannot be evaluated by a method that reads an XML document sequentially from the top.
2. The XPath expression //a/preceding-sibling::b points to an element “b” that is a preceding sibling of an element “a” located somewhere in the stored XML document. This expression is evaluated by moving rightward and then returning leftward in the tree structure. Thus, it cannot be evaluated by a method that reads an XML document sequentially from the top of the document.
3. The XPath expression //a[.//b] selects all XML documents that include an element “a” having an element “b” as its descendant. This XPath expression is evaluated in the order of: (i) searching for an element “a”; and then (ii) checking whether an element “b” is among its descendants. This evaluation can be performed by sequentially checking the document from the top. However, because the element “a” is what is selected as the result, information on where the element “a” exists must be stored.
4. The XPath expression //a[.//b and .//c] selects an XML document that includes an element “a” having elements “b” and “c” as its descendants. This XPath expression is evaluated in the order of: (i) searching for an element “a”; then (ii) checking whether an element “b” is among its descendants; and finally (iii) checking whether an element “c” exists. This evaluation cannot be performed by sequentially checking an XML document from the top, either.
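The bookkeeping required by reason 3 can be illustrated with a single-pass Python sketch for //a[.//b]. It is not the method of the present invention; the class name DescendantPredicate, the sample document, and the use of a start-tag counter as the stored position of each “a” are assumptions for illustration.

```python
import xml.sax

# Hypothetical sample document (not taken from the text).
DOC = "<r><a><c><b/></c></a><a><c/></a><a><a><b/></a></a></r>"


class DescendantPredicate(xml.sax.ContentHandler):
    """Single-pass sketch for //a[.//b] over a stream of events."""

    def __init__(self):
        super().__init__()
        self.position = 0   # document-order index of the current start tag
        self.open_a = []    # [position, has_b] for every "a" that is still open
        self.matches = []   # stored positions of "a" elements that contain a "b"

    def startElement(self, name, attrs):
        self.position += 1
        if name == "b":
            # Every "a" that is still open is an ancestor of this "b".
            for entry in self.open_a:
                entry[1] = True
        if name == "a":
            self.open_a.append([self.position, False])

    def endElement(self, name):
        if name == "a":
            position, has_b = self.open_a.pop()
            if has_b:
                self.matches.append(position)


handler = DescendantPredicate()
xml.sax.parseString(DOC.encode("utf-8"), handler)
print(sorted(handler.matches))
```

The list open_a is the stored information on where each candidate “a” exists: the answer for an “a” is not known when its start tag is read, so its position has to be kept until its end tag is reached.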
The condition in reason 4 above is referred to as a conjunctive condition. The XPath expression //a[.//b or .//c] selects an XML document that includes an element “a” having an element “b” or “c” as its descendant; this condition is referred to as a disjunctive condition. An XPath expression that contains only a disjunction can be rewritten immediately into, for example, .//a[.//*[name()=“b” or name()=“c”]]. Therefore, it can be evaluated by: (i) searching for an element “a”; and then (ii) continuing the search until a node whose name is “b” or “c” is encountered among its descendants.
Therefore, in conventional stream-based document searches, an XPath expression can easily be evaluated sequentially from the top of the document provided that it includes neither a special axis (as in reasons 1 and 2) nor a special predicate (as in reasons 3 and 4).
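For an expression with only child steps and no predicate, sequential evaluation needs nothing more than a stack of the currently open element names. The following Python sketch, with the hypothetical class name ChildPathMatcher and a hypothetical sample document, evaluates /html/body/p in this way.

```python
import xml.sax

# Hypothetical sample document (not taken from the text).
DOC = "<html><body><p>one</p><div><p>nested</p></div><p>two</p></body></html>"


class ChildPathMatcher(xml.sax.ContentHandler):
    """Matches an absolute path of child steps against a stream of events."""

    def __init__(self, steps):
        super().__init__()
        self.steps = steps    # e.g. ["html", "body", "p"]
        self.stack = []       # names of the currently open elements
        self.texts = []       # direct text of every matched element
        self._buffer = None   # text collected for the element being matched

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.steps:
            self._buffer = []

    def characters(self, content):
        # Collect only text that sits directly inside the matched element.
        if self._buffer is not None and self.stack == self.steps:
            self._buffer.append(content)

    def endElement(self, name):
        if self._buffer is not None and self.stack == self.steps:
            self.texts.append("".join(self._buffer))
            self._buffer = None
        self.stack.pop()


matcher = ChildPathMatcher(["html", "body", "p"])
xml.sax.parseString(DOC.encode("utf-8"), matcher)
print(matcher.texts)
```

No part of the document has to be revisited: whether an element matches is decided at its start tag from the stack alone, which is why this class of expression is easy for a stream-based system.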
Although document evaluating systems using conventional DOM are known, as mentioned above, DOM is inconvenient in that it has lower memory efficiency and poorer performance than stream-based SAX, because DOM requires the entire XML document to be loaded into memory. A DOM-based evaluating system can also be conceived that does not read in a part of the document until a node in that part is about to be examined. Even in such an evaluating device, however, DOM never discards the tree structure of the XML document once it has been read in and constructed. Some XML documents are too large to fit even in several gigabytes of memory. Creating and holding a DOM tree for such a large XML document in memory is impractical in terms of hardware resources. This has constrained the applicability of document-searching systems.
With a naive algorithm using a DOM tree, the same part of a document is checked repeatedly, as mentioned above, which makes it a rather inefficient algorithm for evaluating an XPath expression. If, for example, in evaluating //a[.//b], the descendant nodes of an element “a” could be checked for “b” while each node is also checked as to whether it is another element “a” to be searched for, document searching would be more efficient.
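The repeated checking can be made concrete with a small Python sketch that counts how many nodes a naive tree-based evaluation of //a[.//b] inspects. The function name naive_visits and the nested sample document are assumptions chosen to make the repetition visible.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: three nested "a" elements around one "b".
DOC = "<a><a><a><b/></a></a></a>"


def naive_visits(root):
    """Naive evaluation of //a[.//b]; returns (matches, predicate checks)."""
    matches = 0
    visits = 0
    for candidate in root.iter("a"):      # outer search for every "a"
        for node in candidate.iter():     # inner search restarts below each "a"
            if node is candidate:
                continue
            visits += 1
            if node.tag == "b":
                matches += 1
                break
    return matches, visits


root = ET.fromstring(DOC)
matches, visits = naive_visits(root)
total_elements = sum(1 for _ in root.iter())
print(matches, visits, total_elements)
```

Here the inner search inspects more nodes than the document contains, because the subtree of each inner “a” is walked again for every enclosing “a”; a combined single pass would inspect each node once.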
When an XPath expression has no special axis or predicate, it can easily be evaluated by an event-driven processing system such as SAX, as mentioned above. For reasons 1 to 4 above, however, conventional stream-search techniques cannot interpret every representation of an axis or predicate. This results in problems including the following: (a) only a condition on a child can be written in a predicate (e.g., the XPath expression //a[.//b] cannot be written); (b) XMLTK currently avoids evaluating predicates; and (c) although omnimark, which is disclosed at http://www.tas.co.jp/XML/tools/omni/omnimark, is known as a famous document conversion device for SGML, it cannot perform a path evaluation in which two conditions are joined by “and” (e.g., //a[.//b and .//c] cannot be described).
To solve the above-mentioned problems, it is also possible to use a technique for removing reverse axes from an XPath expression, as shown by Olteanu et al.; for example, removing the “ancestor” axis from the XPath expression //a[ancestor::b] yields the equivalent XPath expression //b//a. However, this is not a general method: removing the “ancestor” axis, which is a reverse axis, from an XPath expression such as //a[not(ancestor::b)] gives a completely different result. No document-searching system or document-searching method is known that uses a stream-based API to realize an evaluating device that complies with the W3C specification and addresses all of the above-mentioned problems.
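The equivalence of //a[ancestor::b] and //b//a, and the failure of the same rewriting under negation, can be checked on a small example. The Python sketch below walks the tree by hand, since the ancestor axis is not assumed to be available in the library used; the document, the id attributes, and the helper name with_ancestor_flags are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the text).
DOC = "<r><b><a id='1'/></b><a id='2'/><c><a id='3'/></c></r>"


def with_ancestor_flags(root):
    """Yield (element, has_b_ancestor) for every element in document order."""
    def walk(node, under_b):
        yield node, under_b
        for child in node:
            yield from walk(child, under_b or node.tag == "b")
    return walk(root, False)


root = ET.fromstring(DOC)
flags = list(with_ancestor_flags(root))

# //a[ancestor::b]
with_b = [e.get("id") for e, under_b in flags if e.tag == "a" and under_b]
# //a[not(ancestor::b)]
without_b = [e.get("id") for e, under_b in flags if e.tag == "a" and not under_b]
# //b//a, the forward-only rewriting of //a[ancestor::b]
rewritten = [a.get("id") for b in root.iter("b") for a in b.iter("a")]

print(with_b, without_b, rewritten)
```

On this input //b//a returns the same node as //a[ancestor::b], but it returns none of the nodes selected by //a[not(ancestor::b)], which shows why the negated form cannot be handled by the same rewriting.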
Moreover, both searching systems using DOM and searching systems using SAX have the problem that they cannot perform a path evaluation in which two conditions are connected with “and”. Conventional searching systems also cannot perform a path evaluation that includes a sibling axis (following-sibling, preceding-sibling, etc.). In addition, conventional searching systems cannot process the negative expression “not”, which indicates that no node satisfies a predicate.
[Problems to be Solved by the Invention]
The present invention has been made in view of the above-mentioned problems of the prior art. It is intended to enable document searching based on a stream-based API that can interpret all of the above-mentioned representations of axes and predicates, is highly versatile with respect to input query expressions, performs evaluation with high efficiency, and saves hardware resources. Hereinafter, a stream-based search will be referred to simply as a stream search.