1. Field of the Invention
The present invention relates to a structured document processing apparatus for managing a plurality of structured documents having different document structures using a structured document database having a hierarchized logical structure, a structured document search apparatus for searching the structured document processing apparatus for a desired structured document, a structured document system including the structured document processing apparatus and the structured document search apparatus, a method, and a program.
2. Description of the Related Art
A structured document database for storing and searching structured document data described in XML (Extensible Markup Language) or the like has been proposed. Using the structured document database, search processing that takes document structure into account, which is difficult to achieve with a conventional text database, can be implemented. In order to apply search processing to this structured document database, a query language for structured documents (represented by XQuery) is used. XQuery is a query language standardized by the W3C (World Wide Web Consortium). A characteristic feature of the query language is that the search results are not limited to those obtained by filtering; new composite data having a structure can be generated based on documents serving as a plurality of information sources.
On the other hand, in the field of full-text search, text databases that manage character strings as structure-less documents predominate. Scoring, wildcard search, neighboring (proximity) search, ambiguous (fuzzy) search, and the like are known as important functions of full-text search. A text database is often required to conduct searches using these functions.
In particular, scoring is an indispensable function in full-text search. By introducing scoring, the pieces of information (e.g., documents) that match the query more precisely are ranked higher in the search results, so that the user can quickly acquire only the required information.
The structured document database also allows full-text-search-like use by designating keywords in a query. However, this capability amounts to a prefix search function at most, and functions such as scoring have not been sufficiently considered. Moreover, since a structured document has a structure, i.e., it is made up of a plurality of elements, a score cannot simply be obtained for each document as in full-text search.
A known score calculation scheme is the tf-idf (term frequency-inverse document frequency) scheme. “tf” indicates the frequency of occurrence of a term in a document of interest, and “idf” is derived from the inverse of the number of documents including that term. “tf” gives higher priority to a term with a higher frequency, and “idf” serves as a measure of whether or not that term is characteristic, i.e., a term that appears in fewer documents receives a higher value. By multiplying these values, a score is obtained as a tf-idf value.
Since a structured document is made up of a plurality of elements, the level (granularity) at which scores are calculated becomes important. In recent years, as demand for structured document databases has grown, high-speed scoring in the structured document database is expected to be implemented.
In order to introduce scoring in the structured document database, scoring precision is important, and it is also important to obtain the scores within a practical time and with practical resources. That is, the problems of “precision” and “speed” must both be addressed.
For example, Jpn. Pat. Appln. KOKAI No. 2002-297605 proposes a structured document database which implements scoring in consideration of ambiguity of structures and lexis. In this reference, desired data is generated by performing synonym expansion of element names, values, and the like using a semantic network, and by calculating similarities of structures and lexical items using “depth information” in the hierarchical relationship.