1. Field of the Invention
The present invention relates to an apparatus and method for retrieving a desired structured document from a structured document database that has a hierarchical logical structure and stores a plurality of structured documents having different document structures.
2. Description of the Related Art
A structured document database that stores and manages, for example, XML (Extensible Markup Language) data provides means for retrieving a desired structured document using a retrieval request described in a query language. Some query languages have a construction similar to SQL (Structured Query Language), and describe retrieval locations, retrieval conditions, information extraction portions, and the like. However, to generate query data in such a query language, the user must have information about the DTD (Document Type Definition) of the structured documents stored in the structured document database and about the lexicon generation status.
A lexicon includes many synonyms and similar words. For example, “title” can also be expressed as “heading” or “subject”, and “summary” can also be expressed as “add-up” or “abstract”. However, the conventional query language is too strict to allow a retrieval that absorbs such lexical ambiguity.
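The strictness described above can be illustrated with a minimal sketch. The two sample documents, the element names, and the helper `strict_titles` are hypothetical and are not taken from any particular structured document database; the exact-tag lookup of Python's standard `xml.etree.ElementTree` module stands in for a strict query language.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents carrying the same information under
# different element names ("title" vs. "heading").
DOCS = [
    "<paper><title>XML retrieval</title><summary>About queries.</summary></paper>",
    "<paper><heading>XML retrieval</heading><abstract>About queries.</abstract></paper>",
]

def strict_titles(xml_docs, tag="title"):
    """Return the text of every element whose tag is exactly `tag`."""
    hits = []
    for source in xml_docs:
        root = ET.fromstring(source)
        # iter(tag) matches on the exact tag name only; synonyms are not considered.
        hits.extend(element.text for element in root.iter(tag))
    return hits

print(strict_titles(DOCS))  # ['XML retrieval']
```

Only the first document is found. The second document holds the same text under “heading”, so an exact-match query misses it unless the user already knows that element name.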
On the other hand, in the field of document information retrieval (search) engines, a retrieval request is expressed using a keyword string. Some sophisticated document retrieval engines have a function of broadly interpreting the input retrieval request: using a synonym dictionary, similar word dictionary, and the like, they add keyword strings associated with the input keyword string and then make the retrieval. Using this function, lexical ambiguity of documents can be coped with. However, documents are simply retrieved while ignoring the document structure, which is important information in structured documents.
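A keyword retrieval of this kind might be sketched as follows. The synonym table, the sample documents, and the function names `expand`, `strip_markup`, and `keyword_search` are assumptions made for illustration only; an actual retrieval engine would use far larger dictionaries and an index rather than a linear scan.

```python
import re

# Hypothetical synonym dictionary, loosely following the examples in the text.
SYNONYMS = {
    "title": {"heading", "subject"},
    "summary": {"abstract"},
}

def expand(keywords):
    """Broaden the request: each keyword plus its registered synonyms."""
    expanded = set()
    for keyword in keywords:
        keyword = keyword.lower()
        expanded.add(keyword)
        expanded |= SYNONYMS.get(keyword, set())
    return expanded

def strip_markup(xml_text):
    """Discard all tags, which is exactly how the structure gets ignored."""
    return re.sub(r"<[^>]+>", " ", xml_text)

def keyword_search(docs, keywords):
    """Return indices of documents whose plain text contains any expanded term."""
    terms = expand(keywords)
    hits = []
    for index, doc in enumerate(docs):
        words = set(re.findall(r"[a-z]+", strip_markup(doc).lower()))
        if words & terms:
            hits.append(index)
    return hits

DOCS = [
    "<paper><title>Query languages</title><body>The abstract is short.</body></paper>",
    "<paper><abstract>Structured retrieval</abstract><body>No overview here.</body></paper>",
    "<paper><title>Indexing</title><body>Nothing relevant.</body></paper>",
]

print(keyword_search(DOCS, ["summary"]))  # [0]
```

The expansion lets “summary” match the word “abstract” in the body text of the first document. The second document, which actually contains an “abstract” element, is missed because the tags are discarded before matching.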
The conventional structured document retrieving scheme suffers from the following problems.
(1) A similar object retrieval that considers not only lexical similarity but also similarity of the document structure cannot be made.
(2) Unlike with the SQL of a database, a retrieval request that extracts some similar components of a structured document cannot be described.
(3) Similarity calculations of a lexical item must be made.