The present invention relates to a registration and a search method for structured documents described in SGML (Standard Generalized Markup Language) or the like. More particularly, the invention is directed to a method of storing and a method of reading the lengths of elements forming a document.
As the information society grows at a rapid pace, an enormous number of electronic documents have been prepared in recent years using word processors and personal computers. Under such circumstances, there is a growing need to search large collections of electronic documents for those containing the desired information. Full-text search is a technical solution to this need. In full-text search, the entire text of each document to be registered is entered into a computer system at the time of registration to create a database, and at the time of search the database is searched for all the documents containing a string specified by the user (hereinafter referred to as “search term”). All the desired documents can thus be found reliably without requiring the user to specify a keyword during registration.
A scoring function has also been proposed, in which the matching degree to the specified search conditions is evaluated by giving a score to each of the searched documents, and a list of the documents is displayed in order of score.
The book “Information Retrieval” (written by William B. Frakes and Ricardo Baeza-Yates and published by Prentice Hall) introduces a technique in which the matching degree ($\mathrm{nfreq}_{ij}$) is calculated for searched documents using such factors as the occurrence frequency of a specified search term (hereinafter referred to as “search term occurrence frequency”) in each of the searched documents, the text length of each document and the following equation.
$$\mathrm{nfreq}_{ij} = \frac{\log_2(\mathrm{freq}_{ij} + 1)}{\log_2(\mathrm{length}_j)} \qquad \text{(Equation 1)}$$
where $\mathrm{freq}_{ij}$ is the occurrence frequency of a search term $i$ in a document $j$; and $\mathrm{length}_j$ is the text length of a document $j$.
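As an illustration only, Equation 1 can be evaluated as in the following minimal Python sketch. The function name `normalized_frequency` and its argument names are hypothetical, and the guard requiring a text length of at least 2 is an added assumption, made so that the denominator is strictly positive.

```python
import math


def normalized_frequency(freq_ij: int, length_j: int) -> float:
    """Evaluate Equation 1: log2(freq_ij + 1) / log2(length_j).

    freq_ij  -- occurrence frequency of search term i in document j
    length_j -- text length of document j (assumed here to be >= 2)
    """
    if freq_ij < 0:
        raise ValueError("freq_ij must not be negative")
    if length_j < 2:
        # Assumption of this sketch: log2(length_j) must be positive.
        raise ValueError("length_j must be at least 2")
    return math.log2(freq_ij + 1) / math.log2(length_j)


# The same frequency scores higher in a shorter text.
short_text_score = normalized_frequency(3, 16)
long_text_score = normalized_frequency(3, 256)
```

With three occurrences, a text of length 16 gives log2(4)/log2(16) = 2/4, whereas a text of length 256 gives 2/8, so the shorter text is ranked higher.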
U.S. Pat. No. 5,745,745 discloses a technique in which structured documents containing a search term are searched quickly by preparing a character component table for structured documents.
The related application cited as a cross-reference discloses a technique for registering a structured document by analyzing the hierarchical structure of the document. The application also discloses a technique in which a string index is extracted from a structured document and registered, and in which, at the time of search, a search term is decomposed into substrings and the character positions obtained from a plurality of character indexes are checked to obtain information about which positions in which documents the search term is located.
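The substring decomposition and position check described above may be pictured with the following Python sketch. It is not the implementation of the related application: the substring length of two characters (`N = 2`), the in-memory dictionary, and the names `build_index` and `find_term` are all assumptions made for illustration.

```python
from collections import defaultdict

N = 2  # assumed substring length; the text above does not fix this value


def build_index(docs):
    """Map each N-character substring to the (doc_id, position) pairs where it starts."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pos in range(len(text) - N + 1):
            index[text[pos:pos + N]].add((doc_id, pos))
    return index


def find_term(index, term):
    """Return sorted (doc_id, position) pairs at which the whole term occurs."""
    if len(term) < N:
        raise ValueError("this sketch needs a term of at least N characters")
    # Decompose the search term into overlapping substrings.
    grams = [term[k:k + N] for k in range(len(term) - N + 1)]
    hits = []
    for doc_id, pos in sorted(index.get(grams[0], ())):
        # The k-th substring must start exactly k characters after the first.
        if all((doc_id, pos + k) in index.get(grams[k], ())
               for k in range(1, len(grams))):
            hits.append((doc_id, pos))
    return hits


docs = {"d1": "structured document", "d2": "document structure"}
index = build_index(docs)
hits = find_term(index, "struct")
```

Because only positions are compared, the original text of a document does not have to be rescanned at the time of search.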
Each structured document has its own hierarchical structure. Calculating the matching degree, however, requires the element length of a partial logical structure (i.e., an element) or of a higher-level logical structure of the structured document.
The object of the present invention is to quickly obtain the occurrence frequency of a search term and the length of an element to be searched in a structured document.
The present invention provides a registration method for structured documents, comprising the steps of: preparing, for each structured document, correspondence data between a string and a string occurrence position within the structured document, and additionally storing the correspondence data in an occurrence frequency extracting index; and preparing a list of a character, an element containing the character and an element length thereof, and additionally storing the list in an element length index at the time of registration. The present invention also provides a search method for structured documents, comprising the steps of: inputting search conditions including a search term and an element for specifying a search range; decomposing the search term into a plurality of substrings; obtaining an occurrence frequency and an occurrence position of the search term using the plurality of substrings from the occurrence frequency extracting index; selecting a character from the search term, obtaining an element containing the character using the character from the element length index, and further extracting a length of the element within the search range; calculating a matching degree for the search conditions from the occurrence frequency and the occurrence position of the search term and the length of the element within the search range; and outputting the element containing the search term and the matching degree.
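The registration and search steps may be sketched in Python as follows, under several simplifying assumptions that are not taken from the text above: a document is supplied as a flat list of (element name, text) pairs with no nesting, substrings are two characters long, the element length index additionally records each element's start offset so that occurrence positions can be matched to elements, the first character of the search term is the one selected, and the matching degree is computed with Equation 1. All function and variable names are hypothetical.

```python
import math
from collections import defaultdict

N = 2  # assumed substring length


def register(doc_id, elements, occ_index, len_index):
    """Add one document, given as [(element_name, text), ...], to both indexes."""
    offset = 0
    for elem_no, (name, text) in enumerate(elements):
        for ch in set(text):
            # character -> element containing it -> (name, start, element length)
            len_index[ch][(doc_id, elem_no)] = (name, offset, len(text))
        offset += len(text)
    full_text = "".join(text for _, text in elements)
    for pos in range(len(full_text) - N + 1):
        # substring -> occurrence positions within the document
        occ_index[full_text[pos:pos + N]].add((doc_id, pos))


def search(term, element_name, occ_index, len_index):
    """Return (doc_id, elem_no, freq, length, score) tuples, best score first."""
    if len(term) < N:
        raise ValueError("this sketch needs a term of at least N characters")
    grams = [term[k:k + N] for k in range(len(term) - N + 1)]
    positions = [
        (d, p) for d, p in occ_index.get(grams[0], ())
        if all((d, p + k) in occ_index.get(grams[k], ())
               for k in range(1, len(grams)))
    ]
    results = []
    # Select one character of the term and look up the elements containing it.
    for (doc_id, elem_no), (name, start, length) in len_index.get(term[0], {}).items():
        if name != element_name or length < 2:
            continue  # outside the search range, or too short for Equation 1
        # Count only occurrences lying entirely inside this element.
        freq = sum(1 for d, p in positions
                   if d == doc_id and start <= p and p + len(term) <= start + length)
        if freq:
            score = math.log2(freq + 1) / math.log2(length)
            results.append((doc_id, elem_no, freq, length, score))
    return sorted(results, key=lambda r: r[4], reverse=True)


occ_index = defaultdict(set)
len_index = defaultdict(dict)
register("d1", [("title", "search"),
                ("body", "full-text search finds each search term")],
         occ_index, len_index)
register("d2", [("title", "index"), ("body", "an index")], occ_index, len_index)
body_hits = search("search", "body", occ_index, len_index)
```

In this sketch the element length is read directly from the element length index, so neither the document text nor its hierarchical structure has to be analyzed again at the time of search.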