The present invention relates to a method of document registration and a method of document search for a document search system or a document management system using a computer system, or more in particular to a method and apparatus for registration and search of a mass of structured documents each having a logical structure, which is capable of searching specific document contents at high speed, and a portable medium used for them.
With the full-scale progress of the information society, the amount of computerized document information generated using word processors, personal computers and the like has increased more than ever before. Under these circumstances, demand is rising for quickly and accurately retrieving a document containing the required information from a vast accumulation of computerized documents.
A technique meeting this demand is the full-text search. In full-text search, the entire text in the document to be registered is loaded in a computer system and converted into a data base, and the data base is searched directly for a specified character string (hereinafter referred to as the query term). This requires no key word and basically makes possible a search free of detection failure.
On the other hand, a high-accuracy search can be realized by adding conditions on the logic structure to the query (hereinafter referred to as the structure-specified search). Such a search is intended for documents in which individual logic elements can be identified (hereinafter referred to as structured documents), for example documents described in SGML (C. F. Goldfarb: "THE SGML HANDBOOK", Oxford, 1993).
A search method permitting the structure-specified search is proposed in JP-A-8-147311 (hereinafter referred to as the well-known example 1). The well-known example 1 will be briefly described below.
In the method of structured document search according to the well-known example 1, a document is registered first as a text directly in a search data base.
Then, a specific character string (hereinafter referred to as the front marker for the well-known example 1) indicating the head of each logic structure of the registered text and a specific character string (hereinafter referred to as the rear marker for the well-known example 1) indicating the tail of each logic structure of the registered text are detected, thereby identifying the logic structure while at the same time segmenting the text by logic structure. In an electronically filed patent specification, for example, "<SDOABJ>" is detected as a front marker and "</SDO>" as a rear marker indicating the scope of the logic structure "abstract", whereby the text delimited by them is cut out as the text corresponding to the "abstract". A similar cut-out operation is performed for the other logic structures to segment the text by logic structure.
Then, the text corresponding to each logic structure is condensed to produce a condensed text. Specifically, for the "abstract", the text is segmented into substrings by word, and the inclusion relation between the segmented substrings is checked mutually. In the process, the character strings contained in other substrings are removed, thereby producing a condensed text of the "abstract". Similar processing is performed for the other logic structures to produce a condensed text for each logic structure, which is registered in the search data base as a condensed text file.
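The condensing step described above can be illustrated with a minimal sketch. It assumes whitespace-delimited words, which the well-known example 1 does not specify, and the function name `condense` is hypothetical.

```python
def condense(text):
    """Return the words of `text` that are not contained in another word.

    Assumption: words are separated by whitespace. Duplicate words are
    collapsed first, keeping the order of first occurrence.
    """
    words = list(dict.fromkeys(text.split()))
    return [
        w for w in words
        if not any(w != other and w in other for other in words)
    ]


condensed = condense("search searching text context text")
print(condensed)
```

Here "search" is dropped because it is contained in "searching", and "text" is dropped because it is contained in "context". A query for "search" still matches the condensed text, since the longer word retains it.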
Then, a bit corresponding to the character code of each character appearing in the text is set to "1" to generate a character component table, which is registered as a character component table file in the search data base.
After constructing a search data base in this way, the document search is conducted in the following manner for the well-known example 1.
First, a specified query term is decomposed by character, and the documents containing all the characters constituting the query term are extracted with reference to the character component table.
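As a rough illustration of this pre-filtering step, the following sketch represents each document's row of the character component table as a Python integer used as a bitmap. Using Unicode code points as the "character code", and the names `char_bits`, `build_table` and `candidates`, are assumptions made for this example only.

```python
def char_bits(text):
    """Bitmap with bit N set when the character with code point N occurs."""
    bits = 0
    for ch in set(text):
        bits |= 1 << ord(ch)
    return bits


def build_table(docs):
    """Character component table: document id -> bitmap of its characters."""
    return {doc_id: char_bits(text) for doc_id, text in docs.items()}


def candidates(table, query):
    """Ids of documents containing every character of the query term."""
    q = char_bits(query)
    return sorted(doc_id for doc_id, bits in table.items() if bits & q == q)


docs = {1: "structured document", 2: "full text search"}
table = build_table(docs)
print(candidates(table, "text"))
print(candidates(table, "cut"))
```

The second query shows why this is only a pre-filter: both documents contain the characters "c", "u" and "t", yet neither contains the string "cut", so the extracted documents still have to be checked against the text itself.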
Then, the condensed text file for the logic structure specified as an object of search is selected among the condensed text files corresponding to logic structures. At the same time, only the condensed text of a document extracted by the character component table search is searched, thereby extracting a document containing the query term specified in the specified logic structure. In the case where the positional relation between a plurality of query terms in the text is not specified in the specified query formula, the search process is terminated. In the case where such a positional relation is specified, on the other hand, the contents of the text corresponding to the document extracted as a result of condensed text search is read, and only those texts containing all the specified query terms and meeting the specified conditions for the positional relation between the query terms are extracted.
In this way, according to the method of the well-known example 1, a structure-specified search is made possible while maintaining a practical search speed for a large-scale text data base.
The prior art disclosed in the well-known example 1 described above makes a structure-specified search possible to some extent. Nevertheless, there are cases in which the structure-specified search of the well-known example 1 cannot perform a search meeting the intended structural conditions.
In the method of the well-known example 1, the structure of a registered document involved is segmented into several predetermined subelements, and a condensed text file is produced for each subelement. At the time of search, a mass of the condensed text files to be searched is determined by reference to a table defining the correspondence between the structure name of the subelement and the condensed text file, and only the condensed text files contained in the particular mass are searched thereby to realize a structure-specified search.
This method estimates a future search specifying the structural condition at the time of constructing a text data base, and segments the condensed text files in such a manner as to permit a search meeting such a condition. Therefore, the search specifying the structural condition not assumed at the time of data base construction is impossible to conduct.
Assume, for example, that a document is composed of two logic elements (hereinafter called the elements), "abstract" and "body", that the latter is composed of an arbitrary number of repeated "clauses", and that each clause in turn includes one "clause subject" and an arbitrary number of "paragraphs". In constructing a text data base from a set of documents having this structure, the condensed text files are segmented into those corresponding to "abstract" and those corresponding to "body". In that case, it is impossible to conduct a structure-specified search meeting the condition that "a set of documents containing a string XX in the clause subject is determined".
Of course, this condition can be met if, instead of making one condensed text file of the whole "body", the "body" is segmented further into "clause subject" and "paragraph" to produce condensed text files. Even when the files are configured this way, however, it is impossible to meet the structural condition that "a set of documents containing a string XX in the first clause (clause subject or paragraph) is determined" or that "a set of documents containing a string XX in the last paragraph of a clause is determined". For such a structural condition with a specified order to be met, it is necessary to prepare a condensed text file for each order of occurrence of a clause and a paragraph. In view of the fact that an arbitrary number of clauses and paragraphs can occur, however, the number of condensed text files would become enormous. In addition, the well-known example 1 lacks means for setting a correspondence between a structural condition containing an arbitrary specification of the order of occurrence and a mass of finely segmented condensed text files. In practice, therefore, a search meeting this condition is impossible.
As described above, in the prior art, the condition for the position of occurrence of the logic elements in a document cannot be included in the specification of the structural condition, and therefore a highly accurate structure-specified search cannot be executed.
An object of the present invention is to solve the above-mentioned problem of the prior art and to provide a function of conducting a highly accurate and efficient structure-specified search.
Further, the prior art described above can realize only the structure-specified search for a set of documents having a predetermined structure.
Specifically, a structured document such as an SGML document has a structure predetermined by the DTD (document type definition). In the case where a structure-specified search is conducted for a set of documents conforming to a specified document type definition, therefore, a document is segmented structurally so as to meet all the conditions for structure specification that can occur, thus making a structure-specified search possible.
Nevertheless, there is more than one document type definition. A thesis and a report, for example, have different document type definitions. In this way, structured documents have various document structures depending on the purpose of the document, and a document type definition is produced for each particular document structure.
These documents are grouped and registered by document type definition, so that the structure-specified search becomes possible for each group. An attempt to realize a search specifying a common structure that can occur for all the groups, however, cannot be achieved unless the structure-specified search is conducted independently for each group and the result is integrated.
On the other hand, standardization of a structured document format that does not necessarily require a specific structure, namely XML (Extensible Markup Language), is going on at the W3C (World Wide Web Consortium). The probable trend is toward a situation in which documents having a document structure conforming to a specific DTD, as in SGML, are not the only object of search.
Further, according to the prior art described above, even structures having the same meaning (type), like "title" and "subject", are regarded as different structures when the element type name is different. In a structure-specified search for "a document containing 'SGML' in 'title'", for example, a document meeting the condition "a document containing 'SGML' in 'subject'" cannot be produced as the search result.
Especially when a document type definition is different, different element type names may be attached to the same type of structure for each document type definition.
Assume that a structure-specified search is to be conducted for "title", for example. Unless the user specifies all the element type names meaning "title" that occur in each document type definition, such as "title", "subject", "name" and "TITLE", and prepares a query specifying the structure accordingly, not all the required documents can be acquired. Also, unless all the document type definitions of the registered documents are known, the element type names determined by the user cannot cover all the structures meaning "title". A document conforming to a document type definition in which a title is described in the structure "T", for example, can never be acquired by a structure-specified search by a user who does not know that rule.
Another object of the present invention is to solve the problems mentioned above and to provide a function of highly accurately and efficiently conducting structure-specified search on a set of documents having different document structures coexisting therein.
Further, assume that a condition for the structure-specified search is set as "a document containing the word 'SGML' in the title of any item including a chapter, a clause, etc.". It is then necessary to search all the structures meeting the structural condition "title", which leads to a reduced search efficiency.
If all the elements from the base document element down to the title are specified sequentially as a query, such as "/document/chapter/title", a structure can be specified efficiently. This requires the user, however, to prepare a structure-specified search condition indicating all the structures, like "/document/chapter/title" or "/document/chapter/clause/title" and so forth, and thus increases the load on the user. In addition, unless the user grasps all the structures of the documents to be searched, a complete search may be impossible.
Still another object of the invention is to solve the problems mentioned above and to provide a function of efficiently realizing a search specifying the same type of structure occurring in a plurality of hierarchical levels without specifying a complicated structural condition.
In order to solve the problems mentioned above, according to the present invention, there is provided a document registration and search method comprising the following steps.
Specifically, a document registration method according to this invention includes the steps of:
(1) analyzing the logic structure of a document to be registered, generating analyzed document data, and registering the analyzed document data in a document data base;
(2) superposing the logic structures of the documents to be registered, sequentially in the order of registration, causing a single meta element to represent a set of elements having the same position of occurrence in the document and the same type, and causing a single meta string data to represent a set of string data having the same position of occurrence in the document, thereby generating a structure index composed of a structure tree of a set of meta elements and a set of meta string data (hereinafter collectively referred to as the meta-nodes), and attaching to all the meta-nodes constituting the structure index a context identifier for uniquely identifying them in the structure index;
(3) generating structured full-text data composed of the definition of the correspondence between all the string data contained in the analyzed document data corresponding to each document to be registered on the one hand and the context identifier of the meta string data representing the string data in the structure index on the other hand; and
(4) extracting, from the structured full-text data corresponding to each document to be registered, a predetermined substring, character position information of the substring in the document to be registered, a document identifier for uniquely identifying the document to be registered in the document data base, and a context identifier of the meta string data representing the string data containing the substring in the structure index; generating structured character position information including the character position information, the document identifier and the context identifier; and registering the correspondence between the substring and the structured character position information, thereby updating the string index.
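The registration steps (1) to (4) can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the text: an analyzed document is modeled as nested `(element type name, children)` tuples, the "position of occurrence" is taken to be the ordinal among siblings of the same type, context identifiers are sequential integers, and the predetermined substrings are fixed-length n-grams. All names are hypothetical.

```python
from collections import defaultdict


class StructureIndex:
    """Superposed structure tree; each meta-node gets one context id."""

    def __init__(self):
        self.ids = {}    # (parent context id, type name, ordinal) -> context id
        self.paths = {}  # context id -> readable path of the meta-node

    def meta_node(self, parent, type_name, ordinal):
        key = (parent, type_name, ordinal)
        if key not in self.ids:  # first document with this node creates it
            ctx = len(self.ids) + 1
            self.ids[key] = ctx
            self.paths[ctx] = f"{self.paths.get(parent, '')}/{type_name}[{ordinal}]"
        return self.ids[key]


def walk(node, parent, ordinal, sidx, out):
    """Map every string of the document to the context id of its meta string data."""
    type_name, children = node
    ctx = sidx.meta_node(parent, type_name, ordinal)
    counts = defaultdict(int)
    for child in children:
        if isinstance(child, str):
            counts["#text"] += 1
            out.append((sidx.meta_node(ctx, "#text", counts["#text"]), child))
        else:
            counts[child[0]] += 1
            walk(child, ctx, counts[child[0]], sidx, out)


def register(doc_id, root, sidx, string_index, n=2):
    full_text = []  # structured full-text data: (context id, string)
    walk(root, 0, 1, sidx, full_text)
    offset = 0      # character position counted over the whole document
    for ctx, text in full_text:
        for i in range(len(text) - n + 1):
            string_index[text[i:i + n]].append((doc_id, ctx, offset + i))
        offset += len(text)
    return full_text


sidx = StructureIndex()
string_index = defaultdict(list)
register(1, ("doc", [("title", ["SGML"]), ("body", ["text"])]), sidx, string_index)
register(2, ("doc", [("title", ["XML"]), ("body", ["more"])]), sidx, string_index)
print(sidx.paths)
print(string_index["ML"])
```

Because both documents have the same structure, the second registration adds no new meta-nodes, and the entries for "ML" in the two titles carry the same context identifier.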
Also, in a document search method according to this invention, the process for searching a registered document includes the steps of:
(1) determining a mass of context identifiers meeting a specified structural condition with reference to the structure index;
(2) extracting a predetermined substring from a query term, and extracting a mass of structured character position information corresponding to the substring with reference to the string index; and
(3) extracting from the mass of the structured character position information the structured character position information having a context identifier contained in the mass determined in the structural condition determining step and having the same positional relation as the arrangement of the substring on the query term.
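The three search steps can be sketched as below, using small hand-built indexes so that the example stands on its own. The path notation, the bigram substrings, the posting layout `(document id, context id, character position)` and the rule that a structural condition is a single element type name matching any step of a path are all assumptions made for illustration.

```python
# Structure index: context id -> path of the meta-node (hypothetical data).
paths = {
    1: "/doc[1]",
    2: "/doc[1]/title[1]",
    3: "/doc[1]/title[1]/#text[1]",
    4: "/doc[1]/body[1]",
    5: "/doc[1]/body[1]/#text[1]",
}

# String index: bigram -> structured character position information.
# Document 1 has "SGML" in its title; document 2 has it in its body.
string_index = {
    "SG": [(1, 3, 0), (2, 5, 6)],
    "GM": [(1, 3, 1), (2, 5, 7)],
    "ML": [(1, 3, 2), (2, 3, 1), (2, 5, 8)],
}


def contexts_for(element_type, paths):
    """Step (1): context ids whose path passes through `element_type`."""
    return {
        ctx for ctx, path in paths.items()
        if any(step.split("[")[0] == element_type
               for step in path.strip("/").split("/"))
    }


def search(element_type, query, paths, string_index, n=2):
    allowed = contexts_for(element_type, paths)
    # Step (2): substrings of the query term and their offsets in it.
    grams = [(query[i:i + n], i) for i in range(len(query) - n + 1)]
    if not grams:
        return []
    postings = {g: set(string_index.get(g, [])) for g, _ in grams}
    # Step (3): keep positions inside an allowed context where every
    # substring occurs at the same relative offset as in the query term.
    hits = set()
    for doc, ctx, pos in postings[grams[0][0]]:
        if ctx in allowed and all((doc, ctx, pos + k) in postings[g]
                                  for g, k in grams):
            hits.add(doc)
    return sorted(hits)


print(search("title", "SGML", paths, string_index))
print(search("body", "SGML", paths, string_index))
```

Restricting the context identifiers before comparing positions means the structural condition is evaluated once against the structure index rather than once per document.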
Further, in a document search method according to the invention, the process for collectively registering documents having a plurality of document structures includes the steps of:
(1) acquiring the type of a particular structure from the element type name with reference to a type definition table describing the correspondence between the name and the type of the structure that can occur in a plurality of structures in the structure index;
(2) acquiring a structure index having the base document element of the same type as the base document element of the document; and
(3) providing a parent node (root meta node) for collecting the structure indexes at the root of the structure index of the documents having a plurality of document structures at the time of registering the structured documents, thereby collecting a plurality of structure indexes into a single meta structure index.
Alternatively, the process for collectively registering documents having a plurality of document structures includes the steps of:
(1) acquiring the type of a particular structure from the element type name with reference to a type definition table describing the correspondence between the name and the type of each structure that can occur in a plurality of structures in a structure index; and
(2) adding a provisional base document element shared by all the documents to the analyzed document data obtained by analyzing the structure of a registered document.
The type definition table is prepared beforehand, manually or automatically by assigning synonyms to the same type using a thesaurus or the like.
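A type definition table and a root meta node can be illustrated with the following sketch. The table entries reuse the element type names mentioned earlier ("title", "subject", "name", "TITLE", "T"); the nested-tuple representation of a structure index, the "#root" name and the function names are assumptions for this example.

```python
# Type definition table: element type name -> type (hypothetical contents).
TYPE_TABLE = {
    "title": "title",
    "subject": "title",
    "name": "title",
    "TITLE": "title",
    "T": "title",
}


def element_type(name):
    """Type of a structure; names missing from the table are their own type."""
    return TYPE_TABLE.get(name, name)


def paths_of_type(node, wanted, prefix=""):
    """Paths of all meta elements whose type is `wanted`."""
    name, children = node
    path = f"{prefix}/{name}"
    found = [path] if element_type(name) == wanted else []
    for child in children:
        found += paths_of_type(child, wanted, path)
    return found


# Two structure indexes for different document type definitions,
# collected under a single root meta node.
thesis = ("thesis", [("title", []), ("body", [])])
report = ("report", [("subject", []), ("body", [])])
meta_index = ("#root", [thesis, report])

print(paths_of_type(meta_index, "title"))
```

With the table in place, a single condition on the type "title" reaches both the "title" element of the thesis and the "subject" element of the report, without the user knowing either document type definition.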
Further, in a document search method according to this invention, in order to efficiently realize the structure-specified search specifying the elements of the same type occurring at many positions in the structure index, a document registration program includes the step of:
(1) generating an alias structure index together with a structure index at the time of document registration.
The alias structure index is a structure index prepared so that information capable of being set for each document structure, such as the date of preparation and the date of updating, can be searched collectively without tracing the structure index. A structure-specified search conducted by specifying the type acquired from the alias structure index enables the plurality of elements in the structure index corresponding to an alias to be acquired collectively from the alias structure index, and therefore the search can be realized more efficiently than when the context identifier of a specified element is acquired by tracing the structure index.
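One way to picture the alias structure index is as a direct mapping from a type to every context identifier of that type, built alongside the structure index. The sketch below assumes that layout, a simple slash-separated path notation and the hypothetical name `build_alias_index`.

```python
from collections import defaultdict

# Structure index: context id -> path of the meta element (hypothetical data).
paths = {
    1: "/document",
    2: "/document/title",
    3: "/document/chapter",
    4: "/document/chapter/title",
    5: "/document/chapter/clause",
    6: "/document/chapter/clause/title",
}


def build_alias_index(paths):
    """Alias structure index: type -> context ids of all elements of that type."""
    alias = defaultdict(list)
    for ctx, path in sorted(paths.items()):
        alias[path.rsplit("/", 1)[-1]].append(ctx)
    return dict(alias)


alias_index = build_alias_index(paths)
print(alias_index["title"])
```

A condition such as "the title of any item" is then resolved with one lookup that yields the context identifiers at every hierarchical level, instead of a traversal of the whole structure index or an enumeration of each path by the user.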