1. Field of the Invention
The present invention relates generally to text cataloging and text searching in a text control system utilizing a computer. More particularly, the present invention relates to a method for cataloging a structured text in a set of structured texts, each of which has a logical structure, and a method for searching such a set of structured texts for specific text content at a high speed. The invention also relates to a portable medium used in the text cataloging and text searching methods.
2. Description of the Related Art
With the information society developing in earnest, the amount of electronically prepared text-based information created with apparatus such as word processors and personal computers is increasing at an extraordinarily high pace. Under these circumstances, there is a rising demand to search massive accumulated collections of electronically prepared texts for desired information with a high degree of reliability.
In response to this demand, technology for full-text searching has been developed in which full texts are cataloged in a computer system and treated as a database. Because the database is searched directly for a specified string of characters (referred to hereafter as a “search term”), no keyword needs to be assigned in advance, which in principle allows a search operation to be carried out with no missed detections.
A text comprising logical structure elements that can be individually recognized can be treated as an object to be searched in a search operation. Such a text is referred to hereafter as a “structured text”. An example of a structured text is a text described in SGML (Standard Generalized Markup Language (ISO 8879:1986)). In such a search operation, a condition regarding a logical structure is added to a list of search conditions, allowing a search operation with highly detailed search conditions to be carried out.
An example of a search system implementing a search operation specifying a structure condition is disclosed in Japanese Patent Laid-Open No. Hei 8-147311 (JP '311). In this structured-text searching method, when a text is cataloged, the original of the text is cataloged in a search database. Then, specific character strings representing the head and the end of each logical structure of the cataloged text original are detected to identify logical structures. At the same time, the text is divided into logical structures. The specific character strings representing the head and the end of each logical structure are referred to hereafter as a “front marker” and a “back marker”, respectively.
In the case of an electronically prepared specification for a patent application, for example, the front and back markers detected as delimiters of the range of a logical structure called “Abstract of the Disclosure” are “<SDO ABJ>” and “</SDO>”, respectively. The front and back markers are detected to cut out a text delimited thereby as a text of the logical structure. Other logical structures are cut out in the same way in order to divide the original text into logical structures.
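The marker-based cutting described above can be sketched as follows. This is a minimal illustration only, not the implementation of JP '311: the function name `cut_structure`, the sample text, and the second marker “<SDO XYZ>” are hypothetical, and nested structures are assumed not to occur.

```python
def cut_structure(text: str, front: str, back: str) -> list[str]:
    """Return every span delimited by a front marker and the next back marker."""
    spans = []
    pos = 0
    while True:
        start = text.find(front, pos)
        if start == -1:
            break                      # no further front marker
        start += len(front)
        end = text.find(back, start)
        if end == -1:
            break                      # unterminated structure: stop rather than guess
        spans.append(text[start:end])
        pos = end + len(back)
    return spans


# Hypothetical original text; "<SDO XYZ>" stands in for some other logical structure.
original = ("<SDO ABJ>A method for searching structured texts.</SDO>"
            "<SDO XYZ>Other content.</SDO>")
abstract = cut_structure(original, "<SDO ABJ>", "</SDO>")
```

Calling the same function once per front marker divides the original text into its logical structures.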
Next, a condensed-text creating process is carried out on the original text of each of the resulting logical structures. In the case of the logical structure “Abstract of the Disclosure”, for example, the original text is divided into phrase character strings, each of which comprises word units, and a mutual-inclusion relation among the phrase character strings is examined. Then, by eliminating a string of characters included in another phrase character string, a condensed text of the logical structure can be produced. By carrying out the same condensed-text creating process on other logical structures, a condensed text can be formed for each of the other logical structures. The condensed texts are then cataloged in a search database as a condensed-text file.
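The elimination of included phrase character strings can be illustrated with a short sketch. It assumes that the division into phrase character strings has already been performed; the function name `condense` and the sample phrases are hypothetical and are not taken from JP '311.

```python
def condense(phrases: list[str]) -> list[str]:
    """Drop every phrase that is contained in another phrase of the same structure."""
    unique = list(dict.fromkeys(phrases))  # remove exact duplicates, keep first-seen order
    return [p for p in unique
            if not any(p != q and p in q for q in unique)]


# Hypothetical phrase character strings cut from one logical structure.
phrases = ["text search", "search", "full text search", "text search", "index"]
condensed = condense(phrases)
```

Any search term found in an eliminated phrase is also found in the phrase that contains it, so the condensed text is smaller but yields the same matches.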
Then, the binary value “1” is set in a bit associated with the code of each character appearing in the text in order to create a character component table, which is also cataloged in the search database as a character component table file.
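As a rough illustration of such a character component table, the sketch below keeps one bit per character code for each text, using a Python integer as the bit array. The layout of the table file in JP '311 is not given here, so the dictionary keyed by text identifier, the name `component_bits`, and the sample texts are all assumptions.

```python
def component_bits(text: str) -> int:
    """Return an integer whose bit k is 1 if the character with code k appears in text."""
    bits = 0
    for ch in text:
        bits |= 1 << ord(ch)
    return bits


# Hypothetical cataloged texts, keyed by hypothetical text identifiers.
texts = {"T1": "full text search", "T2": "patent abstract"}
component_table = {text_id: component_bits(body) for text_id, body in texts.items()}

# Testing one bit tells whether a character occurs anywhere in a text.
t1_has_x = (component_table["T1"] >> ord("x")) & 1 == 1
t2_has_x = (component_table["T2"] >> ord("x")) & 1 == 1
```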
After the search database has been constructed in this way, text search processing is carried out as follows.
First, a specified search term is disassembled into character units. A text including all characters composing the search term is then extracted by referencing the character component table.
Then, from among the condensed-text files provided for the logical structures, the condensed-text file corresponding to the logical structure specified as a search object is selected. Within that file, only the condensed texts of the texts extracted by the search of the character component table are treated as search objects. As a result, a text including the specified search term in the specified logical structure can be extracted. If no positional relation in the text among a plurality of search terms is prescribed in a specified search condition equation, the search processing is ended. If such a positional relation is specified, on the other hand, the contents of the sentences included in each text extracted as a result of the search of the condensed text are read. The extracted text is confirmed as the desired text only if all the specified search terms are found in it and, at the same time, the positional relation among the search terms satisfies the specified search condition equation.
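The two-stage search described above can be sketched as follows, under simplifying assumptions: the character component table is a dictionary of per-text bit sets, each condensed-text file is a dictionary from text identifier to phrase list, and the final check of positional relations among several search terms is omitted. All names and sample data are hypothetical.

```python
def char_bits(s: str) -> int:
    """Bit set with one bit per character code occurring in s."""
    bits = 0
    for ch in s:
        bits |= 1 << ord(ch)
    return bits


def two_stage_search(term, structure, component_table, condensed_files):
    """Return (candidates, hits) for one search term in one logical structure."""
    need = char_bits(term)
    # Stage 1: keep texts that contain every character of the search term.
    candidates = [tid for tid, bits in component_table.items()
                  if (bits & need) == need]
    # Stage 2: confirm the term in the condensed text of the specified structure.
    condensed = condensed_files[structure]
    hits = [tid for tid in candidates
            if any(term in phrase for phrase in condensed.get(tid, []))]
    return candidates, hits


# Hypothetical database with a single logical structure named "abstract".
abstracts = {"T1": "a cat sat", "T2": "act at once", "T3": "dogs bark"}
component_table = {tid: char_bits(body) for tid, body in abstracts.items()}
condensed_files = {"abstract": {tid: [body] for tid, body in abstracts.items()}}

candidates, hits = two_stage_search("cat", "abstract", component_table, condensed_files)
```

Text T2 passes the first stage because it contains the characters “c”, “a” and “t”, but it is rejected in the second stage because they never form the string “cat”. The character component table thus serves only as a cheap pre-filter.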
As described above, the search method according to JP '311 allows a practical speed for a search operation to be maintained for a large-scale database and, at the same time, allows a search operation specifying a structure condition to be carried out.
According to the technology described in JP '311, a search operation specifying a certain structure condition can be carried out. With this structure specifying technique, however, a search operation satisfying a subtly specified structure condition cannot be carried out in some cases.
In the text cataloging/searching system provided by JP '311, the structure of a text to be cataloged is divided into sub-structures determined in advance, and a condensed-text file is created for each sub-structure. In a search operation, a file defining a relation associating the names of sub-structures and the names of condensed-text files is referenced to determine a set of condensed-text files to be searched. A search operation specifying a structure condition is then implemented by carrying out the search operation with only condensed-text files in the set treated as a search object.
In this text cataloging/searching system, at the stage of constructing a text database, the designer of the database predicts the structure conditions that are likely to be specified in future search operations. Then, a text is divided into condensed-text files that allow search operations to be carried out in conformity with the predicted structure conditions. In consequence, however, a search operation that satisfies a structure condition which was not predicted when the database was constructed cannot be carried out.
For example, assume that a text is divided into two logical elements, each of which is referred to hereafter simply as an “element”. Let the two elements be called “abstract” and “main body”, respectively. Suppose further that the “main body” element is divided into an arbitrary number of paragraphs, each composed of the title of the paragraph and an arbitrary number of sections. If only two condensed-text files, one for the “abstract” element and one for the “main body” element, are created and cataloged when a text database containing a set of texts organized into such a structure is constructed, a search operation satisfying a structure condition stating: “Find a group of sentences in the title of a paragraph that includes a string of characters OO” cannot be carried out.
Instead of treating the “main body” element as a single condensed-text file, the title of each paragraph and the sections composing the element can each be treated as a condensed-text file, allowing a search operation satisfying the structure condition described above to be carried out. Even if such condensed-text files are provided, however, a search operation cannot handle structure conditions such as ones stating: “Find a group of sentences including a string of characters OO inside the first paragraph (which can be either the title of the first paragraph or a section in the first paragraph),” or “Find a group of sentences including a string of characters XX in the last section of a paragraph.” In order to handle a structure condition that specifies the position at which a search term appears, a condensed-text file would have to be provided separately in advance for each position of appearance of each paragraph and each section. This approach fails for two reasons. First, the number of condensed-text files provided for paragraphs and sections becomes extremely large, because such paragraphs and sections can appear in an element in any arbitrary manner. Second, a search operation satisfying such a condition cannot actually be carried out, because the method described in JP '311 provides no means for associating a structure condition that includes an arbitrary specification of a position of appearance of a search term with the set of small condensed-text files resulting from finely disassembling each element.
It is thus impossible to include an order of appearance condition in the specification of a structure condition as described above, so that a search operation with a very detailed structure specification cannot be carried out.
It is thus an object of the present invention to solve the problems described above by providing a function for efficiently carrying out a search operation that specifies a detailed structure condition.
In order to solve the problems described above, the present invention provides a text cataloging method that comprises:
(1) an already-analyzed-text data generating/cataloging step of cataloging already-analyzed-text data, which is obtained from an analysis of a logical structure of a text to be cataloged, in a text database;
(2) a structure-index creating step of creating a structure index by sequentially superposing logical structures of texts to be cataloged, one upon another, in the structure index in the same order as the chronological order in which the texts are cataloged, wherein a single metaelement is used for representing a group of elements in the texts having the same position of appearance in one of the texts and the same element type, a single piece of meta-character-string data is used for representing a group of pieces of character-string data in the texts having the same position of appearance in one of the texts, and a context identifier is assigned to each metanode composing a tree-like structure of the structure index for uniquely identifying the metanode, where “metanode” is a generic name for a metaelement and meta-character-string data;
(3) a structured-full-text-data generating step of generating structured-full-text data composed of definitions of associative relations between all pieces of character-string data included in already-analyzed-text data of each text to be cataloged, and context identifiers of pieces of meta-character-string data in the structure index used for representing the pieces of character-string data;
(4) a character-string-index updating step comprising the sub-steps of:
extracting partial character strings each having a predetermined character count, character-position information of the partial character strings in a text to be cataloged, a text identifier for uniquely identifying the text in a text database, and a context identifier of meta-character-string data representing character-string data including the partial character strings in a structure index from the character-string data included in each text to be cataloged;
generating structured-character-position information comprising the character-position information, the text identifier and the context identifier; and
updating a character-string index by cataloging an associative relation between each of the partial character strings and the structured-character-position information in the character-string index.
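Steps (2) to (4) of the cataloging method can be illustrated by the following sketch, which is an assumption-laden reading of the steps rather than the claimed implementation. Already-analyzed-text data is assumed to be a nested tuple `(element_type, children)`; “position of appearance” is assumed to mean the ordinal of a node among siblings of the same type; the predetermined character count is taken as 2; and the names `StructureIndex`, `catalog`, and `TEXT`, as well as the sample texts, are hypothetical.

```python
from collections import defaultdict
from itertools import count

TEXT = "#text"  # pseudo element type used here for meta-character-string data


class StructureIndex:
    """Tree of metanodes; each metanode receives a unique context identifier."""

    def __init__(self):
        self.context_ids = {}      # (parent context id, type, ordinal) -> context id
        self._next_id = count(1)

    def metanode(self, parent, node_type, ordinal):
        key = (parent, node_type, ordinal)
        if key not in self.context_ids:       # first text with a node at this position
            self.context_ids[key] = next(self._next_id)
        return self.context_ids[key]          # later texts are superposed on it


def catalog(text_id, root, structure_index, char_index, n=2):
    """Catalog one analyzed text; return structured-full-text data as
    (context id, character string) pairs."""
    full_text = []
    offset = 0  # character position within the text's concatenated string data

    def walk(node, parent, ordinal):
        nonlocal offset
        node_type, children = node
        ctx = structure_index.metanode(parent, node_type, ordinal)
        seen = defaultdict(int)  # ordinal of each child among same-type siblings
        for child in children:
            child_type = TEXT if isinstance(child, str) else child[0]
            seen[child_type] += 1
            if isinstance(child, str):
                text_ctx = structure_index.metanode(ctx, TEXT, seen[TEXT])
                full_text.append((text_ctx, child))
                for i in range(len(child) - n + 1):
                    # structured-character-position information for one partial string
                    char_index[child[i:i + n]].append((text_id, text_ctx, offset + i))
                offset += len(child)
            else:
                walk(child, ctx, seen[child_type])

    walk(root, 0, 1)  # 0 is used as the parent of the root metanode
    return full_text


structure_index = StructureIndex()
char_index = defaultdict(list)

doc1 = ("doc", [("abstract", ["fast search"]),
                ("body", [("para", [("title", ["index"]),
                                    ("section", ["tree"])])])])
doc2 = ("doc", [("abstract", ["text search"]),
                ("body", [("para", [("title", ["search"]),
                                    ("section", ["a"]),
                                    ("section", ["b"])])])])

full1 = catalog("T1", doc1, structure_index, char_index)
full2 = catalog("T2", doc2, structure_index, char_index)
```

In this example the abstracts of both texts share one context identifier, while the second section of the paragraph in the second text creates a new metanode. The structure index therefore grows only when a text introduces a position that no earlier text contained.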
In addition, the structured-text searching method provided by the present invention comprises:
(1) a structure-condition judging step of searching a structure index for a set of context identifiers satisfying a specified structure condition;
(2) a structured-character-position-information extracting step of extracting partial character strings, each of which has a predetermined character count, from a search term, and searching a character-string index for a set of pieces of structured-character-position information matching the partial character strings; and
(3) an index searching step of searching the set of pieces of structured-character-position information for specific pieces of structured-character-position information that have context identifiers included in the set of context identifiers found at the structure-condition judging step, and that have a positional relation among the specific pieces of structured-character-position information matching the arrangement order of the partial character strings in the search term.
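The three searching steps can be sketched as follows. The sketch is self-contained and builds a small structure index and character-string index by hand; the representation of each context identifier as a path of (element type, ordinal) pairs, the predicate form of the structure condition, the character count of 2, and all names and sample data are assumptions made for illustration.

```python
from collections import defaultdict


def search(term, condition, structure_index, char_index, n=2):
    """Return (text id, context id, position) of each match of term that
    lies in a context satisfying the structure condition."""
    # (1) structure-condition judging step
    allowed = {ctx for ctx, path in structure_index.items() if condition(path)}
    # (2) structured-character-position-information extracting step
    grams = [term[i:i + n] for i in range(len(term) - n + 1)]
    if not grams:
        return set()
    postings = [set(char_index.get(g, ())) for g in grams]
    # (3) index searching step: same text, allowed context, consecutive positions
    return {(tid, ctx, pos) for tid, ctx, pos in postings[0]
            if ctx in allowed
            and all((tid, ctx, pos + i) in postings[i] for i in range(1, len(grams)))}


# Hypothetical structure index: context identifier -> path from the root.
structure_index = {
    1: (("abstract", 1), ("#text", 1)),
    2: (("body", 1), ("para", 1), ("title", 1), ("#text", 1)),
    3: (("body", 1), ("para", 1), ("section", 1), ("#text", 1)),
    4: (("body", 1), ("para", 2), ("title", 1), ("#text", 1)),
}

# Hypothetical cataloged strings: (text id, context id, start position, string).
strings = [("T1", 1, 0, "fast search"), ("T1", 2, 11, "search index"),
           ("T2", 3, 0, "text search"), ("T2", 4, 11, "search")]
char_index = defaultdict(list)
for tid, ctx, start, s in strings:
    for i in range(len(s) - 1):
        char_index[s[i:i + 2]].append((tid, ctx, start + i))

# "Inside the first paragraph (either its title or a section)".
in_first_para = search("search", lambda path: ("para", 1) in path,
                       structure_index, char_index)
# "In the title of any paragraph".
in_any_title = search("search", lambda path: path[-2][0] == "title",
                      structure_index, char_index)
```

Because the structure condition is evaluated against the structure index at search time, a condition that was not anticipated when the database was built, such as one on the order of appearance of a paragraph, needs no additional files.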