1. Field of the Invention
The present invention relates generally to text cataloging and text searching in a text control system utilizing a computer. More particularly, the present invention relates to a method for cataloging a structured text in a set of structured texts, each of which has a logical structure, and a method for searching such a set of structured texts for specific text content at a high speed. The invention also relates to a portable medium used in the text cataloging and text searching methods.
2. Description of the Related Art
As the information society develops, the amount of electronically prepared text-based information created with apparatus such as word processors and personal computers is increasing at an extraordinarily high pace. Under these circumstances, demand is rising for the ability to search a massive accumulated collection of electronically prepared texts for desired information with a high degree of reliability.
In response to this demand, technology for full-text searching has been developed in which full texts are cataloged in a computer system and treated as a database. Since the database is searched directly for a specified string of characters (which is referred to hereafter as a "search term"), no keyword needs to be assigned in advance, and a search operation can in principle be carried out with no detection miss.
A text comprising logical structure elements that can be individually recognized can be treated as an object to be searched in a search operation. Such a text is referred to hereafter as a "structured text". An example of a structured text is a text described in SGML (Standard Generalized Markup Language (ISO 8879:1986)). In such a search operation, a condition regarding a logical structure is added to a list of search conditions, allowing a search operation with highly detailed search conditions to be carried out.
An example of a search system implementing a search operation specifying a structure condition is disclosed in Japanese Patent Laid-Open No. Hei 8-147311 (JP '311). In this structured-text searching method, when a text is cataloged, the original of the text is cataloged in a search database. Then, specific character strings representing the head and the end of each logical structure of the cataloged text original are detected to identify logical structures. At the same time, the text is divided into logical structures. The specific character strings representing the head and the end of each logical structure are referred to hereafter as a "front marker" and a "back marker", respectively.
In the case of an electronically prepared specification for a patent application, for example, the front and back markers detected as delimiters of the range of a logical structure called "Abstract of the Disclosure" are "<SDO ABJ>" and "</SDO>", respectively. The front and back markers are detected to cut out the text delimited thereby as the text of the logical structure. Other logical structures are cut out in the same way in order to divide the original text into logical structures.
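The cutting-out step can be illustrated with a minimal sketch. It assumes that markers do not nest and that a simple non-greedy match between a front marker and the next back marker is sufficient; the function name `cut_structure` and the second marker `<SDO XXX>` are hypothetical and are not taken from JP '311.

```python
import re

def cut_structure(original: str, front: str, back: str) -> list[str]:
    """Return every span of text delimited by the given front and back markers."""
    # Non-greedy match so that each front marker pairs with the nearest back marker.
    pattern = re.escape(front) + r"(.*?)" + re.escape(back)
    return [span.strip() for span in re.findall(pattern, original, flags=re.DOTALL)]

# "<SDO XXX>" is a made-up marker standing in for some other logical structure.
sample = (
    "<SDO ABJ>A method for cataloging text.</SDO>"
    "<SDO XXX>Text of another logical structure.</SDO>"
)
abstract = cut_structure(sample, "<SDO ABJ>", "</SDO>")
```

Calling the same function once per front marker divides the original text into its logical structures.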
Next, a condensed-text creating process is carried out on the original text of each of the resulting logical structures. In the case of the logical structure "Abstract of the Disclosure", for example, the original text is divided into phrase character strings, each of which comprises word units, and a mutual-inclusion relation among the phrase character strings is examined. Then, by eliminating a string of characters included in another phrase character string, a condensed text of the logical structure can be produced. By carrying out the same condensed-text creating process on other logical structures, a condensed text can be formed for each of the other logical structures. The condensed texts are then cataloged in a search database as a condensed-text file.
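The condensing step can be sketched as follows. The sketch assumes, purely for illustration, that phrase character strings are obtained by splitting on punctuation and that inclusion is tested as a plain substring relation; the actual segmentation rules of JP '311 are not given here, and the name `condense` is hypothetical.

```python
import re

def condense(text: str) -> list[str]:
    """Drop every phrase that is contained in another phrase of the same text."""
    # Assumption: phrases are delimited by simple punctuation marks.
    phrases = [p.strip() for p in re.split(r"[.,;:]", text) if p.strip()]
    # Remove exact duplicates while keeping first-appearance order.
    unique = list(dict.fromkeys(phrases))
    # Keep a phrase only if no other, different phrase includes it.
    return [p for p in unique if not any(p != q and p in q for q in unique)]

condensed = condense("text search, full text search method, text search. index")
```

Here "text search" is eliminated because it is included in "full text search method", so any search term found in the shorter phrase can still be found in the condensed text.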
Then, the binary value "1" is set in a bit associated with the code of each character appearing in the text in order to create a character component table, which is also cataloged in the search database as a character component table file.
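A character component table entry can be sketched as a bitmap with one bit per character code. The sketch below uses a Python integer as the bitmap and the Unicode code point as the character code; both choices, and the name `char_component_bits`, are assumptions made for illustration rather than details of JP '311.

```python
def char_component_bits(text: str) -> int:
    """Return a bitmap in which bit n is 1 if the character with code n appears in the text."""
    bits = 0
    for ch in set(text):
        bits |= 1 << ord(ch)  # set the bit associated with this character code
    return bits

entry = char_component_bits("abca")
```

The entry records only which characters occur, not how often or in what order.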
After the search database has been constructed in this way, text search processing is carried out as follows.
First, a specified search term is disassembled into character units. A text including all characters composing the search term is then extracted by referencing the character component table.
Then, a condensed-text file to be searched, which contains a logical structure specified as a search object, is selected among condensed-text files containing logical structures. By searching the character component table therein, only a condensed text of a text extracted by the operation to search the character component table can be selected as a search object. As a result, a text including the specified search term included in a specified logical structure can be extracted. If no positional relation in the text among a plurality of search terms is prescribed in a specified search condition equation, the search processing is ended. If such a positional relation is specified, on the other hand, the contents of sentences included in a text extracted as a result of the search of the condensed text are read. Only if all the specified search terms are found in the extracted text and, at the same time, the positional relation among the search terms satisfies the specified search condition equation, is the extracted text confirmed as the desired text.
As described above, the search method according to JP '311 allows a practical speed for a search operation to be maintained for a large-scale database and, at the same time, allows a search operation specifying a structure condition to be carried out.
According to the technology described in JP '311, a search operation specifying a certain structure condition can be carried out. With this structure specifying technique, however, a search operation satisfying a finely specified structure condition cannot be carried out in some cases.
In the text cataloging/searching system provided by JP '311, the structure of a text to be cataloged is divided into sub-structures determined in advance, and a condensed-text file is created for each sub-structure. In a search operation, a file defining a relation associating the names of sub-structures and the names of condensed-text files is referenced to determine a set of condensed-text files to be searched. A search operation specifying a structure condition is then implemented by carrying out the search operation with only condensed-text files in the set treated as a search object.
In this text cataloging/searching system, at the stage of constructing a text database, the designer of the database predicts the structure conditions that are likely to be specified in future search operations. Then, a text is divided into condensed-text files that allow search operations to be carried out in conformity with the predicted structure conditions. In consequence, however, a search operation that satisfies a structure condition which was not predicted when the database was constructed cannot be carried out.
For example, assume that a text is divided into two logical elements, each of which is referred to hereafter simply as an "element". Let the two elements be called "abstract" and "main body", respectively. Assume further that the "main body" element is divided into an arbitrary number of paragraphs, each of which is composed of the title of the paragraph and an arbitrary number of sections. If, in the process of constructing a text database containing a set of texts organized into such a structure, only two condensed-text files are created and cataloged, one for the "abstract" element and one for the "main body" element, then a search operation satisfying a structure condition stating: "Find a group of sentences in the title of a paragraph that includes a string of characters OO" cannot be carried out.
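The limitation in this example can be illustrated with a minimal sketch of the relation between sub-structure names and condensed-text files. The file names, the dictionary `CONDENSED_FILES` and the function `files_to_search` are hypothetical; only the two element names come from the example.

```python
# Relation fixed when the database was constructed: one condensed-text file per element.
CONDENSED_FILES = {
    "abstract": ["abstract.cnd"],    # hypothetical file name
    "main body": ["main_body.cnd"],  # hypothetical file name
}

def files_to_search(structure_name: str) -> list[str]:
    """Return the condensed-text files that serve as the search object for a structure."""
    if structure_name not in CONDENSED_FILES:
        # A structure that was not predicted at construction time has no file to search.
        raise ValueError(f"no condensed-text file was prepared for {structure_name!r}")
    return CONDENSED_FILES[structure_name]

main_body_files = files_to_search("main body")
```

A request for "paragraph title" raises an error in this sketch, because the title text exists only inside the file for the whole "main body" element and cannot be searched on its own.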
Instead of treating the "main body" element as a single condensed-text file, the title of each paragraph and the sections composing the element can each be treated as a condensed-text file, allowing a search operation satisfying the structure condition described above to be carried out. Even if such condensed-text files are provided, however, a search operation will not be able to handle structure conditions such as ones stating: "Find a group of sentences including a string of characters OO inside the first paragraph (which can be either the title of the first paragraph or a section in the first paragraph)," or "Find a group of sentences including a string of characters XX in the last section of a paragraph." In order to handle a structure condition including such a specification of a specific position of a search term, a condensed-text file needs to be provided separately in advance for each appearance of a paragraph and of a section. In this case, two problems arise. First, the number of condensed-text files provided for paragraphs and sections becomes extremely large, because such paragraphs and sections can appear in an element in any arbitrary manner. Second, a search operation satisfying such a condition still cannot actually be carried out, because the method described in JP '311 is not provided with a means for associating a structure condition that includes an arbitrary specification of a position of appearance of a search term with the set of small condensed-text files resulting from finely disassembling each element.
It is thus impossible to include an order of appearance condition in the specification of a structure condition as described above, so that a search operation with a very detailed structure specification cannot be carried out.