The present invention relates to a data base retrieval system for extracting necessary information from a data base.
In an existing data base searching technique, keyword addition is generally used as a search space compression method. When the number of objective records is relatively small, a full-record search method can be used. For example, the Boyer-Moore method has been proposed as an efficient full-record search method. Furthermore, an index method for automatically extracting a keyword from a search object, and generating an index is also known.
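As an illustration of this kind of full-record search, the following is a minimal sketch of the Horspool simplification of the Boyer-Moore method (bad-character shifts only, not the full Boyer-Moore algorithm). The function name `horspool_search` is hypothetical and does not come from any cited reference.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: for every character of the pattern except the
    # last one, the distance from its last occurrence to the pattern end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift by the table entry for the text character aligned with the
        # last pattern position; unknown characters allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

Because characters absent from the pattern allow a shift of the full pattern length, many text positions are never examined, but every record must still be scanned.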
The keyword search method suffers from the following drawbacks:
(1) A keyword must be added to each record;
(2) When arbitrary keywords are added, the number of keywords becomes very large; management using, e.g., a thesaurus is therefore required, and maintenance costs are considerable; and
(3) Since the keywords that are added are not always appropriate, search omissions occur.
More specifically, in the existing data base retrieval method, especially when the number of documents (i.e., the number of records) becomes very large, there is a tendency for performance not to be improved in proportion to required cost.
On the other hand, the full-record search method does not pose the above-mentioned problems. However, in an existing direct search method, when the number of records becomes very large, the search time considerably exceeds the time range acceptable for interrogation, and the method is not practical. The full-record search method is based on complete coincidence, and cannot perform fuzzy coincidence searching. In the full-record search method based on the Boyer-Moore method, data other than documents (e.g., physical time-series data) cannot be processed.
As a method for performing full-record searching, a method disclosed in Japanese Patent Laid-Open No. 3-174652 is known. In this method, an index table, i.e., a character component table using entry characters as indices is formed in advance on the basis of search objective records, thereby narrowing the search range upon execution of full-record searching. However, since full-record searching is performed in the narrowed search range, the search time is long, and fuzzy coincidence searching cannot be performed.
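The narrowing step can be sketched as follows. This is only one plausible reading of a character component table, assuming the table maps each character to the set of records containing it; the details of the cited publication are not reproduced here, and the names `build_char_table` and `search` are hypothetical.

```python
from collections import defaultdict

def build_char_table(records):
    """Map each character to the set of record numbers that contain it."""
    table = defaultdict(set)
    for rid, text in enumerate(records):
        for ch in set(text):
            table[ch].add(rid)
    return table

def search(records, table, query):
    """Narrow candidates with the table, then full-scan the survivors."""
    if not query:
        return list(range(len(records)))
    candidates = None
    for ch in set(query):
        ids = table.get(ch, set())
        candidates = ids if candidates is None else candidates & ids
        if not candidates:
            return []
    # Complete-coincidence check is still required on every candidate.
    return sorted(rid for rid in candidates if query in records[rid])
```

The final scan over the candidates is the step that keeps the search time long, and since it tests exact substring containment, a misspelled or partially matching query finds nothing.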
Furthermore, the index method is suitable for documents such as English texts in which words are separated since the unit of information in such documents is a word. In this case, the index method requires some syntax analysis. The index method is not suitable for documents such as Japanese texts in which words are not separated. Furthermore, since a dictionary including all the possible sets of expressional variations of words must be formed, the system load is considerable.
Japanese Patent Laid-Open No. 3-125263 discloses a search method using a plurality of continuous character strings as indices. However, this method also performs complete coincidence searching, and cannot perform incomplete coincidence searching (i.e., fuzzy coincidence searching).
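A minimal sketch of indexing by continuous character strings, assuming two-character strings (bigrams) as the index unit, is given below. The choice of bigrams and the names `ngrams`, `build_ngram_index`, and `exact_search` are assumptions made for illustration, not details of the cited publication.

```python
from collections import defaultdict

def ngrams(text, n=2):
    """Return the set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_ngram_index(records, n=2):
    """Map each n-gram to the set of record numbers that contain it."""
    index = defaultdict(set)
    for rid, text in enumerate(records):
        for gram in ngrams(text, n):
            index[gram].add(rid)
    return index

def exact_search(records, index, query, n=2):
    """Return record numbers that contain query as an exact substring."""
    grams = ngrams(query, n)
    if not grams:  # query shorter than n: fall back to a full scan
        return [rid for rid, text in enumerate(records) if query in text]
    # A record can only match if it contains every n-gram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return sorted(rid for rid in candidates if query in records[rid])
```

With the records "retrieval system" and "retreival system", a query for "retrieval" returns only the first record. The transposed spelling in the second record lacks some of the query's bigrams and is discarded, which illustrates why such a method does not provide fuzzy coincidence searching.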
Such a data base retrieval system is required to compress and decode data to decrease the volume of data to be searched and reduce the required memory capacity.
The Huffman method, the Shannon-Fano method, the Gilbert-Moore method, the run-length coding method, and the like are known as typical methods of compressing and decoding data. Japanese Patent Laid-Open No. 2-78323 discloses a technique using the Huffman method.
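For reference, the Huffman method can be sketched as follows. This is a textbook-style illustration operating on characters and bit strings, not the technique of the cited publication; the function names are hypothetical.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free code in which frequent symbols get short codes."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 / 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free, so the first hit is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

Because the code lengths vary per symbol, the compressed records have variable length even when the original records do not.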
A method for fixing the size of all the records (e.g., to an L-byte length) is known to attain high-speed data storage and reference (access) operations to a data base when data to be searched has a variable length. According to this method, when an n-th record is to be accessed, the data at the position n × L bytes from the start address of a file can be read, and the storage location can be designated at a high speed. However, in this method, since the record size is set to be constant, insignificant dummy characters must be added to data having a smaller length than the predetermined size, and the data size is undesirably increased.
In contrast to this, according to a method of continuously writing variable-length data in a storage medium, insignificant dummy characters need not be added, and it is not necessary to increase the data size. However, according to this method, since various data record sizes are used, the records must be referred to sequentially in an access mode, and the reference (storage) position cannot be immediately obtained. Therefore, the access speed is decreased.
As described above, the conventional variable-length data storage and reference methods suffer from at least one of two drawbacks, i.e., an increase in data size and a decrease in access speed.
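The trade-off between the two conventional methods can be sketched as follows. The NUL padding byte, the 2-byte length prefix used to delimit variable-length records, and all function names are assumptions made for illustration; record numbers are counted from zero, and records are assumed not to end with the padding byte.

```python
PAD = b"\x00"  # dummy character used for padding (an assumption)

def pack_fixed(records, size):
    """Store every record in exactly `size` bytes, padding short ones."""
    out = bytearray()
    for rec in records:
        if len(rec) > size:
            raise ValueError("record longer than fixed size")
        out += rec + PAD * (size - len(rec))
    return bytes(out)

def read_fixed(blob, n, size):
    """Direct access: record n starts at byte offset n * size."""
    start = n * size
    return blob[start:start + size].rstrip(PAD)

def pack_variable(records):
    """Store records back to back, each with a 2-byte length prefix."""
    return b"".join(len(r).to_bytes(2, "big") + r for r in records)

def read_variable(blob, n):
    """Sequential access: skip over the n preceding records one by one."""
    pos = 0
    for _ in range(n):
        pos += 2 + int.from_bytes(blob[pos:pos + 2], "big")
    length = int.from_bytes(blob[pos:pos + 2], "big")
    return blob[pos + 2:pos + 2 + length]
```

In this sketch `read_fixed` computes the offset in one step but the file carries padding, whereas `read_variable` stores no padding but must walk through every preceding record to find the n-th one.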
The above-mentioned data base retrieval system checks whether or not records include a search key and lists as a search result data records including the search key.
The list of the search results is formed and preserved. However, when the number of records is large, or when the search results are sequentially preserved, the volume of data preserved in the list is large, and the memory device for storing the data requires a large memory capacity. In addition, since the time required for forming the list of the search results is prolonged, search work efficiency deteriorates.
In the above-mentioned searching operation, when searching is performed using a conditional expression (searching expression) consisting of a plurality of search keys, the conditional expression is formed by the plurality of search keys, and searching is performed using the formed expression. For example, a conditional expression ((A or B or C) and D) is formed by keys A, B, C, and D, and full-record searching is performed using this expression.
However, since such a searching operation uses a conditional expression consisting of a plurality of keys, the search time is very long, and cost performance is low when a condition is not satisfied. When searching is performed using a similar conditional expression, e.g., a conditional expression ((A or B or C) and E) similar to the above-mentioned conditional expression, a partial logical condition (A or B or C) of searching that has already been calculated cannot be reutilized and must be searched again, resulting in poor efficiency.
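The repeated work can be made concrete with a minimal sketch, assuming each key is evaluated by one full scan of all records and that a `counter` dictionary is used only to count those scans. The names `scan` and `evaluate` are hypothetical.

```python
def scan(records, key, counter):
    """Full-record search for one key; counts how many scans are run."""
    counter["scans"] += 1
    return {rid for rid, text in enumerate(records) if key in text}

def evaluate(records, or_keys, and_key, counter):
    """Evaluate ((k1 or k2 or ...) and and_key) from scratch."""
    either = set()
    for key in or_keys:
        either |= scan(records, key, counter)
    return sorted(either & scan(records, and_key, counter))
```

Evaluating ((A or B or C) and D) and then ((A or B or C) and E) with this sketch runs eight full scans, six of which recompute the shared partial condition (A or B or C) because nothing is retained between the two searches.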