1. Field of the Invention
This invention relates to a data retrieval system and, in particular, to a method of storing data on and retrieving data from a database using a signature file that is divided into subsets, the signature file being automatically created as the data is stored.
2. Description of the Prior Art
Data storage and retrieval systems are known. With the creation of larger and larger databases and the increased use thereof, it is becoming increasingly important to have an accurate method of storing and retrieving the data in a minimum amount of time. It is also important to be able to add information to an existing database without making extensive amendments to the existing information.
In many known databases, the information is highly organized as a carefully indexed structure resident on a storage medium such as a disk. When a portion of the data is to be retrieved, the system makes use of this indexing information in order to locate the required data, which may be embedded in a huge collection of similar data items. Indexing information is stored in an extra file. There are various ways of establishing said index. One possible approach is the use of a signature file; another is the use of an inverted file. The latter approach is often used since it provides fast retrieval, but it has two significant shortcomings:
(a) the size of an inverted file is extremely large, being 20% to 100% of the size of the text file itself;
(b) when new data is to be entered into the database, the inverted file must be changed and the highly structured nature of the file makes modification of said file a very time consuming process since significant portions of the file must be altered.

(a) storing the database on the data storage modules;
(b) during loading, said system automatically creating for the database a signature file which is divided into subsets, mapping a word signature to a particular subset during creation of the file and storing said signature file subsets on said data storage modules;
(c) during retrieval, after the signature file is created, entering at least one query word into the system;
(d) said system automatically creating a signature for each query word entered into the system;
(e) scanning for a word signature and retrieving the corresponding data from said database in response to a query word by using the same mapping information that was used to store the word signature in a particular subset, said system matching the signature of a query word with at least one word signature in one subset of said signature word if such an appropriate word signature exists in this subset.
While signature files provide extremely fast update capabilities, they are usually not the index strategy of choice since the retrieval time can be very slow. The slow retrieval time is caused by the scanning of the entire signature file, which is very time consuming because the required transfer time from the disk is usually lengthy. The invention to be described uses signature files but with a strategy that makes them very competitive with the inverted file approach. Rather than scanning the entire file, only a subset of it is scanned. This significantly lowers the access time and, if the system is carefully designed, the modification time is still kept fairly low.
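The subset-scanning strategy can be illustrated with a minimal sketch. The source does not specify how a word signature is computed or how it is mapped to a subset, so everything named here is an assumption made for illustration: `word_signature` (a truncated SHA-256 hash), `subset_of` (a modulo mapping), `NUM_SUBSETS`, and the class `PartitionedSignatureFile`. The one property taken from the text is that the same mapping is used both when storing a signature and when resolving a query, so that only one subset has to be scanned.

```python
import hashlib

NUM_SUBSETS = 8  # assumed number of signature-file subsets


def word_signature(word):
    """Derive a 32-bit signature from a word (hypothetical hash choice)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def subset_of(signature):
    """Map a signature to a subset; used identically for storing and querying."""
    return signature % NUM_SUBSETS


class PartitionedSignatureFile:
    def __init__(self):
        # Each subset is an append-only list of (signature, document id) entries.
        self.subsets = [[] for _ in range(NUM_SUBSETS)]
        self.entries_scanned = 0  # size of the subset scanned by the last query

    def add_document(self, doc_id, text):
        """Create one entry per distinct word while the document is loaded."""
        for word in set(text.lower().split()):
            sig = word_signature(word)
            self.subsets[subset_of(sig)].append((sig, doc_id))

    def query(self, word):
        """Scan only the subset that the query word's signature maps to."""
        sig = word_signature(word)
        subset = self.subsets[subset_of(sig)]
        self.entries_scanned = len(subset)
        return {doc_id for entry_sig, doc_id in subset if entry_sig == sig}
```

In this sketch a query touches one subset rather than the whole file, so the amount of data read per query word shrinks roughly in proportion to the number of subsets when entries are spread evenly. Because distinct words can in principle share a signature, a practical system would likely confirm each candidate document against the stored text.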
Of course, data can always be retrieved by scanning the entire database (avoiding the use of index files) but that is extremely time consuming and therefore prohibitively expensive.
When the data storage device is a write-once optical disk, an update problem occurs because it is impossible to write information onto a particular area of the optical disk more than once. Therefore, new information that would normally fit adjacent to existing information cannot be placed near that information, as there is no space for it at the desired location on the optical disk. Consequently, an index structure such as an inverted file cannot be changed due to this non-erasable attribute of the storage medium. It is possible to simply create a new file in a new area of the disk, but this strategy is very inefficient due to the extreme waste of disk space.
When signature files are used, new information added to the database causes the signature file to increase in size since new entries are appended to the file. Since the existing information in the signature file is not changed, the use of these files is of considerable advantage in the optical disk environment.
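The append-only behaviour can be sketched as follows. The record layout (a 32-bit signature followed by a 32-bit document identifier), the hash used for `word_signature`, and the function `append_document` are assumptions chosen for illustration; the source does not describe an on-disk format. The point of the sketch is only that adding a document writes new records at the end and leaves every earlier byte untouched, which is what a write-once medium requires.

```python
import hashlib
import io
import struct

# Assumed layout: big-endian 32-bit signature, 32-bit document identifier.
RECORD = struct.Struct(">II")


def word_signature(word):
    """Derive a 32-bit signature from a word (hypothetical hash choice)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def append_document(stream, doc_id, text):
    """Append one record per distinct word; earlier bytes are never rewritten."""
    stream.seek(0, io.SEEK_END)
    for word in sorted(set(text.lower().split())):
        stream.write(RECORD.pack(word_signature(word), doc_id))
```

A short usage example: after `append_document(buf, 1, "optical disk storage")` and a later `append_document(buf, 2, "magnetic disk")` on the same `io.BytesIO` buffer, the bytes written by the first call remain an unchanged prefix of the buffer.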
In many cases, the system responds to a user query by retrieving one or more documents that contain one or more words that the user has specified within that query. To accomplish this, indexing facilities are used to specify the locations of required information in the database. By issuing a list of document identifiers which serve to locate the documents that contain these keywords, the index facilities, working in conjunction with the query resolution software, determine a final list of documents satisfying the needs of the query.
The use of signature files to locate data in a database is a known strategy. A signature file is a condensation of the information in the database. This is accomplished by representing each distinct word in a document of the database with a word signature. When the system is presented with a particular query word, it will derive the word signature that has been associated with that database word. These types of systems then cause the entire signature file to be searched using a serial scan strategy and, subsequently, based on the results of that search, all documents in the database containing the word can be found. This occurs because any word signature in the signature file is followed by a document identifier for the document that contains the word from which the word signature was derived. Consequently, when a word signature in the signature file matches the word signature derived from the query word, during the scan process, the system will capture the accompanying document identifier in order to retain the identity of documents that are pertinent to the query. These systems can still be very time consuming if the entire signature file is searched for each query word.
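The conventional serial scan described above can be sketched as follows. The signature function (a truncated SHA-256 hash) and the names `build_signature_file` and `serial_scan` are assumptions for illustration only; the source does not say how signatures are derived. Each entry pairs a word signature with the identifier of the document containing that word, and every query compares its signature against every entry in the file.

```python
import hashlib


def word_signature(word):
    """Derive a 32-bit signature from a word (hypothetical hash choice)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_signature_file(documents):
    """Return a flat list of (signature, document id) entries.

    `documents` maps a document identifier to its text.
    """
    signature_file = []
    for doc_id, text in documents.items():
        for word in sorted(set(text.lower().split())):
            signature_file.append((word_signature(word), doc_id))
    return signature_file


def serial_scan(signature_file, query_word):
    """Compare the query signature with every entry, start to finish."""
    target = word_signature(query_word)
    matches = []
    for signature, doc_id in signature_file:
        if signature == target:
            matches.append(doc_id)  # capture the accompanying identifier
    return matches
```

The cost of `serial_scan` grows with the total number of entries in the file regardless of how many match, which is the retrieval-time weakness the passage identifies.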
Optical disks are the most economical means to store databases. However, seek times for an optical disk are typically four to more than thirty times as long as the seek time for a magnetic hard disk. When databases are searched using the inverted file approach, the system may undertake several probes of the index structure, with each probe possibly requiring a disk seek or disk arm movement. When a more expensive magnetic hard disk is used, the time requirements are tolerable, but they can become extremely undesirable when an optical disk is used.