This invention relates to data processing systems and more particularly to a system and method which provide for the high-speed search and retrieval of data stored in computer files. The system combines the ability to search text-oriented data with the ability to access data stored in record-oriented files. In addition, these two types of files can be searched simultaneously.
Using modern technologies, there are now numerous methods for creating computerized, text-oriented document files as well as field-oriented record files. Text files are routinely created directly, or transcribed from paper documents, using text processing and desktop publishing systems. This trend has been accelerated by the introduction of scanning systems which can "read" printed texts and automatically translate them into computer files. The number of record files also continues to grow due to the increasing use of such tools as the off-the-shelf Data Base Management System (DBMS).
With all of this data now in computer files, the problem becomes how to quickly access the data and find specific information stored in these files. For text-oriented data files there have been two main methods used. The first method entails the use of real-time pattern matching. This means that, at the time the search request is made, a character-by-character search is made through each file for a pattern which matches the search parameters. While this method can be used on small files on small computers or on larger files on very fast mainframe computers, it is not practical for medium or large files on small computers or for very large (gigabyte) files on mainframes. The problem is that even simple, single-term searches can consume a great deal of computer time, resulting in a slow response to the user. The other approach to the retrieval problem with text files is to employ a DBMS to create a database of the files which can be quickly referenced in a manner similar to the way in which DBMSs handle field-oriented records. Unfortunately, although this methodology can be relatively fast, it is also expensive in its use of data storage and difficult to use. The DBMS database file is usually larger (typically by a factor of 1.25 to 2) than the combined size of the original text files. In addition, these DBMSs consist of thousands of lines of code needed to administer the complex data files they create. Small computers are strained to support the size of these files and the added size of the DBMS programs necessary to administer them. Also, DBMSs often require the mastery of query languages which are specialized for the retrieval of record-oriented data rather than the more simply organized text documents.
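By way of illustration only, the first method described above, real-time pattern matching, can be sketched as follows. The function names `naive_search` and `search_files`, and the representation of files as an in-memory mapping from file name to contents, are assumptions made for this sketch and are not drawn from any particular prior-art system.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every offset at which pattern occurs in text."""
    matches: list[int] = []
    if not pattern:
        return matches
    last_start = len(text) - len(pattern)
    for start in range(last_start + 1):
        # Compare the pattern against the text one character at a time.
        offset = 0
        while offset < len(pattern) and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == len(pattern):
            matches.append(start)
    return matches


def search_files(files: dict[str, str], pattern: str) -> dict[str, list[int]]:
    """Scan every file in full at query time; no index is built or reused.

    `files` maps a (hypothetical) file name to its full text.
    """
    results: dict[str, list[int]] = {}
    for name, text in files.items():
        hits = naive_search(text, pattern)
        if hits:
            results[name] = hits
    return results
```

Because nothing is precomputed, every query in this sketch re-reads every character of every file, so the work per query grows with the total size of the files being searched. This is the cost which makes the approach slow on large files.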
For accessing record-oriented files, DBMSs have been the traditional solution. Again, although they are fast, there are several drawbacks to their use. First, the issue of the size of DBMS programs is the same as discussed earlier. DBMSs impose a significant overhead because of their size and data storage organization, even if the file to be accessed is relatively simple in its organization. Second, record-oriented DBMS programs usually cannot handle searches of both text-oriented and record-oriented files with the same query; they normally require the use of multiple queries, leaving the user to manually combine the output. Finally, DBMS programs can normally find and retrieve records in a query only if the entire field content is specified as the parameter for the search. This makes it difficult to do effective searches on files which have fields that are text-oriented documents, however short. A search based on a partial field description, such as a few words buried somewhere within such a "text field", cannot normally be accommodated by a DBMS with high-speed response.
The above-described problems are evident when existing methods are used to search for data in: 1) text-oriented files; 2) text-oriented files and record-oriented files at the same time; and 3) record-oriented files with free text fields. It is thus apparent that there is a need in the art for an improved search system that allows high-speed searching of large files of these types using small computers or small amounts of computing resources.