1. Field of the Invention
The present invention relates to a method for storing and retrieving data from large text corpora, and more particularly, to a fault-tolerant method of distributing, indexing and retrieving information in a distributed data retrieval system.
2. Description of the Art
Data storage, indexing and retrieval become increasingly difficult as databases become more extensive. When the amount of data that needs to be stored and searched exceeds certain limits, maintaining all of it as a single collection with a single index and searching that single index becomes inefficient and prone to failure. This is especially true in the case of the World Wide Web, as the amount of data available on the Web is increasing exponentially each year. Currently available techniques for searching a database as extensive as the Web for a particular piece of data may yield incomplete results.
Many techniques have been developed in an attempt to address the searching of ever-increasing databases. For example, U.S. Pat. No. 5,675,786, issued Oct. 7, 1997 to McKee et al., relates to accessing data held in large computer databases by sampling the initial result of a query of the database. Sampling of the initial result is achieved by setting a sampling rate which corresponds to the intended ratio at which the data records of the initial result are to be sampled. The sampling result is substantially smaller than the initial query result and is thus easier to analyze statistically. While this method decreases the amount of data sent to the end user as a result of the query, it still requires an initial search of what could be a massive database. Further, depending upon the sampling rate, sampling may reduce the accuracy of the information sent to the end user and may thus not provide the intended result.
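By way of illustration only, sampling an initial query result at a set rate might be sketched as follows. The sketch assumes systematic sampling (every k-th record, where k is the rounded reciprocal of the sampling rate); nothing here indicates that the cited patent uses this particular scheme, and the name sample_result is hypothetical.

```python
def sample_result(records, sampling_rate):
    """Return a systematic sample of an initial query result.

    Assumption for illustration: every k-th record is kept, where
    k = round(1 / sampling_rate). This is not taken from the cited patent.
    """
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in the interval (0, 1]")
    step = max(1, round(1 / sampling_rate))
    # The full initial result must already exist before it can be sampled.
    return records[::step]


initial_result = list(range(100))          # stand-in for 100 matching records
sampled = sample_result(initial_result, 0.1)
print(len(sampled))                        # 10 records reach the end user
```

Note that the full initial result is computed before any record is discarded, and that records falling between sampled positions never reach the end user, which corresponds to the two limitations noted above.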
Another example, U.S. Pat. No. 5,642,502, issued Jun. 24, 1997 to Driscoll, relates to a method and system for searching and retrieving documents in a database. A first search and retrieval result is compiled on the basis of a query. Each word in both the query and the search result is given a weighted value, and the weighted values are then combined to produce a similarity value for each document. Each document is ranked according to its similarity value and the end user chooses documents from the ranking. On the basis of the documents chosen from the ranking, the original query is updated in a second search and a second group of documents is produced. The second group of documents is intended to have the documents more relevant to the query closer to the top of the list. While more relevant documents may be found as a result of the second search, the patent does not address the problems associated with the searching of a large database and, in fact, might only compound them.
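By way of illustration only, the two-pass search described above might be sketched as follows. The weighting (raw term frequency), the similarity measure (sum of products of shared word weights) and all function names are assumptions chosen for brevity; they are not stated here to be the weighting or similarity measure of the cited patent.

```python
from collections import Counter


def term_weights(text):
    """Hypothetical weighting: the raw frequency of each lowercase word."""
    return Counter(text.lower().split())


def similarity(query_weights, doc_weights):
    """Combine the weights of words shared by the query and a document."""
    return sum(w * doc_weights[t] for t, w in query_weights.items()
               if t in doc_weights)


def rank(query_weights, docs):
    """Return document ids ordered by decreasing similarity value."""
    scores = {doc_id: similarity(query_weights, term_weights(text))
              for doc_id, text in docs.items()}
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def update_query(query_weights, chosen_texts):
    """Fold the words of the user-chosen documents back into the query."""
    updated = Counter(query_weights)
    for text in chosen_texts:
        updated.update(term_weights(text))
    return updated


docs = {"a": "index search index", "b": "search web", "c": "cooking recipes"}
first_query = term_weights("index search")
first_ranking = rank(first_query, docs)                 # first search
second_query = update_query(first_query, [docs["b"]])   # user chose "b"
second_ranking = rank(second_query, docs)               # second search
```

Both searches in the sketch score every document in the collection, so the cost of the second search is added to that of the first rather than replacing it.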
Yet another example, U.S. Pat. No. 5,265,244, issued Nov. 23, 1993 to Ghosh et al., relates to a method and apparatus for data access using a particular data structure. The structure has a plurality of data nodes, each for storing data, and a plurality of access nodes, each for pointing to another access node or a data node. Statistical information is associated with, and stored in, a subset of the access nodes and data nodes. Thus, statistical information can be retrieved using statistical queries that isolate the subset of the access nodes and data nodes containing the statistical information. While the patent may save time in terms of access to the statistical information, user access to the actual data records requires further procedures.
Thus, as can be seen, while attempts have been made to increase efficiency of data storage and retrieval, there still remains a need for an efficient and effective method of distributed information management in a large database.