1. Field of the Invention
The present invention relates to a method for storing and retrieving data from large text corpora and, more particularly, to a fault-tolerant method of distributing, indexing and retrieving information in a distributed data retrieval system.
2. Description of the Art
Data storage, indexing and retrieval from large databases become increasingly difficult as the databases become more extensive. When the amount of data that needs to be stored and searched exceeds certain limits, maintaining all of it as a single collection with a single index and searching that single index becomes inefficient and prone to failures. This is especially true in the case of the World Wide Web, as the amount of data available on the Web is increasing exponentially each year. Currently available techniques for searching a database as extensive as the Web for a particular piece of data may yield incomplete results.
Many techniques have been developed in an attempt to address the searching of ever-growing databases. For example, U.S. Pat. No. 5,675,786, issued Oct. 7, 1997 to McKee et al., relates to accessing data held in large computer databases by sampling the initial result of a query of the database. Sampling of the initial result is achieved by setting a sampling rate which corresponds to the intended ratio at which the data records of the initial result are to be sampled. The sampling result is substantially smaller than the initial query result and is thus easier to analyze statistically. While this method decreases the amount of data sent to the end user as a result of the query, it still requires an initial search of what could be a massive database. Further, depending upon the sampling rate, sampling may reduce the accuracy of the information sent to the end user and may thus not provide the intended result.
Another example, U.S. Pat. No. 5,642,502, issued Jun. 24, 1997 to Driscoll, relates to a method and system for searching and retrieving documents in a database. A first search and retrieval result is compiled on the basis of a query. Each word in both the query and the search result is given a weighted value, and the weighted values are then combined to produce a similarity value for each document. Each document is ranked according to its similarity value, and the end user chooses documents from the ranking. On the basis of the documents chosen from the ranking, the original query is updated in a second search and a second group of documents is produced. The second group of documents is supposed to have the documents more relevant to the query closer to the top of the list. While more relevant documents may be found as a result of the second search, the patent does not address the problems associated with the searching of a large database and, in fact, might only compound them.
Yet another example, U.S. Pat. No. 5,265,244, issued Nov. 23, 1993 to Ghosh et al., relates to a method and apparatus for data access using a particular data structure. The structure has a plurality of data nodes, each for storing data, and a plurality of access nodes, each for pointing to another access node or a data node. Information, of a statistical nature, is associated with a subset of the access nodes and data nodes in which the statistical information is stored. Thus statistical information can be retrieved using statistical queries which isolate the subset of the access nodes and data nodes which contain the statistical information. While the patent may save time in terms of access to the statistical information, user access to the actual data records requires further procedures.
Thus, as can be seen, while attempts have been made to increase efficiency of data storage and retrieval, there still remains a need for an efficient and effective method of distributed information management in a large database.
Accordingly, the present invention is directed to providing a method of enabling effective and efficient storage, indexing, searching and retrieval of information from large text corpora. Regardless of the size of the text corpora, the present invention allows data queries to be searched effectively and efficiently and the appropriate information to be retrieved.
Large global collections of data are broken down into smaller sub-collections. The sub-collections can be stored independently of one another, either in separate physical locations or simply in separate data tables within the same physical location, and can be connected to one another through a network. As data are added to the large global collection, they can be sent and added to individual sub-collections and/or can be formed into a further sub-collection. For instance, data entered by educational institutions and scientific research facilities can be stored independently in their own data storage facilities and connected to one another via a network, such as the Internet. Thus, as can be seen, the present invention can be implemented with very little or no change in the present protocol for data collection and storage.
Once the individual sub-collections have been identified, each performs its own indexing function. In carrying out the indexing function, each sub-collection creates its own sub-collection view consisting of statistical information generated from what is commonly referred to as an inverted index. An inverted index is an index keyed by individual words, listing for each word the documents which contain it. The indexing function itself can be carried out by any method. For example, indexing can be performed by assigning a weight to each word contained in a document. From the weights assigned to the words in each document, a sub-collection view (i.e., the statistical information derived from the inverted index) is created upon completion of the indexing function. Regardless of how the sub-collection indexing is carried out, each sub-collection will have its own independent sub-collection view based upon that sub-collection's inverted index. When information is added to the sub-collection, the indexing function is carried out again and the sub-collection's view can be re-compiled from a new inverted index.
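By way of illustration only, the indexing step described above might be sketched as follows. The function names, the whitespace tokenization, and the choice of a document count plus per-word document frequencies as the "view" are all assumptions made for this sketch; the description leaves the indexing method and the exact statistics open.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase whitespace tokenization stands in for real parsing.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def sub_collection_view(docs):
    """Summarize a sub-collection as statistics derived from its inverted index."""
    index = build_inverted_index(docs)
    return {
        "num_docs": len(docs),
        # Document frequency: how many documents contain each word.
        "doc_freq": {word: len(ids) for word, ids in index.items()},
    }


# Hypothetical sub-collection of two short documents.
docs = {"d1": "distributed search index", "d2": "search engine"}
view = sub_collection_view(docs)
```

When documents are added to `docs`, calling `sub_collection_view` again corresponds to re-compiling the view from a new inverted index.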
Upon completion of each sub-collection view, the sub-collection view is sent to and/or gathered by a global collection custodian. The global collection custodian may request that each sub-collection send its sub-collection view, and/or each of the sub-collections may spontaneously send its sub-collection view to the global collection custodian upon completion. Regardless of whether the views are requested or spontaneously sent, upon collection at the global collection custodian of all of the sub-collections' views, the global collection custodian builds a "global view" on the basis of the sub-collection views. Necessarily, the global view is likely to be different from each of the individual sub-collection views. Once the global view has been compiled, it is sent back to each of the sub-collections.
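As a minimal sketch of how a custodian might combine views, assume each sub-collection view is a dictionary holding a document count and per-word document frequencies. That layout, and the name `build_global_view`, are assumptions of this example and are not specified in the description.

```python
from collections import Counter


def build_global_view(sub_views):
    """Sum document counts and per-word document frequencies across views."""
    total_docs = 0
    doc_freq = Counter()
    for view in sub_views:
        total_docs += view["num_docs"]
        # Counter.update adds the counts of words already seen.
        doc_freq.update(view["doc_freq"])
    return {"num_docs": total_docs, "doc_freq": dict(doc_freq)}


# Hypothetical views received from two sub-collections.
view_a = {"num_docs": 2, "doc_freq": {"search": 2, "index": 1}}
view_b = {"num_docs": 3, "doc_freq": {"search": 1, "corpus": 3}}
global_view = build_global_view([view_a, view_b])
```

The resulting global view differs from both inputs, which is the property the paragraph above relies on: a word that is common in one sub-collection may be comparatively rare across the whole collection.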
In this manner, a distributed data retrieval system is built and is ready for search and retrieval operations. To search for a particular piece of information, a system user simply enters a search query. The search query is passed to each individual sub-collection and used by each individual sub-collection to perform a search function. In performing the search function, each sub-collection uses the global view to determine search results. Search results across all of the sub-collections will therefore be based upon the same search criteria (i.e., the global view).
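One way to picture a sub-collection searching with the global view is a term-weighting scheme in which the rarity of a word is measured against the whole collection rather than the local documents. The sketch below uses a simple term-frequency times log-inverse-document-frequency score; this particular formula, the dictionary layout of the global view, and the name `score_documents` are assumptions chosen for illustration, not the scoring method of the invention.

```python
import math


def score_documents(query, docs, global_view):
    """Rank local documents using global, not local, word statistics."""
    n = global_view["num_docs"]
    df = global_view["doc_freq"]
    results = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        score = 0.0
        for term in query.lower().split():
            tf = words.count(term)
            if tf and df.get(term):
                # Rarer words across the whole collection weigh more.
                score += tf * math.log(1 + n / df[term])
        if score > 0:
            results.append((doc_id, score))
    # Highest score first.
    return sorted(results, key=lambda r: r[1], reverse=True)


# Hypothetical global view and local documents.
global_view = {"num_docs": 5, "doc_freq": {"search": 3, "index": 1, "corpus": 3}}
docs = {"d1": "distributed search index", "d2": "search engine"}
results = score_documents("index search", docs, global_view)
```

Because every sub-collection scores against the same `global_view`, scores produced at different sub-collections are directly comparable.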
The results of the search function are passed by each individual sub-collection to the global collection custodian, or the computer which initiated the search, and merged into a final global search result. The final global search result can then be presented to the system user as a complete search of all data information references.
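The merging step can be sketched as follows, assuming each sub-collection returns a list of `(document id, score)` pairs already sorted by descending score and that scores are comparable because they were all computed against the same global view. The function name `merge_results` and the `top_k` cutoff are hypothetical.

```python
import heapq
from itertools import islice


def merge_results(per_collection_results, top_k=10):
    """Merge descending-sorted result lists into one global ranking."""
    merged = heapq.merge(
        *per_collection_results, key=lambda r: r[1], reverse=True
    )
    # Take only the best top_k entries from the lazily merged stream.
    return list(islice(merged, top_k))


# Hypothetical results returned by two sub-collections.
results_a = [("a1", 2.5), ("a2", 0.4)]
results_b = [("b1", 1.7)]
final = merge_results([results_a, results_b])
```

Since each input list is already ordered, the merge does not need to re-sort the combined results, which keeps the work at the custodian (or at the computer that initiated the search) small.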
The present invention, including its features and advantages, will become more apparent from the following detailed description with reference to the accompanying drawings.