Records such as electronic mail (email), financial statements, meeting memos, experimental logs, and quality assurance documents are valuable assets. Key decisions in business operations and other critical activities are based on information in these records. Consequently, these records must be maintained in a trustworthy fashion that protects them from improper destruction or modification while keeping them readily accessible. Businesses increasingly store these records electronically, making them relatively easy to delete and modify without leaving much of a trace. Ensuring that records are readily accessible, accurate, credible, and irrefutable is particularly imperative given recent legal and regulatory trends.
As critical data are increasingly stored in electronic form, it is imperative that the critical data be stored reliably in a tamper-proof manner. Furthermore, a growing subset of electronic data (e.g., email, instant messages, drug development logs, and medical records) is subject to regulations governing long-term retention and availability of the data. Recent high-profile accountability issues at large public companies have further caused regulatory bodies such as the Securities and Exchange Commission (SEC) to tighten their regulations. A requirement in many such regulations is that data must be stored reliably in non-erasable, non-rewritable storage such that the data, once written, cannot be altered or overwritten. Such storage is commonly referred to as WORM (Write-Once Read-Many) storage, as opposed to WMRM (Write-Many Read-Many) storage, which can be written many times.
However, storing records in WORM storage is inadequate to ensure that the records are trustworthy, i.e., able to provide irrefutable evidence of past events. The key issue is that critical data requires some form of organization such that all of the data relevant to an enquiry can be promptly discovered and retrieved. Scanning a large volume of data in its entirety to discover entries that are relevant to an enquiry is not practical. Instead, some form of direct access mechanism, such as an index, must be built on the data to support efficient access.
If an index through which a record is accessed can be suitably manipulated, the record can, for all practical purposes, be hidden or deleted, even if the record is stored in WORM storage. For example, if the index entry pointing to the record is removed or made to point to a different record, the original record becomes inaccessible. Hence, the index itself must be maintained in a trustworthy fashion.
To address the need for a trustworthy index, fossilized indexes have been developed that are impervious to such manipulations when maintained on WORM storage. One such index is the generalized hash tree, which supports exact-match lookups of records based on attribute values and hence is most suitable for use with structured data. Although such indexing schemes have proven to be useful, it would be desirable to present additional improvements. Most business records, such as email, memos, and meeting minutes, are unstructured or semi-structured. The natural query interface for these records is feature (keyword) search, where the user provides a list of features and receives a list of records that contain some or all of the features. Feature-based searches are handled by an inverted index.
An inverted index (or index) comprises a dictionary of features and, for each feature, an associated posting list of record identifiers and additional metadata such as feature frequency, feature type, feature position, etc. A trustworthy inverted index requires the posting list entries for a record and a path to those entries to be durable and immutable. This required immutability may be achieved by keeping each posting list in an append-only object (e.g., a block or a file) in WORM storage. The index can be updated when a new record is added by appending a record identifier (ID) of the new record to the posting lists of all the features contained in the new record. However, this operation can be prohibitively slow, as each append may require a random I/O. For an exemplary set of records in which a record comprises 500 features on average and an append incurs a 2 ms random I/O, each record costs about 500 × 2 ms = 1 second to index, so the index update rate could be as low as 1 record per second.
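The append-per-feature update and its cost can be illustrated with a minimal sketch. The class and function names below (AppendOnlyInvertedIndex, update_rate) are hypothetical and chosen for illustration only; the in-memory lists merely stand in for append-only objects on WORM storage, and each append is counted as one random I/O, as assumed in the example above.

```python
from collections import defaultdict


class AppendOnlyInvertedIndex:
    """Toy in-memory model: posting lists only grow; nothing is removed or rewritten."""

    def __init__(self):
        self._postings = defaultdict(list)  # feature -> list of record IDs
        self.appends = 0  # each append stands in for one random I/O (assumption)

    def add_record(self, record_id, features):
        # Append the record ID to the posting list of every distinct feature.
        for feature in sorted(set(features)):
            self._postings[feature].append(record_id)
            self.appends += 1

    def lookup(self, feature):
        # Return an immutable copy so that callers cannot alter the posting list.
        return tuple(self._postings.get(feature, ()))


def update_rate(features_per_record, seconds_per_append):
    """Records indexed per second if every append costs one random I/O."""
    return 1.0 / (features_per_record * seconds_per_append)


index = AppendOnlyInvertedIndex()
index.add_record(1, ["merger", "audit", "q3"])
index.add_record(2, ["audit", "memo"])
print(index.lookup("audit"))     # (1, 2)
print(index.appends)             # 5
print(update_rate(500, 0.002))   # about 1 record per second
```

Under these assumptions, the number of random I/Os grows with the number of distinct features in each record, which is why an unbuffered update of this kind is slow.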
Conventional approaches for supporting inverted index updates amortize the cost of random I/O by buffering the index entries of the new records in memory or on disk and committing these index entries to the index in batches. Specifically, the features of newly arriving records are appended to an in-memory or on-disk log comprising <feature, record ID> pairs. This log is periodically sorted on feature to create an inverted index for the new records, which is then merged with the original inverted index. Although this technology has proven to be useful, it would be desirable to present additional improvements. Researchers have found that this strategy is effective primarily when a large number of index entries are buffered. For example, over 100,000 records might have to be buffered to achieve an index update rate of 2 records per second.
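The buffer, sort, and merge cycle described above can be sketched as follows. This is an illustrative sketch only, not any particular system's implementation; the function names are hypothetical, and the lag estimate assumes that records arrive at the same rate at which the index is updated.

```python
from itertools import groupby
from operator import itemgetter


def build_batch_index(log):
    """Sort a log of (feature, record_id) pairs on feature and group into posting lists."""
    ordered = sorted(log)  # sorts on feature first, then on record ID
    return {feature: [record_id for _, record_id in group]
            for feature, group in groupby(ordered, key=itemgetter(0))}


def merge_into(main_index, batch_index):
    """Append each batch posting list to the matching posting list of the main index."""
    for feature, record_ids in batch_index.items():
        main_index.setdefault(feature, []).extend(record_ids)
    return main_index


def buffering_lag_seconds(buffered_records, records_per_second):
    """Time needed to accumulate one batch at a given arrival rate (assumption)."""
    return buffered_records / records_per_second


log = [("audit", 3), ("memo", 3), ("audit", 4)]   # buffered, uncommitted entries
main = {"audit": [1, 2]}                          # previously committed index
merge_into(main, build_batch_index(log))
print(main)                                       # {'audit': [1, 2, 3, 4], 'memo': [3]}
print(buffering_lag_seconds(100_000, 2) / 3600)   # roughly 13.9 hours
```

Until merge_into runs, the entries for records 3 and 4 exist only in the log, which is the exposure that buffering introduces.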
Buffering creates a time lag, about half a day for the previous example, between the time a record is created and the time the index is updated to include the record. This time lag is inconsistent with maintaining a trustworthy index. Such a time lag provides a window in which an adversary can modify the index by, for example, deleting an index entry while it is still in the buffer, or crashing the indexing system and deleting the recovery logs of uncommitted posting list entries.
Keeping the recovery logs on WORM storage also does not guarantee the trustworthiness of the inverted index. Scanning the entire log on every restart is inefficient, while relying on an end-of-committed-log marker is insecure. An adversary can append markers to fool the application into believing that no recovery is required.
The time lag between when a record is compiled and when an adversary may regret the existence of the record is domain-specific and has no a priori lower bound. Furthermore, any delay in committing index entries introduces unnecessary risk and complexity in the compliance process. For example, the prevailing interpretation of e-mail retention regulations is that a regulated e-mail is required to be committed as a record before it is delivered to a mailbox. Thus, generic compliance indexing should not assume any safe time window for committing index entries after returning to the invoking application. A trustworthy index should be updated online, as new records are added.
Search engines answer multi-keyword conjunctive queries (queries in which more than one of the features are required to be contained in the record) by calculating the intersection of the posting lists of the query keywords. To speed up these intersections, additional index structures such as B+ trees are typically maintained on the posting lists. An adversary can effectively conceal a record if the record can be omitted from these posting list indexes; hence, such index structures must also be secured by fossilization. Researchers have shown that index structures like B+ trees cannot be fossilized easily. Hence, although B+ trees have proven to be useful in conventional settings, they cannot be directly used in a trustworthy index.
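A conjunctive query over posting lists can be sketched as a pairwise intersection. The sketch below uses a plain linear merge rather than a B+ tree, and it assumes that each posting list is sorted by ascending record ID, as would be the case if IDs are assigned in increasing order and lists are append-only. The function names are hypothetical.

```python
def intersect(a, b):
    """Intersect two posting lists sorted by ascending record ID using a linear merge."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def conjunctive_query(index, keywords):
    """Return the IDs of records that contain every keyword in the query."""
    # Start from the shortest list so that intermediate results stay small.
    lists = sorted((index.get(keyword, []) for keyword in keywords), key=len)
    if not lists:
        return []
    result = lists[0]
    for posting_list in lists[1:]:
        result = intersect(result, posting_list)
    return result


index = {"audit": [1, 2, 5, 9], "memo": [2, 3, 9], "q3": [2, 9, 11]}
print(conjunctive_query(index, ["audit", "memo", "q3"]))  # [2, 9]
```

The linear merge touches every entry of both lists; auxiliary structures such as B+ trees exist to skip ahead within a long list, which is why their integrity matters as much as that of the posting lists themselves.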
Conventional secure indexing systems, such as Merkle hash trees and authenticated dictionaries, have been developed for a threat model in which the data store is untrusted. A Merkle hash tree lets one verify the authenticity of any tree node entry by trusting the signed hash value stored at the root node. Authenticated dictionaries support secure lookup operations for dictionary data structures. These conventional systems rely on the data owner to sign data and index entries appropriately. In the present threat model, an all-powerful adversary (for example, a CEO) can assume the identity of the data owner and modify the data or indexes by re-signing them. Hence, although these technologies have proven to be useful in specific threat models, they are inapplicable here.
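The limitation of signature-based schemes under this threat model can be illustrated with a minimal sketch. It is a simplified binary Merkle tree, not a production design; an HMAC computed with a hypothetical owner key stands in for the data owner's digital signature, and all names are assumptions made for illustration.

```python
import hashlib
import hmac


def merkle_root(leaves):
    """Hash each leaf, then hash adjacent pairs upward until one root digest remains."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


def sign(key, root):
    # HMAC stands in for the owner's signature over the root (assumption).
    return hmac.new(key, root, hashlib.sha256).digest()


def verify(key, leaves, signature):
    return hmac.compare_digest(sign(key, merkle_root(leaves)), signature)


owner_key = b"hypothetical-owner-key"
records = [b"memo A", b"memo B", b"memo C"]
signature = sign(owner_key, merkle_root(records))

# An outsider who alters a record without the key is detected.
tampered = [b"memo A", b"memo X", b"memo C"]
print(verify(owner_key, tampered, signature))  # False

# An insider who controls the owner's key simply re-signs the altered data.
forged = sign(owner_key, merkle_root(tampered))
print(verify(owner_key, tampered, forged))     # True
```

The second check passes because verification trusts whoever holds the signing key, which is exactly the party the present threat model treats as the adversary.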
What is therefore needed is a system, a computer program product, and an associated method for providing a trustworthy inverted index to enable searching of records. The trustworthy inverted index should prevent hiding or modifying of a record through modification of the inverted index. The trustworthy inverted index should be relatively inexpensive with respect to random I/Os and require no time lag between commit of a record and update of the inverted index to include the record. The need for such a solution has heretofore remained unsatisfied.