Businesses are increasingly generating and storing large amounts of email, instant messages, audit logs, financial records, and other digital information. In 2006, businesses sent over 3.5 exabytes of email, more than four times the amount in 2004. Such records are valuable assets and are needed for key business operation decisions. They are also increasingly used in internal, regulatory, and litigation investigations. The retention and maintenance of electronic records is now mandated by government regulations, e.g., the Sarbanes-Oxley Act and SEC Rule 17a-4.
Compliance record workloads are quite different from traditional file system or database workloads. For example, compliance records are unlikely to be queried until years later, and then by people other than their original creators. As a result, search-based lookup is considered the only feasible way to access such records. Conventional file system and database workloads, in contrast, use direct lookups based on metadata, pathnames, or exact SQL queries. This difference in access methods changes how such records are best stored and retrieved.
The natural query interface for semi-structured or unstructured business records such as email, memos, notes, and reports is keyword search. In a keyword query, the user provides a list of keywords and receives an ordered list of K documents judged to be the most relevant for that query. Search engines also display an abstract with each document in the ranked list, which includes the owner, creation time, some keywords, the document header, etc. The user accesses those documents in the top-K list that appear relevant to their information needs, and then either reformulates the query or, if satisfied, exits the session. With keyword-search-based access, documents that do not appear in the top-K of a query result are unlikely to be accessed through that query.
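The top-K interface described above can be illustrated with a minimal sketch. The scoring function here (a sum of raw term frequencies) is an assumption chosen for brevity, not the ranking function of any particular search engine, and the names `top_k` and `docs` are hypothetical.

```python
import heapq
from collections import Counter


def top_k(query, docs, k):
    """Return up to k (doc_id, score) pairs, most relevant first.

    Assumption: relevance is the summed term frequency of the query
    keywords in the document. Real engines use richer scoring.
    """
    terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in terms)
        if score > 0:  # documents matching no keyword are not returned
            scored.append((doc_id, score))
    # Highest score first; ties are broken by doc_id for determinism.
    return heapq.nsmallest(k, scored, key=lambda p: (-p[1], p[0]))


docs = {
    "a": "audit report audit",
    "b": "audit memo",
    "c": "lunch menu",
}
ranked = top_k("audit report", docs, k=2)
```

Documents that fall outside the returned list (here, any document past rank K or with no matching keyword) are never shown to the user, which is why they are unlikely to be accessed through that query.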
Traditional data caching schemes are based on heuristic models of data access. For example, temporal locality models assume that any data block accessed once is likely to be accessed again in the near future, and so is a good candidate for caching. Least recently used (LRU) caching exploits this temporal locality model by prioritizing records according to how recently they were accessed.
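For reference, a minimal LRU cache can be sketched as follows. This is a generic textbook formulation built on Python's `collections.OrderedDict`; the class name `LRUCache` and its methods are illustrative and do not come from the text.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used entry
cache.put("c", 3)  # evicts "b", the least recently used entry
```

Only records that are actually read or written change position, so the eviction order depends entirely on observed accesses.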
A compliance record workload is also likely to exhibit locality in document accesses and can benefit from caching. For example, keyword queries often exhibit strong locality. After entering a query, users are very likely to reformulate it and enter another, related query. Reformulated queries are often very similar to the original queries, give or take a few keywords. Consequently, there is likely to be a substantial overlap between the sets of documents relevant to the two queries.
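One way to make this overlap concrete is the Jaccard overlap between the result sets of two queries. The measure is an illustrative choice rather than one prescribed by the text, and `result_overlap` is a hypothetical helper.

```python
def result_overlap(results_a, results_b):
    """Jaccard overlap of two result lists: |A & B| / |A | B|."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty result lists
    return len(a & b) / len(a | b)


# Hypothetical top-4 results of an original query and its reformulation.
original = ["d1", "d2", "d3", "d4"]
reformulated = ["d2", "d3", "d4", "d5"]
overlap = result_overlap(original, reformulated)  # 3 shared out of 5 distinct
```

A high overlap suggests that documents cached for the first query remain useful candidates for the reformulated one.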
Locality is also exhibited across users in the form of popular queries. There is, however, a subtle difference between query locality and document access locality. If a user accesses the documents ranked first and fifth in one query execution, this does not imply that those documents are more likely to be accessed than, say, the documents ranked second and third in future executions of the same or a related query, whether by the same user or by different users. A user is less likely to access an already clicked document after moving on to a related query. Different users access different documents when the keyword query only approximates their document access needs, and they might not judge the same returned documents as relevant for the same query. Therefore, when a query is run, the caching priority of all the documents relevant for that query should be boosted, weighted by each document's relevance to the query. A simple LRU scheme, in contrast, would only consider the documents actually accessed by the user.
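The idea of boosting every relevant document, weighted by relevance, can be sketched as below. The text states only that priorities should be boosted in proportion to relevance; the additive boost, the multiplicative `decay` of older priorities, and the lowest-priority eviction rule are all assumptions made for this illustration, as are the class and method names.

```python
class RelevanceWeightedCache:
    """Tracks cached documents by an accumulated relevance priority."""

    def __init__(self, capacity, decay=0.5):
        self.capacity = capacity
        self.decay = decay   # assumed aging factor for older boosts
        self.priority = {}   # doc_id -> priority of each cached document

    def on_query(self, ranked_results):
        """Boost every returned document, clicked or not.

        ranked_results: iterable of (doc_id, relevance) pairs.
        """
        # Age existing priorities so that recent queries dominate.
        for doc_id in self.priority:
            self.priority[doc_id] *= self.decay
        # Boost each result by its relevance score for this query.
        for doc_id, relevance in ranked_results:
            self.priority[doc_id] = self.priority.get(doc_id, 0.0) + relevance
        # Evict the lowest-priority documents until within capacity.
        while len(self.priority) > self.capacity:
            victim = min(self.priority, key=lambda d: (self.priority[d], d))
            del self.priority[victim]

    def cached(self):
        return set(self.priority)


cache = RelevanceWeightedCache(capacity=3)
cache.on_query([("d1", 0.9), ("d2", 0.7), ("d3", 0.4), ("d4", 0.1)])
first = cache.cached()
cache.on_query([("d2", 0.8), ("d5", 0.6)])  # a related, reformulated query
second = cache.cached()
```

Unlike the LRU scheme, no click information is needed: a document ranked highly by successive related queries accumulates priority even if the user never opened it.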