Relational databases store information in collections of tables, in which each table is organized into rows and columns. In traditional applications, the rows of a database table tend to be accessed with more or less uniform frequency; that is, the vast majority of accesses to the table are not skewed toward a relatively small number of rows. Accordingly, various index structures and caches have been developed for efficiently searching large tables on the assumption that the access pattern is more or less uniform. Specifically, index structures provide an easily searched mapping between row identifiers and key values derived from a column of the corresponding row. Many of these index structures, such as a B-tree index, are characterized by search times that are relatively uniform across access keys.
For applications performing text analysis, on the other hand, the majority of accesses are highly skewed to relatively few rows of a database table. For example, a natural language processing application for interpreting English documents may implement a lexicon using a table that contains a row for every English word. The pattern of accesses to this table is likely to be highly skewed in a Zipf distribution, because a small percentage of English words (around 10%) account for the vast majority (>85%) of words in an English document.
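The degree of skew described above can be estimated with a short calculation. The following is a minimal sketch that assumes an idealized Zipf distribution in which the word of rank r has a weight proportional to 1/r^s; the function name `zipf_coverage`, the default exponent of 1.0, and the vocabulary size in the usage line are illustrative assumptions, and real corpora deviate from this idealized model.

```python
def zipf_coverage(vocab_size, top_fraction, exponent=1.0):
    """Share of all word occurrences contributed by the most frequent words.

    Assumes an idealized Zipf model (an assumption, not measured data):
    the word of rank r occurs with weight 1 / r**exponent.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    # Number of top-ranked words to count, at least one.
    top_k = max(1, int(vocab_size * top_fraction))
    return sum(weights[:top_k]) / sum(weights)


if __name__ == "__main__":
    # Hypothetical vocabulary of 100,000 words; report the share of
    # occurrences attributable to the most frequent 10% of them.
    print(f"{zipf_coverage(100_000, 0.10):.1%}")
```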
Use of conventional relational database index structures to index this table, however, results in sub-optimal performance for natural language processing applications, because the search time for very frequently accessed keys is no less than the search time for rarely accessed keys. What is needed, therefore, is a caching methodology such that searching a table of lexical data for frequently accessed keys results in search times that are significantly smaller than search times for rarely accessed keys. Addressing this need is complicated by the fact that the most frequently used words for a specific topic or set of related documents, aside from a relatively small set of about 1500 words, vary greatly from topic to topic. Therefore, it is difficult to determine statically, ahead of time, the 40,000–60,000 words to put in a lexical cache for a topic so that the cache would have an acceptable hit rate of about 95%.
Furthermore, this need is particularly acute as natural language processing applications grow to access huge tables storing lexical entities such as words and phrases. For example, a lexicon may include 600,000 words and phrases, and the industry trend is toward dramatically larger lexicons.
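The contrast between a uniform-cost index and a cache that adapts to the words actually encountered can be illustrated as follows. This is a minimal sketch only: it uses a generic least-recently-used (LRU) cache, which is not asserted to be the caching methodology called for above, and it substitutes a binary search over a sorted list for a B-tree index. The class `CachedLexicon` and its members are hypothetical names introduced for illustration.

```python
from bisect import bisect_left
from collections import OrderedDict


class CachedLexicon:
    """Hypothetical lexicon: a sorted-list index fronted by a small LRU cache."""

    def __init__(self, entries, cache_size):
        # entries maps each word to its lexical data.
        self._keys = sorted(entries)
        self._values = [entries[key] for key in self._keys]
        self._cache = OrderedDict()
        self._cache_size = cache_size
        self.hits = 0
        self.misses = 0

    def _index_lookup(self, word):
        # Stand-in for a conventional index: the cost is O(log n) for
        # every key, no matter how frequently that key is requested.
        i = bisect_left(self._keys, word)
        if i < len(self._keys) and self._keys[i] == word:
            return self._values[i]
        return None

    def lookup(self, word):
        if word in self._cache:
            # Frequently requested words are served here without
            # touching the index, and are marked as recently used.
            self._cache.move_to_end(word)
            self.hits += 1
            return self._cache[word]
        self.misses += 1
        value = self._index_lookup(word)
        if value is not None:
            self._cache[word] = value
            if len(self._cache) > self._cache_size:
                # Evict the least recently used word, so the cache
                # contents follow the vocabulary of the current topic.
                self._cache.popitem(last=False)
        return value
```

Because the cache is populated from the lookups themselves, no list of topic-specific words has to be fixed in advance; whether such a policy reaches the hit rate mentioned above depends on the cache size and the documents processed, and is not established by this sketch.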