1. Field of Art
The present invention generally relates to the field of computer data storage and retrieval, and more specifically, to efficiently representing a set of objects in an in-memory dictionary.
2. Background of the Invention
Computer systems often represent redundant data. For example, natural language strings in documents, names in user databases, and popular music files on multi-user file systems frequently recur in practice. As one concrete example, FACEBOOK's social network has over 350 million users, each of whom has two or more names. These individual names overlap from user to user, with popular names like “John” and “Smith” appearing millions of times. Performance requirements dictate that these names reside in main memory for fast access, which requires very large amounts (e.g., gigabytes) of costly RAM.
To economize storage space for such repetitive data, computer systems often compress redundant data. Compression systems compromise among three competing goals: the compactness of the representation of a datum, the speed with which an original datum can be recovered from its compressed form, and the speed with which a new datum can be integrated into the compressed set.
A lookup table (e.g., a dictionary) is one type of compression technique. A lookup table records each object's value in some form of vector, such as an array, and uses an integer storing the object's index offset into the vector as a short identifier. For the majority of the system's data, the object can then be represented by the integer, thereby achieving a compression ratio for that object of

$$\frac{l_{int}}{l_{obj}} + \frac{1}{N}$$

where $l_{int}$ is the length of the integer in bytes, $l_{obj}$ is the length of the object data in bytes, and $N$ is the number of occurrences of the object in the system data.

Where the object is sufficiently large, or where there is a sufficient number of repeated occurrences of objects, the lookup table provides a high degree of compression and enables rapid recovery of the object value, given the integer representing its associated offset into the vector. However, adding a new object to the lookup table is computationally expensive if a linear scan of the table is performed. Conventional solutions address this problem by employing auxiliary data structures to map object values to locations in the lookup table but do so at the expense of reducing the achievable degree of compression. These conventional solutions additionally require multiprocessor synchronization with respect to the auxiliary data structure, which increases the length of time required to obtain the original object from the lookup table.
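By way of illustration only, the lookup table described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `LookupTable`, `add`, `get`, and `compression_ratio` are hypothetical and chosen for this example, and the table uses the naive linear scan whose cost is noted above, with no auxiliary data structure.

```python
def compression_ratio(l_int: int, l_obj: int, n: int) -> float:
    """Compressed size divided by uncompressed size for one object.

    The object is stored once (l_obj bytes) and each of its n occurrences
    is replaced by an integer (l_int bytes), so the ratio is
    (l_obj + n * l_int) / (n * l_obj) = l_int / l_obj + 1 / n.
    """
    return l_int / l_obj + 1 / n


class LookupTable:
    """Hypothetical lookup table: a vector of values indexed by integer offset."""

    def __init__(self):
        self.values = []  # the vector holding each distinct object once

    def add(self, value):
        """Return the offset of value, appending it if it is not yet present.

        Uses a linear scan, so each call costs time proportional to the
        number of distinct objects already in the table.
        """
        for index, existing in enumerate(self.values):
            if existing == value:
                return index
        self.values.append(value)
        return len(self.values) - 1

    def get(self, index):
        """Recover the original object from its integer identifier."""
        return self.values[index]


# Example data (illustrative only): repeated names replaced by short integers.
names = ["John", "Smith", "John", "Ann", "Smith", "John"]
table = LookupTable()
ids = [table.add(name) for name in names]
recovered = [table.get(i) for i in ids]
```

In this sketch, `ids` holds one small integer per occurrence while `table.values` holds each distinct name once, and `get` recovers a name in constant time. As a worked instance of the ratio, a 4-byte integer standing in for a 16-byte object that occurs 4 times gives `compression_ratio(4, 16, 4) == 0.5`, that is, 32 bytes stored in place of 64.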