It is well known that there are considerable challenges in dealing with large databases, even where memory and processing power have become relatively cheap. In the world of data warehousing, business intelligence, and computer-based business planning systems, the ever-growing size of the data continues to challenge the computing resources available to users of desktop terminals or computers, even where these machines are supported by large and complex server farms as in a more typical client-server environment. Means are continually being sought to reduce the computing requirements, particularly for the client machine, in terms of both memory and processing power, so that the available resources can be used effectively.
In a typical situation today, each application software package (and sometimes each user) must be provided with access to an individual copy of the database and its associated meta-data and business rules. Meta-data is the data that describes the data within a data warehouse. Business rules are used to ensure consistency of data across a data warehouse. Although the size of the databases in question is often quoted in terms of megabytes, or even gigabytes, of storage, in the typical data warehouse application more useful metrics are the number of tables, keys and indices. At the time of writing, a typical limitation on the maximum size of data that can be quickly and easily accommodated on PCs relates to the maximum (virtual) memory address size of 1 Gbyte for Win 98. Newer operating systems (OS) can provide effective memory sizes in excess of this, effectively removing this as a constraint. Nonetheless, even with the availability of large memory machines, there always remain limitations in terms of cost-effectiveness. It therefore becomes critical in large corporate environments that applications share as much data as practicable. As mentioned earlier, it is the growth in the number and size of tables and their indices that is becoming the more important factor. In the environments using data warehouses, the number of tables is usually considered large when it exceeds 2000. A common size is around 100 tables, whereas in exceptionally large cases 20,000 tables are defined. With more and more applications sharing access to a data warehouse, the ability to share the relatively static data contained in such tables has become increasingly important.
The sharing of memory between several users (and sometimes also between applications) has been common for many years, and the approach typically has been to map the data from the disk into random access memory. Where the data contains internal references a complication arises, since these references must be resolved by the application(s) at runtime. This is commonly done using various lookup techniques. Indexing and caching techniques may be used to make this access faster. However, these techniques require additional resources at runtime to access the information from the persisted data (i.e. the data that has been loaded into shared memory). Often the resources needed to access this information are not sharable and are required on a per-application, per-process or per-user basis.
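The lookup approach described above might be sketched as follows. This is a minimal illustration only, not taken from any particular system: the record layout (a record identifier followed by the identifier of a parent record) and the function names are assumptions made for the example. The point it illustrates is that the byte image can be shared, whereas the index used to resolve references is rebuilt privately by each process.

```python
import struct

# Hypothetical record layout: (record_id, parent_id), both unsigned 32-bit.
RECORD = struct.Struct("<II")

def build_image(rows):
    """Pack (record_id, parent_id) rows into one read-only byte image."""
    return b"".join(RECORD.pack(rid, parent) for rid, parent in rows)

def build_index(image):
    """Per-process index: record_id -> byte offset within the shared image."""
    return {
        RECORD.unpack_from(image, off)[0]: off
        for off in range(0, len(image), RECORD.size)
    }

def parent_offset(image, index, record_id):
    """Resolve the internal reference at runtime by key lookup.

    Returns the byte offset of the parent record, or None if the
    parent identifier does not name a record in the image.
    """
    _, parent_id = RECORD.unpack_from(image, index[record_id])
    return index.get(parent_id)

# The image stands in for data mapped into shared memory.
shared_image = build_image([(10, 0), (20, 10), (30, 20)])

# Each "process" has to build and hold its own copy of the index.
index_a = build_index(shared_image)
index_b = build_index(shared_image)
```

Here the two indices hold identical content but are separate objects, which is the per-process runtime cost referred to above.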
Another, more sophisticated, approach is to store the references in a form similar to the pointers to data structures that applications typically use to refer to dynamically allocated memory. Using this approach, indices, hashes and other access structures can be stored as part of the data. The technique is similar to those described above, but differs in that the pointers are persisted in a file on disk, which is then mapped into memory. Usually the persisted pointers will not point to the same data when the persisted data is loaded by another application later on, since the location at which the file is mapped typically differs from the location used by the application that originally accessed the data. In addition, if two or more applications load the same file into shared memory, each application usually maps the file to a different location in its own address space. The approach frequently taken to prepare this data for use by specific applications is to reformat the data and adjust (or correct) the various pointers held within the data, a technique known as ‘pointer swizzling’.
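The difficulty with persisted pointers can be illustrated with a small simulation. It is a sketch under stated assumptions: the node layout, the function names and the two base addresses are hypothetical, and a byte string stands in for the mapped file. A linked list is written with absolute addresses that are valid for one mapping address, and is then read as if mapped elsewhere.

```python
import struct

# Hypothetical node layout: (value, next_address), both unsigned 64-bit.
# next_address is an absolute address; 0 marks the end of the list.
NODE = struct.Struct("<QQ")

def persist_list(values, base):
    """Write a linked list as it would look when mapped at address `base`."""
    image = bytearray()
    for i, value in enumerate(values):
        is_last = i == len(values) - 1
        next_addr = 0 if is_last else base + (i + 1) * NODE.size
        image += NODE.pack(value, next_addr)
    return bytes(image)

def walk(image, base):
    """Follow the stored addresses, assuming the image is mapped at `base`."""
    values, addr = [], base
    while addr != 0:
        offset = addr - base
        if not 0 <= offset <= len(image) - NODE.size:
            raise ValueError(f"address {addr:#x} is outside this mapping")
        value, addr = NODE.unpack_from(image, offset)
        values.append(value)
    return values

WRITER_BASE = 0x10000000  # assumed mapping address in the writing process
READER_BASE = 0x7F000000  # assumed mapping address in a later process

image = persist_list([7, 8, 9], WRITER_BASE)
```

Calling `walk(image, WRITER_BASE)` traverses the whole list, whereas `walk(image, READER_BASE)` raises `ValueError` on the first stored pointer, because that pointer still refers to the writer's mapping.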
Swizzling changes any persisted data references into process-specific data references, thereby limiting the sharing of such data to processes expecting the same, or a very similar, data schema.
Typically, pointers are translated (“swizzled”) from a value in the persisted format to a process-specific value. No extra hardware is required, and no continual software overhead is incurred by presence checks or indirection of pointers. However, the operation does need to be performed whenever all or part of the persisted data is being prepared for access, and each such swizzling operation requires considerable processing power and additional process-specific memory. In making the data more process-specific, designers have sacrificed the ability to share data, since a copy of the data is required per process. Although the sharing is limited, using shared memory under this scheme is still advantageous, since the operating system may make use of demand loading of the persisted data across multiple processes, and is thereby able to minimize duplicated I/O.
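A minimal sketch of such a translation is given below, assuming a hypothetical node layout of a value followed by one absolute address, and assuming the address at which the data was persisted is known. It is intended only to show where the two costs arise: every pointer is visited, and the result is a private copy.

```python
import struct

NODE = struct.Struct("<QQ")  # hypothetical layout: (value, next_address)

def swizzle(shared_image, persisted_base, process_base):
    """Return a private copy whose pointers are valid at `process_base`."""
    private = bytearray(shared_image)  # per-process copy: the memory cost
    delta = process_base - persisted_base
    # Every node is visited and rewritten: the processing cost.
    for offset in range(0, len(private), NODE.size):
        value, next_addr = NODE.unpack_from(private, offset)
        if next_addr != 0:  # 0 is treated as the null pointer and left alone
            NODE.pack_into(private, offset, value, next_addr + delta)
    return private

# Two nodes persisted as if mapped at 0x1000; the first points at the second.
shared = NODE.pack(1, 0x1010) + NODE.pack(2, 0)

copy_a = swizzle(shared, 0x1000, 0x9000)  # "process A" maps at 0x9000
copy_b = swizzle(shared, 0x1000, 0x5000)  # "process B" maps at 0x5000
```

After swizzling, `copy_a` and `copy_b` differ from each other and from `shared`, so neither copy can be handed to the other process.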
Typical systems employing swizzling do so on a page-by-page basis, each page being wholly converted when moved into process-specific memory from shared memory or disk. Usually the swizzling of pointers can be constrained to a small area of the application. This gives most of the application code the benefit of treating the persisted data as dynamically allocated data. Programming is simplified, since regular pointer references are used to get at the various pieces of data that came from a persisted file. However, the advantages to the applications may not outweigh the extra demand on memory, which in any case may be a limiting factor.
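Page-by-page conversion might look roughly like the following. The sketch makes several simplifying assumptions that are not drawn from any particular system: every 8-byte slot in the image holds an address, the page size is a deliberately tiny 32 bytes, and the class and method names are hypothetical. A page is converted in full the first time any slot within it is read, and the converted copy is kept in process-private storage.

```python
import struct

PTR = struct.Struct("<Q")  # hypothetical: each 8-byte slot holds one address
PAGE_SIZE = 32             # tiny page (4 slots) so the example stays readable

class PagedView:
    """Swizzles one whole page the first time any slot in it is read."""

    def __init__(self, shared_image, persisted_base, process_base):
        self.shared = shared_image
        self.delta = process_base - persisted_base
        self.pages = {}  # page number -> private, swizzled copy

    def _load_page(self, page_no):
        start = page_no * PAGE_SIZE
        page = bytearray(self.shared[start:start + PAGE_SIZE])
        for off in range(0, len(page), PTR.size):
            (addr,) = PTR.unpack_from(page, off)
            if addr != 0:  # 0 is treated as the null pointer
                PTR.pack_into(page, off, addr + self.delta)
        return page

    def read_pointer(self, slot):
        page_no, off = divmod(slot * PTR.size, PAGE_SIZE)
        if page_no not in self.pages:
            self.pages[page_no] = self._load_page(page_no)
        return PTR.unpack_from(self.pages[page_no], off)[0]

# Eight slots (two pages), each pointing at itself as persisted at 0x1000.
shared = b"".join(PTR.pack(0x1000 + PTR.size * i) for i in range(8))
view = PagedView(shared, persisted_base=0x1000, process_base=0x4000)
```

Only `read_pointer` knows about the conversion; callers receive ordinary process-specific addresses, which is the sense in which the swizzling is confined to a small area of the application. The private `pages` dictionary is where the extra memory demand appears.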
In summary, pointer swizzling has the effect that the resultant databases are somewhat customized for particular applications, and therefore such databases do not lend themselves to being shared easily between different applications. Further, extra memory is needed, since each process requires its own copy of the data.