Consumers use their computers primarily for communication and organizing personal information, whether it is traditional personal information manager (PIM) style data or media such as digital music or photographs. The amount of digital content, and the ability to store the raw bytes, has increased tremendously; however, the methods available to consumers for organizing and unifying this data has not kept pace. Knowledge workers spend enormous amounts of time managing and sharing information, and some studies estimate that knowledge workers spend 15-25% of their time on non-productive information related activities. Other studies estimate that a typical knowledge worker spends about 2.5 hours per day searching for information.
Traditional approaches to the organization of information in computer systems have centered on the use of file-folder-and-directory-based systems (“file systems”) to organize files into directory hierarchies of folders based on an abstraction of the physical organization of the storage medium used to store the files. The Multics operating system, developed during the 1960s, can be credited with pioneering the use of the files, folders, and directories to manage storable units of data at the operating system level. Specifically, Multics used symbolic addresses within a hierarchy of files (thereby introducing the idea of a file path) where physical addresses of the files were not transparent to the user (applications and end-users). This file system was entirely unconcerned with the file format of any individual file, and the relationships amongst and between files were deemed irrelevant at the operating system level (that is, other than the location of the file within the hierarchy). Since the advent of Multics, storable data has been organized into files, folders, and directories at the operating system level. These files generally include the file hierarchy itself (the “directory”) embodied in a special file maintained by the file system. This directory, in turn, maintains a list of entries corresponding to all of the other files in the directory and the nodal location of such files in the hierarchy (herein referred to as the folders). Such has been the state of the art for approximately forty years.
However, while providing a reasonable representation of information residing in the computer's physical storage system, a file system is nevertheless an abstraction of that physical storage system, and therefore utilization of the files requires a level of indirection (interpretation) between what the user manipulates (units having context, features, and relationships to other units) and what the operating system provides (files, folders, and directories). Consequently, users (applications and/or end-users) have no choice but to force units of information into a file system structure even when doing so is inefficient, inconsistent, or otherwise undesirable. Because most existing file systems utilize a nested folder metaphor for organizing files and folders, as the number of files increases, the effort necessary to maintain an organization scheme that is flexible and efficient becomes quite daunting.
Several unsuccessful attempts to address the shortcomings of file systems have been made in the past. Some of these previous attempts have involved the use of content addressable memory to provide a mechanism whereby data could be accessed by content rather than by physical address. However, these efforts have proven unsuccessful because, while content addressable memory has proven useful for small-scale use by devices such as caches and memory management units, large-scale use for devices such as physical storage media has not yet been possible for a variety of reasons, and thus such a solution simply does not exist. Other attempts using object-oriented database (OODB) systems have been made, but these attempts, while featuring strong database characteristics and good non-file representations, were not effective in handling file representations and could not replicate the speed, efficiency, and simplicity of the file and folder based hierarchical structure at the hardware/software interface system level.
Newly developed storage systems, such as “WinFS” (described further below) store the directory of the files as table(s) in a database. Each file is represented by a row in a base table, and file system operations, such as “enumerate all files in a directory”, are satisfied using queries against the database engine. Thus, efficiently performing basic operations against the store become operations of efficiently optimizing database queries.
In such storage systems, the concept of a file is extended to that of an “object”. Metadata about the file is stored in a managed CLR (common language runtime) object with a schema (defined in the storage system) to represent the allowable descriptive data for that object. For example, a picture would have a representative CLR object that would store data such as its resolution, time it was taken, and location information. This object model supports data inheritance. With data inheritance, it is possible to derive a type from another and add new fields. For example, a sub-class of the picture could be created, such as “DriversLicensePicture”. Such a sub-class would contain extra information, such as a Driver's License ID field.
In these newly developed storage systems, such as WinFS, the exposed schemas are mapped to tables through a translation layer. Users only see a series of views of the data instead of operating on the base tables. While the exact design of this mapping is not significant, it serves as the glue between the WinFS API and the underlying storage format. Users do not control or see this mapping directly.
The WinFS Store also exposes the concept of querying objects based on their type, as opposed to their file name as in earlier conventional file systems. Type-based queries can search for an exact type or any type that derives from a given type. This latter form is called hierarchical matching, and it is expected to be a common WinFS operation. WinFS also supports searching by file.
WinFS's schema model poses some new challenges to the query processor. User-defined types, or UDTs, are used extensively, and it is common to retrieve all UDTs from a table based on the UDT type. Furthermore, WinFS uses UDT inheritance, and it is also a requirement to retrieve all elements of a given type and also any subtype from a table. Multiple tables exist, each containing a different number of UDTs, types, type topology, and UDT distribution within that topology. Additionally, searching operations may go beyond those operations seen in traditional relational database systems to include, for example, searching of XML documents or performing searches over all fields in an object. These properties make it difficult to make accurate cardinality and cost estimates, and also make it difficult to efficiently retrieve values based on type/subtype hierarchy.
Materialized views (also referred to herein as indexed views) have been a subject of database research for over a decade. The basic idea is to materialize, or store, the result of some query, then use such computed result when similar queries are submitted to the database. For example, it may be desirable to store the result of sales per day, for example, and use the result (this materialized view) in the future to answer several related queries, such as sales in a given month or total sales in the year.
For additional flexibility, applications should not need to be aware that certain views exist, or are materialized. The query processor should identify matches between user queries and existing pre-computed results (materialized views), and use such results when applicable. This is known as the view utilization problem: Given a user query written over base tables, as well as a collection of materialized views, which materialized views can be used to answer such query? And the cost-based variant of the question: Which of those materialized views should be used?
Materialized views are similar to indices, in that they should be part of the physical design of the database and their primary purpose is to improve performance. The logical design of the database, and correctness of applications should be independent of the presence or absence of materialized views. As with indices, materialized views can introduce dramatic improvements in query performance.
Query optimizers are normally structured such that there is an initial simplification stage, followed by exploration of alternatives and cost-based selection of an execution plan, as shown in FIG. 1.
During the simplification/normalization stage 2, some changes are made on the original query Q, such as pushing selections down, or rewriting a subquery as a join, when possible. These modifications are aimed at obtaining a “better” query. Typically, there is no detailed cost estimation at this stage, and a single “better” query Q′ is produced as the result.
The second stage 5 (exploration and cost-based selection) in optimization is directed to generating multiple alternatives, and using a detailed cost model to select the alternative with the cheapest estimated execution cost. Two conventional architectures for the exploration stage are bottom-up, dynamic programming join enumeration, and transformation-driven generation of alternatives. Both architectures set up a table of alternatives, as is well known, which compactly encodes the various possibilities for each sub-expression of a query.
Considering materialized views during exploration comprises augmenting the table of alternatives with entries that use such materialized views. Suppose the original query is a join on tables A, B, C. The typical alternatives (with logical operators only) is given as follows:
ABC: AB   C, AC   B, BC   A, C   AB, B   AC, A   BCAB: A   B, B   ABC: B   C, C   BAC: A   C, C   A
Encoded operator trees are obtained by traversing a table of the alternatives, starting from the root entry (ABC in the query above), and choosing an operator from each entry. For example, by taking the first choice in each entry, the operator tree 10 shown in FIG. 2 is obtained.
Now suppose there is a materialized view V=AB. This means there is a stored table, referred to as Vt, which contains the result of the join of A and B. Because this is a valid way to obtain the join sub-expression, the alternatives is augmented with this alternative, to become:
ABC: AB   C, AC   B, BC   A, C   AB, B   AC, A   BCAB: A   B, B   A, VtBC: B   C, C   BAC: A   C, C   AA valid operator tree 13 that can be generated and considered by the optimizer is shown in FIG. 3.
The mechanism to augment the table of alternatives depends on the optimizer architecture. In the case of a transformation-based optimizer, the extension is obtained by adding a new transformation rule to the system; for bottom-up join enumeration, the construction procedure should be changed. After the alternative is added to the table, the normal optimizer mechanisms to estimate cost, prune out expensive solutions, assemble operator trees, and construct optimal solutions are applied.
To guarantee transactional correctness, the contents of a materialized view must be kept in synch with respect to changes in base tables. For example, when orders are entered, or modified, the materialization of sales per week has to be updated to reflect the changes. This is known as the view maintenance problem. In other words, if underlying data in a base table changes, then the change must be made in the materialized view. It is desirable that the entire materialized view does not need to be re-computed to reflect such a change, because of the re-computation time and expense.
In view of the foregoing deficiencies in existing data storage and database technologies, there is a need for efficient uses of materialized views. The present invention satisfies these needs.