Traditional approaches to the organization of information in computer systems have centered on the use of file-folder-and-directory-based systems (“file systems”) to organize files into directory hierarchies of folders based on an abstraction of the physical organization of the storage medium used to store the files. Typically, storable data has been organized into files, folders, and directories at the operating system level. These files generally include the file hierarchy itself (the “directory”) embodied in a special file maintained by the file system. This directory, in turn, maintains a list of entries corresponding to all of the other files in the directory and the nodal location of such files in the hierarchy (herein referred to as the folders). Such has been the state of the art for approximately forty years.
However, while providing a reasonable representation of information residing in the computer's physical storage system, a file system is nevertheless an abstraction of that physical storage system, and therefore utilization of the files requires a level of interpretation between what the user manipulates (units having context, features, and relationships to other units) and what the operating system provides (files, folders, and directories). Consequently, users (applications and/or end-users) have no choice but to force units of information into a file system structure even when doing so is inefficient, inconsistent, or otherwise undesirable. Because most existing file systems utilize a nested folder metaphor for organizing files and folders, as the number of files increases, the effort necessary to maintain an organization scheme that is flexible and efficient becomes quite daunting.
Several unsuccessful attempts to address the shortcomings of file systems have been made in the past. Some of these previous attempts have involved the use of content addressable memory to provide a mechanism whereby data could be accessed by content rather than by physical address. However, these efforts have proven unsuccessful because, while content addressable memory has proven useful for small-scale use by devices such as caches and memory management units, large-scale use for devices such as physical storage media has not yet been possible for a variety of reasons, and thus such a solution simply does not exist. Other attempts using object-oriented database (OODB) systems have been made, but these attempts, while featuring strong database characteristics and good non-file representations, were not effective in handling file representations and could not replicate the speed, efficiency, and simplicity of the file-and-folder-based hierarchical structure at the hardware/software interface system level.
Newly developed storage systems, such as “WinFS” (described further below) store the directory of the files as table(s) in a database. Each file is represented by a row in a base table, and file system operations, such as “enumerate all files in a directory”, are satisfied using queries against the database engine. Thus, efficiently performing basic operations against the store involves efficiently optimizing database queries.
In such storage systems, the concept of a file is extended to that of an “object”. Metadata about the file is stored in a managed CLR (common language runtime) object with a schema (defined in the storage system) to represent the allowable descriptive data for that object. For example, a picture would have a representative CLR object that would store data such as its resolution, time it was taken, and location information. This object model supports data inheritance. With data inheritance, it is possible to derive a type from another and add new fields. For example, a sub-class of the picture could be created, such as “DriversLicensePicture”. Such a sub-class would contain extra information, such as a Driver's License ID field.
In these newly developed storage systems, such as WinFS, the exposed schemas are mapped to tables through a translation layer. Users only see a series of views of the data instead of operating on the base tables. While the exact design of this mapping is not significant, it serves as the glue between the WinFS API and the underlying storage format. Users do not control or see this mapping directly.
The WinFS store also exposes the concept of querying objects based on their type, as opposed to their file name as in earlier conventional file systems. Type-based queries can search for an exact type or any type that derives from a given type. This latter form is called hierarchical matching, and it is expected to be a common WinFS operation. WinFS also supports searching by file.
WinFS's schema model poses some new challenges to the query processor. User-defined types, or UDTs, are used extensively, and it is common to retrieve all UDTs from a table based on the UDT type. Furthermore, WinFS uses UDT inheritance, and it is also a requirement to retrieve all elements of a given type and also any subtype from a table. Multiple tables exist, each containing a different number of UDTs, types, type topology, and UDT distribution within that topology. Additionally, searching operations may go beyond those operations seen in traditional relational database systems to include, for example, searching of XML documents or performing searches over all fields in an object. These properties make it difficult to make accurate cardinality and cost estimates, and also make it difficult to efficiently retrieve values based on type/subtype hierarchy.
Materialized views (also referred to herein as indexed views) have been a subject of database research for over a decade. The basic idea is to materialize, or store, the result of some query, then use such computed result when similar queries are submitted to the database. For example, it may be desirable to store the result of sales per day, for example, and use the result (this materialized view) in the future to answer several related queries, such as sales in a given month or total sales in the year.
For additional flexibility, applications should not need to be aware that certain views exist, or are materialized. The query processor should identify matches between user queries and existing pre-computed results (materialized views), and use such results when applicable. This is known as the view utilization problem: Given a user query written over base tables, as well as a collection of materialized views, which materialized views can be used to answer such query? And the cost-based variant of the question: Which of those materialized views should be used?
Materialized views are similar to indices, in that they should be part of the physical design of the database and their primary purpose is to improve performance. The logical design of the database, and correctness of applications should be independent of the presence or absence of materialized views. As with indices, materialized views can introduce dramatic improvements in query performance.
Query optimizers are normally structured such that there is an initial simplification stage, followed by exploration of alternatives and cost-based selection of an execution plan, as shown in FIG. 1.
During the simplification/normalization stage 2, some changes are made on the original query Q, such as pushing selections down, or rewriting a subquery as a join, when possible. These modifications are aimed at obtaining a “better” query. Typically, there is no detailed cost estimation at this stage, and a single “better” query Q′ is produced as the result.
The second stage 5 (exploration and cost-based selection) in optimization is directed to generating multiple alternatives, and using a detailed cost model to select the alternative with the cheapest estimated execution cost. Two conventional architectures for the exploration stage are bottom-up, dynamic programming join enumeration, and transformation-driven generation of alternatives. Both architectures set up a table of alternatives, as is well known, which compactly encodes the various possibilities for each sub-expression of a query.
The SQL query language provides a user with the ability to query (and modify) tabular data stored using the relational data model. The relational data model dictates that each cell in a table (a column of a row) is a single scalar (or atomic) value. The structured query language (SQL) is an American National Standards Institute (ANSI) standard used to communicate with a relational database. SQL statements are used to perform tasks such as update data or retrieve data from a relational database. Although many database systems use SQL, many of them also have their own additional proprietary extensions that are usually only used on their system. However, the standard SQL commands such as “Select”, “Insert”, “Update”, “Delete”, “Create”, and “Drop” can be used to accomplish many things that one desires to do with a relational database and are believed to be well known to those skilled in the database art.
The SQL Server database management system (DBMS) supports secondary indexes and indexed views built over tables, to speed up certain kinds of data retrieval operations. A secondary index is a data structure organized like a tree, which contains a subset of the columns belonging to the table. An indexed view is the pre-computed result of a query stored into a derived table, which can have secondary indexes as well. Secondary indexes and indexed views are referred to as data structures dependent on the table. Other kinds of dependent objects exist, such as constraints.
When an Insert, Update, or Delete statement is processed, the DBMS has to propagate the change from the table to the dependent structures, so that they are always kept consistent which each other. This allows subsequent data retrieval queries to return the same data independently on whether they access the table, a secondary index, or an indexed view. Thus, whenever modifications are made to the table, they must be made to the secondary indexes and indexed views, as well. The cost of making changes is proportional to the number of places where the change has to be made. While Insert and Delete operations need to be propagated to all the secondary indexes and indexed views, Update statement processing is conventionally optimized to propagate the change only to the data structures that carry columns being modified by the statement. For example, if a secondary index only contains the column C1 of a table T, an update to the C2 column will not be propagated to the index, because its content will be unchanged.
Both secondary indexes and indexed views can be built in SQL Server over both regular table columns and scalar expressions over table columns. For example, it is possible to build a secondary index over the sum of the C1 and C2 columns. Conventionally, whenever an Update statement affects one or more of the columns participating in a scalar expression, the expression will be assumed to be changing, and the change will be propagated to all the dependent data structures that carry this expression.
Support for complex data types has been added to SQL Server through UDTs. The syntax of the SQL Update statement allows granular updates to subfields of a UDT, along with the ability to overwrite an entire column with a new value. However, by using the traditional technique described above for scalar expressions, a granular update to a UDT would be assumed to affect all the data structures dependent on any of its subfields, and thus changes are propagated to dependent data structures whose content might not be modified by the update. This technique is inefficient and expensive. In other words, conventionally, granular updates to UDT columns are assumed to affect all its subfields, thereby degrading performance in the presence of many dependent access paths.
For example, in the prior art, for an Update, it is assumed that if any of the terms in the expression are being changed, then the entire expression is being changed. For example, in the computed column PERSON.ADDRESS.ZIP, when any of PERSON's subfields are updated, it is assumed that the address and zip also change, so the computed column is always updated. This is a very conservative approach, and not desirable.
In view of the foregoing, there is a need for systems and methods that overcome the limitations and drawbacks of the prior art.