The present invention relates to database systems, and in particular, to techniques for directly loading data into a database.
Structured data often conforms to a type definition. For example, a type definition for a “person” type may define distinct attributes such as “name,” “birthdate,” “height,” “weight,” and “gender.” Each “instance” of a particular type comprises a separate value for each of the attributes defined by the particular type. For example, an instance of the “person” type might comprise values such as “Fred Brown,” “Jan. 1, 1980,” “72 inches,” “240 pounds,” and “male.” Each attribute is also of a type. For example, the “name” attribute might be of a “string” type, the “birthdate” attribute might be of “date” type, and the “gender” attribute might be of an “enumerated” type. Structured data might comprise multiple different instances of the same type.
Different approaches may be used to store structured data into a database. One such approach is called “conventional path loading.” According to conventional path loading, a client application parses structured data that comprises one or more instances of a type. Values within the structured data correspond to attributes of the type. The client application generates Structured Query Language (SQL) commands, such as INSERT commands, that, when executed by a database server, cause the database server to insert the values into corresponding columns of a database table. Unfortunately, due to its heavy use of the SQL engine, conventional path loading often suffers in terms of performance and memory consumption.
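As an illustration only, conventional path loading can be sketched as a client that parses instances and issues one INSERT command per instance through the SQL engine. In this sketch, Python's built-in sqlite3 module stands in for the database server, and the "person" table, its column names, and the sample instances are assumptions made for the example rather than anything specified above.

```python
import sqlite3

# Hypothetical structured data: each tuple is one instance of the "person" type.
PERSON_INSTANCES = [
    ("Fred Brown", "1980-01-01", 72, 240, "male"),
    ("Ann Lee", "1975-06-15", 65, 130, "female"),
]


def conventional_path_load(conn, instances):
    """Insert each instance through the SQL engine, one INSERT per instance."""
    # Hypothetical table whose columns correspond to the type's attributes.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person "
        "(name TEXT, birthdate TEXT, height_in INTEGER, "
        "weight_lb INTEGER, gender TEXT)"
    )
    for instance in instances:
        # Every value passes through SQL parsing and execution.
        conn.execute("INSERT INTO person VALUES (?, ?, ?, ?, ?)", instance)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = conventional_path_load(conn, PERSON_INSTANCES)
```

Because every value passes through the SQL engine, the per-row overhead grows with the amount of data loaded, which is the cost that the paragraph above attributes to this approach.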
Another approach for storing structured data into a database is called “direct path loading.” Through direct path loading, values within structured data are stored directly into a database without going through the SQL engine. By consulting a control file that is associated with the structured data, a client application can determine the types to which instances within the structured data conform. If the structures of the types are defined to the client application, then, based on those structures, the client application can create an array that comprises columns that correspond to the types' attributes. The client application can populate each attribute's corresponding column with values that correspond to that attribute. Once the array is populated, the client application can convert the array into a stream, which the database server can directly convert into the database's data blocks. Direct path loading exhibits performance superior to that of conventional path loading.
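The column array and stream conversion described above can be sketched roughly as follows. The type definition, the fixed-width binary layout, and the function names are hypothetical. A real database server would define its own stream and data block formats, which this sketch does not attempt to reproduce.

```python
import struct

# Hypothetical type definition known to the client application:
# (attribute name, struct format code) for each attribute of the type.
PERSON_TYPE = [("name", "20s"), ("height_in", "H"), ("weight_lb", "H")]


def build_column_array(type_def, instances):
    """Create one column per attribute and populate it from the instances."""
    columns = {attr: [] for attr, _ in type_def}
    for instance in instances:
        for attr, _ in type_def:
            columns[attr].append(instance[attr])
    return columns


def column_array_to_stream(type_def, columns):
    """Serialize the populated column array row by row into a byte stream."""
    row_format = "<" + "".join(fmt for _, fmt in type_def)
    row_count = len(columns[type_def[0][0]])
    stream = bytearray()
    for i in range(row_count):
        values = []
        for attr, fmt in type_def:
            value = columns[attr][i]
            if fmt.endswith("s"):
                # String attributes are encoded; struct pads them with NULs.
                value = value.encode("utf-8")
            values.append(value)
        stream += struct.pack(row_format, *values)
    return bytes(stream)


instances = [
    {"name": "Fred Brown", "height_in": 72, "weight_lb": 240},
    {"name": "Ann Lee", "height_in": 65, "weight_lb": 130},
]
columns = build_column_array(PERSON_TYPE, instances)
stream = column_array_to_stream(PERSON_TYPE, columns)
```

The sketch works only because the client knows the number and types of the attributes in advance. That is the precondition that fails for types whose structure is not defined to the client application.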
Some types indicated by a control file may be standard types that are defined to a client application. A scalar type is an example of such a standard type. The client application has information about the characteristics of a scalar type, such as the maximum storage size of the scalar type. With this information, the client application can generate the data stream as described above.
However, some types indicated by a control file might not be among the types that are defined to the client application. A type indicated by a control file might have a structure that is defined only to a program that implements that type. Although the type might comprise attributes that are of standard types, the control file and the client application might lack any information about the number or types of such attributes.
Without such information, the client application cannot generate or populate an array that comprises a separate column for each such attribute. The client application does not possess sufficient information to map values that correspond to such attributes to corresponding columns of a table in a relational database. Consequently, there is no effective way for the client application to store instances of such a type in a database using the direct path loading approach.
Types that are not defined to a client application are called “opaque types” relative to the client application, because the internal structure of such types is obscured from the client application. The internal structure of an opaque type, including the number and types of attributes of the opaque type, is often defined only to a program that implements the opaque type. Such a program may be external to both the client application and the database server.
It may not be practical to modify a client application every time that a new type is introduced, so that the new type is defined to the client application. Additionally, the structures of some existing types may change as time passes. It may be impractical to modify a client application every time that the structure of an existing type changes.
One kind of opaque type is an XML type. An example of an XML type is provided in co-pending U.S. Pat. No. 7,096,224. “XML” stands for “Extensible Markup Language.” An XML schema is metadata that describes a hierarchical structure. Instances of the XML schema comprise data that conforms to the structure described by the XML schema. Through XML elements expressed in the structure, an XML schema defines one or more types. XML elements in such a structure may be mapped to columns of database tables. Using the conventional path loading approach, values that correspond to the XML elements may be stored in the columns that are mapped to those XML elements.
An XML type is special because an XML type may define alternative structures to which instances of the XML type may conform. For example, an XML type definition might indicate that one or more attributes of the XML type are optional. Therefore, if attributes “A,” “B,” and “C” are optional, then one instance of the XML type might comprise a value for attribute “A,” but no values for attributes “B” or “C,” while another instance of the XML type might comprise a value for attribute “B,” but no values for attributes “A” or “C.” Because the instances may conform to alternative defined structures rather than a single defined structure, the instances may be said to comprise “semistructured” data rather than “structured” data.
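The optional-attribute example above can be pictured with a small sketch. The class name, the field names, and the helper function are hypothetical and serve only to show two instances of one type that populate different attributes.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class XmlTypeInstance:
    """Hypothetical type whose attributes A, B, and C are all optional."""
    a: Optional[str] = None
    b: Optional[str] = None
    c: Optional[str] = None


def present_attributes(instance):
    """Return the names of the attributes for which the instance has a value."""
    return [f.name for f in fields(instance)
            if getattr(instance, f.name) is not None]


# Two instances of the same type that conform to different structures.
first = XmlTypeInstance(a="value for A")
second = XmlTypeInstance(b="value for B")
```

Here `present_attributes(first)` and `present_attributes(second)` differ even though both objects are instances of the same type, so a loader cannot assume a single fixed set of populated columns.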
Related application Ser. No. 10/648,577 describes an approach for efficiently performing direct path loading to store opaque data. Related application Ser. No. 10/648,600 describes an approach for efficiently performing direct path loading to store semistructured data.
Described herein are a method and mechanism for efficiently loading data into a database using any protocol or client. Examples of such protocols include the File Transfer Protocol (FTP) and the Hypertext Transfer Protocol (HTTP). In one embodiment, disclosed is a method and system for storing data into a database, in which a determination is made as to whether the schema metadata that is used to load the data into the database already exists. If the schema metadata already exists, then the existing schema metadata is used to load the data into the database. If the appropriate schema metadata does not exist, then it is generated and cached so that a later load operation for the same schema type will not need to regenerate this information. In this way, the cost of generating the schema metadata is amortized over multiple operations that load data of the same schema type. The approach is protocol neutral, so that loads based on multiple different protocols can operate with the same schema metadata information and load structures.
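A minimal sketch of the check-then-generate-and-cache behavior, assuming an in-memory dictionary keyed by schema type, might look like the following. The class, the function names, the metadata contents, and the "purchase_order" schema type are all hypothetical and are not taken from the embodiments.

```python
class SchemaMetadataCache:
    """Caches schema metadata so that it is generated once per schema type."""

    def __init__(self, generate):
        self._generate = generate   # callable that builds metadata for a type
        self._cache = {}            # schema type -> cached metadata
        self.generation_count = 0   # how many times metadata was generated

    def get(self, schema_type):
        # Determine whether the schema metadata already exists.
        if schema_type not in self._cache:
            # It does not, so generate and cache it for later loads.
            self._cache[schema_type] = self._generate(schema_type)
            self.generation_count += 1
        return self._cache[schema_type]


def generate_metadata(schema_type):
    # Hypothetical metadata; a real system would derive it from the schema.
    return {"schema_type": schema_type, "columns": ["col_a", "col_b"]}


def load(cache, protocol, schema_type, rows):
    """Load rows using cached metadata, regardless of the protocol used."""
    metadata = cache.get(schema_type)
    return {"protocol": protocol,
            "columns": metadata["columns"],
            "rows_loaded": len(rows)}


cache = SchemaMetadataCache(generate_metadata)
ftp_result = load(cache, "FTP", "purchase_order", [("x", "y")])
http_result = load(cache, "HTTP", "purchase_order", [("p", "q"), ("r", "s")])
```

In this sketch the FTP-based load pays the generation cost and the later HTTP-based load reuses the cached entry, which illustrates both the amortization and the protocol neutrality described above.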
Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims.