The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
The amount and kinds of data available electronically have increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured data in relational database systems. Often, unstructured data is in the form of digitized sounds or visual images. Semi-structured data is (from a particular viewpoint) neither unstructured nor strictly structured. Some of the data has structure, even if the structure is often implicit and not as rigid or regular as that found in standard database systems. An example is data that is in neither a table-oriented format, as in a relational model, nor a sorted-graph format, as in an object-relational database.
The benefit of storing data in a database is that the database provides support for storing large amounts of information and for accessing data in a very flexible and efficient manner. In contrast to the structured content of the relational database, the majority of unstructured or semi-structured content is stored in a file system. Unfortunately, the file system does not support complex data retrieval. For example, the contents of every directory may have to be inspected to retrieve all the documents created on a particular day.
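The directory-inspection problem can be illustrated with a minimal Python sketch. The function name `files_modified_on` is hypothetical, and the last-modified time is used as a stand-in for the creation time, because a portable creation timestamp is not available on every file system. Under these assumptions, every directory below the root has to be visited and every file examined:

```python
import os
from datetime import date, datetime


def files_modified_on(root, day):
    """Return the sorted paths under root whose last-modified date equals day.

    Hypothetical helper: there is no index to consult, so the whole
    directory tree is walked and every file is inspected.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: modification time stands in for creation time.
            mtime = datetime.fromtimestamp(os.stat(path).st_mtime)
            if mtime.date() == day:
                matches.append(path)
    return sorted(matches)
```

The cost of such a search grows with the total number of files in the tree, regardless of how few files actually match.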
A relational database system stores information in tables composed of rows and columns. Each row is identified by a unique RowID. Each column represents an attribute of a record, and each row represents a particular record. Data is retrieved from the database by submitting queries to a database management system. Structured Query Language (SQL) is an example of a database language supported by many existing database management systems. A database system can also receive data definition language statements that define tables, columns, and other database structures. In response to receiving such statements, the database system creates or modifies the database structures and generates data defining the structures.
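As a minimal sketch of these concepts, the following Python example uses the standard `sqlite3` module as a stand-in for a database management system. The table name `documents`, its columns, and the sample rows are illustrative assumptions and do not come from any particular system:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition language statement: defines a table and its columns.
conn.execute("CREATE TABLE documents (name TEXT, author TEXT, created TEXT)")

# Each inserted row is one record; each column holds one attribute of it.
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        ("report.doc", "alice", "2004-03-01"),
        ("photo.jpg", "bob", "2004-03-02"),
        ("notes.txt", "alice", "2004-03-01"),
    ],
)

# SQL query: SQLite exposes a per-row identifier named rowid.
rows = conn.execute(
    "SELECT rowid, name FROM documents WHERE created = ? ORDER BY name",
    ("2004-03-01",),
).fetchall()
print(rows)  # [(3, 'notes.txt'), (1, 'report.doc')]
```

Unlike the file system case, the query states what is wanted and leaves it to the database system to locate the matching rows.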
Even though unstructured or semi-structured content is stored in a file system, there is a need to store and manage the content's metadata so that it can be accessed and queried efficiently. Metadata describes and defines resources that hold unstructured or semi-structured content. File size, creation time, and author are all examples of metadata.
One approach for managing such metadata is the system-defined approach. In this approach, a database management system provides, in a repository, a well-defined set of system-defined metadata, also known as system properties, which the database management system is able to maintain as part of the infrastructure. For example, when inserting a row into the database, the application determines the creation time and inserts this value into the database. Some of the system-defined metadata is easily accessible and can be queried efficiently.
A problem with this approach is that a repository might be used by several different applications that might require their own custom metadata, herein referred to as user-defined metadata. For example, a user may wish to store additional information in the repository for an image, such as the image's resolution or its height and width.
This problem has been handled by storing user-defined metadata in some form of packed or chunked storage, such as a CLOB. CLOBs are data types that store large blocks of character data in the database, in-line or out-of-line. Both fixed-width and variable-width character sets are supported. Every CLOB variable stores a locator, which points to a large block of character data. The size of a CLOB cannot exceed four gigabytes. CLOBs are unstructured and do not provide efficient query and storage benefits. Furthermore, the only way to access particular information in a CLOB is to parse the entire package and search for the particular piece of data, which makes queries slow. Updates to this data are also slow, because the entire package has to be rewritten.
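The cost of packed storage can be sketched as follows. This is only an illustration under stated assumptions: a `TEXT` column in SQLite stands in for a CLOB, the `key=value;key=value` packing format is invented for the example, and the names `resources`, `user_meta`, `pack`, `unpack`, `find_by_property`, and `update_property` are hypothetical:

```python
import sqlite3


def pack(props):
    """Serialize a dict into one string (assumed format: key=value;key=value)."""
    return ";".join(f"{key}={value}" for key, value in props.items())


def unpack(package):
    """Parse the entire package back into a dict of strings."""
    return dict(item.split("=", 1) for item in package.split(";") if item)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (name TEXT, user_meta TEXT)")
conn.executemany(
    "INSERT INTO resources VALUES (?, ?)",
    [
        ("a.jpg", pack({"width": 640, "height": 480})),
        ("b.jpg", pack({"width": 1024, "height": 768})),
    ],
)


def find_by_property(conn, key, value):
    """Query one property: every row's package is fetched and fully parsed."""
    hits = []
    query = "SELECT name, user_meta FROM resources ORDER BY name"
    for name, package in conn.execute(query):
        if unpack(package).get(key) == str(value):
            hits.append(name)
    return hits


def update_property(conn, name, key, value):
    """Change one property: the whole package is read, rebuilt, and rewritten."""
    (package,) = conn.execute(
        "SELECT user_meta FROM resources WHERE name = ?", (name,)
    ).fetchone()
    props = unpack(package)
    props[key] = str(value)
    conn.execute(
        "UPDATE resources SET user_meta = ? WHERE name = ?", (pack(props), name)
    )
```

In this sketch, both the lookup and the update touch the complete package even though only a single property is of interest, which mirrors the query and update costs described above.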
One approach that addresses the shortcomings of using CLOBs is to create a functional index on top of the CLOB. This approach provides some enhanced query capability, but it can be used only in limited circumstances. For example, functional indexes cannot be created on columns that contain a collection, such as a v-array, and thus this approach will not work for columns that contain v-arrays.
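The idea of a functional index can be sketched with SQLite's indexes on expressions, used here purely as a stand-in for the functional indexes of other database systems. The packing format (`key=value;key=value`), the table and column names, and the extraction expression `WIDTH_EXPR` are assumptions made for this example. The expression also relies on SQLite's rule that casting text to an integer uses the longest leading numeric prefix:

```python
import sqlite3

# Assumed expression: extract the integer that follows 'width=' in the package.
# 'width=' is six characters long, so the value starts six characters after
# the position reported by instr().
WIDTH_EXPR = (
    "CAST(substr(user_meta, instr(user_meta, 'width=') + 6) AS INTEGER)"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (name TEXT, user_meta TEXT)")
conn.executemany(
    "INSERT INTO resources VALUES (?, ?)",
    [
        ("a.jpg", "width=640;height=480"),
        ("b.jpg", "height=768;width=1024"),
    ],
)

# Index on the expression rather than on the raw packed column.
conn.execute(f"CREATE INDEX idx_width ON resources ({WIDTH_EXPR})")

# A query that repeats the same expression is eligible to use the index.
names = [
    row[0]
    for row in conn.execute(
        f"SELECT name FROM resources WHERE {WIDTH_EXPR} = ? ORDER BY name",
        (1024,),
    )
]
print(names)  # ['b.jpg']
```

The sketch also hints at why the approach is limited: the index covers only the one property that the expression extracts, and a separate expression and index would be needed for each additional property.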
Based on the foregoing, it is clearly desirable to provide a mechanism for efficiently storing and querying user-defined metadata.