Persistence stores (databases) as generally known in the art have been specifically designed for specific application domains. For example, relational databases (RDBs) started off being optimized for on-line transactional processing (OLTP) applications, and on-line analytical processing (OLAP) engines were optimized for decision support systems (DSS). To some degree RDBs have evolved to also support DSS applications, but specialized engines still dominate in that domain.
Relational databases (RDBMSs) are the dominant form of commercial databases available today. The key advantage of RDBMSs is a declarative query language that provides flexibility in expressing unanticipated access paths. Although the commercial RDBs have attempted to become “universal databases” usable with each application domain, they have been developed and were optimized for OLTP type applications that have very simple and restrictive data modelling needs and a particular pattern of updates and accesses that are not necessarily applicable to other application domains.
Thus, RDBMS used with OLTP systems do neither provide adequate support for rich flexible data models nor for full text and natural language.
Furthermore, XML databases are known in the art which are based on the concepts of the extensible markup language (XML) used to for document markup. Certain enhancements introduced in XML have been proposed in the art to use XML as a general model for a serialized representation of arbitrary data structures for communication or storage purposes. It has been observed that a means to query data modelled in XML and directly producing XML within a query could reduce some of the issues with the relational models (e.g., the rigid structures for both application modelling and the result structures). A big difference with the XML databases is the use of a similar declarative query specification, even though there is still a primary hierarchical access view.
XML/RDF (resource description framework) databases (e.g., graph databases) are typically implemented on top of relational databases. Support for natural language is thus missing. Therefore, a performance penalty is the result.
The last relevant category of database systems are those built around a graph data model. As is the case with the XML databases these are relatively immature systems in many areas compared with relational systems. However, they have the advantage of not being restricted by the need to support the large legacy of those commercial systems.
Further models include full text systems and specific OLAP systems. However, the full text systems neither support rich flexible data models nor transactions and the OLAP systems neither support full text/natural language specific things nor transactions and also no flexible data models which would imply at least the possibility to delete/change entities.
Transactional persistence stores such as described above rely heavily on a single logical disk based updatable data structures for their value and index storage. All reads during query evaluation primarily access this store. Throughout the transactions any changes made to the global store are seen by the transaction. They work either by writing directly in place to the updatable structures during the transaction and maintaining information to roll back these changes in the event of transaction failure, or by keeping structures for the effect of changes in a transaction in parallel to the global state of the database system. After or when the transaction is committed the changes are made to the updatable structures. In the latter case the union of the disk based structure and the changes are seen by the transaction reads.
Database systems as known in the art usually have three primary disk based storage structures such as persistent transaction logs that are required for recovering from system failure, value stores that contain the current state of the system, and indices that support accelerated evaluation of certain queries. There is, particularly, a broad range of structures that have been developed for the values stores and the indices. In the following, the significant options for value stores and indices such as known in the art will be exemplified.
In general, value stores maintain the state information about the entities and relationships stored in the persistence system. There are two primary forms known in the art for maintaining this state: a row oriented approach or a column oriented approach.
Known relational databases systems have a data model of a set of relations that are highly regular structures. Typically, such systems store each of these tables in rows major order with each row residing in single physical disk pages. In the event that a projection or query evaluation operation only needs to access the elements contained in a single row that row can then be accessed with a single logical Input/Output (IO) operation. Much of the effort in developing such systems has concentrated on attempting to allow many of these accesses to get their information from main memory buffers, avoiding the need for corresponding physical IOs. However, each entity that does not fit within a page will often require additional access during evaluation or projection. This is often the case with enterprise content management (ECM) data in which the entity types do not map well to a simple relational schema, and there are large textual fields that cause even single entities to overflow the space available in single disk pages.
With row oriented stores where full table scans are required during query evaluation (e.g., if there is a predicate on a table for which there is no corresponding index) much IO may be spent on retrieving data that is not required to perform the evaluation. This is generally sequential IO which has a significant performance advantage over random IO. The advantage is large enough that even in the event that an index does exist to support a particular predicate, the table scan may still be more efficient than the use of the index if that would lead to large numbers of random row accesses. This is particularly the case when there is more than one predicate on the same table, one with an index and one without. The information retrieved may also potentially pollute the memory buffers used by the system to avoid other IOs. Much prior art has been developed in the areas of query optimizers and buffer management to mitigate these problems, but there are still situations in which these problems surface.
Another area that has been explored for reducing the intrinsic IO costs noted above involves using various row oriented compression mechanism. The effectiveness of these approaches is constrained by the need to randomly access individual rows. This generally limits the approaches to the use of simple stochastic models yielding compression rates on the order of 2 or 3 to 1.
A further issue with row oriented approaches when dealing with flexible schema structures is that they can result in very sparse relations, with many cells containing NULL values to indicate that the corresponding entry is not set. Again, there has been much work on addressing the costs of such NULLs in the context of row oriented stores, however the cost can still be significant.
Column oriented value stores are often used in OLAP type systems that want to perform aggregation operations on individual fields. For entities with large numbers of attributes the IO requirements for such aggregations can be significantly reduced relative to retrieving all rows in a row oriented value store and pulling out the needed field. However, a column orientation can be very expensive, relative to a row oriented store, for non-memory resident data sets to do projection of large numbers of attributes. To counteract this there are often more compression options available than for row oriented stores.
Another problem that arises with column based stores is the need to update each column for each attribute that is involved in any update. The batch based update mechanism employed by many of the systems that use column oriented value stores gets around these problems but eliminates the ability to perform online transactional updates. For applications, that have flexible schemas, or schemas with many entities with large numbers of attributes, but generally project a relatively small subset of those fields, a materialized view row orientation (that can be seen as a hybrid of row and column orientation) can yield the best performance.
Most specialized text search engines concentrate on their index structures rather than their system state. However, one common projection requirement for text engines is a presentation of what are known as dynamic summaries. The dominant mechanisms for producing such summaries involve the retrieval of the entire document content from some stored form and either a scan of the document looking for some sentences (or passages) that are most relevant to the query, or using location information produced by the indexing system to select some character sequences that contain the text that is most relevant. The latter approach requires the indexing system to identify such sequences which is often not available when document level statistical relevance ranking approaches are employed. For their metadata these text engines employ a general row or column based approach.
Indices are redundant representations of the information in the value stores that support accelerated performance of certain operations, typically selection or filtering but indices have also been specified to accelerate operations such as joins and grouping and aggregation. Filtering indices are basically mappings from one or more keys to the set of entities that match that key in some fashion. Join indices identify pairs of entities that satisfy some join condition. Projection and bit mapped indices are often used to support grouping and aggregation operations.
There are a few important characteristics to be considered for filtering indices. The characteristics are: Does the index cover the entity set with respect to its keys, is the key set disjoint or overlapping, and is the index augmented or a multi-levelled? These characteristics will be described next in more detail.
With respect to the coverage of the entity set, a non-covering index would probably be a lot smaller than a corresponding covering index.
For a disjoint key set query predicates that correspond to multiple keys (e.g., a range predicate over an index in which each key corresponds to a unique value) will need to perform the union of the portions of the index corresponding to each involved key. In overlapping indices query predicate may be constrained by some logical combination of keys. For example a date range index could have keys for individual days, weeks, months and years. The entries for a particular week would be the union of the entries for the days in that week, but the evaluator can avoid the cost of doing the union at query time by using the appropriate week key. Such indices are redundant and so incur both a space and maintenance cost, but for certain predicates this may be worth the expenditure.
Augmented or multi-level indices in which the keys identify the entities that satisfy some predicate, and some form of auxiliary information or auxiliary index supports the identification of subsets of those entities that satisfy some finer grained predicate. For example, some index could identify that some word xyz occurred in one of the attributes of an entity. If such predicates were common this could be a useful index. However, there may also be predicates that are looking for xyz only within attribute A. If this were also common, but less so than the attribute agnostic predicate one could augment the index entries with the identification of the attribute(s) of the entity in which the word occurred, or provide subset indices for particular attributes that provided the same information.
The most common filtering index structure in standard databases is the known BTree which has many variants. The BTree supports log time random updates, although the base of the log with respect to page access (and correspondingly logical IO accesses) is generally very high, and a relatively small main memory “cache” of at least the top levels of the tree can limit average numbers of IOs per random access request to 1 or even lower. However, committing information to the persistence store will require physical writes of all affected pages (usually 1). In some systems such physical updates are postponed to a batch type process, while an in-memory structure maintains the information on the changes since the last batch run. The row/entity references associated with each key can be either logical or physical addresses of the location of the row. Logical addresses provide an additional degree of freedom in re-arranging the referenced entities. BTrees efficiently support both equality and to some degree range predicates. The latter generally requires some re-ordering of the addresses, if there are other predicates for which there are indices or if the corresponding rows need to be retrieved, to avoid disk thrashing.
Among the common variants/implementations of BTrees is a compressed version that takes advantage of the ordering to avoid fully representing the values that repeat between consecutive rows. This can significantly decrease the space needed for an index relative to a naïve representation and provide corresponding increases in the hit rate for a given amount of main memory available to cache the index pages. A variant that is sometimes used with indices over entities with small numbers of fields (e.g., 2-4) is to use the compressed BTree index as the value representation for the extent/table. Recent results have used dictionaries to map the keys in the index tuples into integers identifying elements in some dictionary. This mechanism can further reduce the space costs of the index, with a trade-off of some potential additional cost in maintaining and using the dictionaries.
Other filtering index types are generally optimized for specific data types and the common predicates on such types where these are poorly supported by BTree indices. For example, geographic region types and predicates for overlap and containment are supported by options such as KDTrees.
Inverted indices are primarily used to support fine grained predicates on potentially large textual values. In full text engines such indices also maintain auxiliary information needed to support a fuzzy evaluation of the “relevance” of an entity to a query and some form of phrase or proximity checking. These indices generally produce large numbers of index points for each entity. Although the contents of such indices are similar to the compressed BTree indices with appropriate fields, the large numbers of index points associated with each entity mean that the update costs of individual BTree index entries is generally too expensive. These systems generally then use batch update mechanisms and do not support traditional database concurrency control. In an inverted index the set of entity references associated with a particular key are referred to as the postings list for that key.
With respect to transaction management as known in the art there is a logical view of the isolation between process and the mechanisms used to implement this isolation. According to the prior art, the isolation possibilities are partitioned into four levels: serializable, repeatable read, read committed, and read uncommitted. The implementation mechanisms are generally divided into pessimistic and optimistic concurrency control. The potential concurrency supported generally increases as one descends the isolation levels at the cost of introducing inconsistencies in the database. Pessimistic implementations generally use some form of locking and updates are applied directly to a single updatable store. All transactions read from this single store, and isolation is generally achieved by some appropriate form of locking. Optimistic mechanisms have each transaction maintain some form of lookaside copies of some of the database state, including at least the changes made in that transaction. On commit these changes are checked against conflicts with changes made by other transactions that have been committed. If the check indicates no conflicts the corresponding changes are made to the updatable on-disk data structures that store the current database state. If there is any conflict the transaction is forced to rollback, although that is cheap since it merely involves dumping the lookaside. The costs associated with the lookaside and the conflict resolution depend on the isolation level. In those applications which have a low probability of conflict the optimistic mechanism can be much more efficient than the pessimistic one, although the reverse can be true when there is a high probability of conflict.