The present invention is directed to an improvement in relational database systems and in particular to the indexing of relational databases to permit efficient relational queries on databases.
In relational database systems, it is important to create indexes on columns of the tables in the database. It is well-known that the efficiency of relational operations such as the JOIN operation or the evaluation of query constraints (SELECTION) is improved if the relevant columns of the tables across which the operation takes place are indexed.
There have been many approaches to the problem of efficiently creating indexes for relational database tables that support fast access and that use limited amounts of storage. The B-tree and its variations are well-known data structures used for indexing relational databases.
From the point of view of speeding query processing, it is desirable to have indexes available for all columns (and combinations of columns) of all tables in a relational database. However, it is often not advantageous (or even feasible) to do so, since the time required to create the indexes individually, and the storage used by all the indexes after creation, are prohibitive.
It is therefore desirable to simultaneously create a large number of indexes on all the tables of a database in a space-efficient and time-efficient manner.
According to one aspect of the present invention, there is provided an improved index for relational databases.
According to a further aspect of the present invention, there is provided an indexing system for structured or semi-structured source data comprising: a tokenizer for accepting source data and generating tokens representing the source data, the tokens from the tokenization representing the source data in a relational view, where, for tokens representing a subset of the source data, the system generates tokens identifying the table and column of the subset of the data in the relational view of the source data; and an index builder for building index structures based on the tokens generated by the tokenizer, the index builder creating indexes which comprise a set of positional indexes for indicating the position of token data in the source data; a set of lexicographical indexes for indicating the lexicographical ordering of all tokens, the set of lexicographical indexes comprising a sort vector index and a join bit index associated with the sort vector index; and a set of data structures mapping between the lexicographical indexes and the positional indexes, comprising a lexicographic permutation data structure, the index builder creating a temporary sort vector data structure for generating the lexicographic permutation data structure and the sort vector index.
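By way of illustration only, the relationship between the structures named in this aspect may be sketched in Python as follows. The sketch is not the claimed implementation. The token shape (table, column, value), the ordering of tokens by value before table and column, the direction of each mapping, and the meaning given to the join bit (set when an entry in lexicographic order has the same value as the entry before it) are all assumptions made for the example, as are the names tokenize, build_indexes and Indexes.

```python
from typing import Dict, List, NamedTuple, Tuple

# Hypothetical token shape: (table, column, value).
Token = Tuple[str, str, str]


class Indexes(NamedTuple):
    tokens: List[Token]      # positional view: tokens in source order
    sort_vector: List[int]   # lexicographic rank -> source position
    lex_perm: List[int]      # source position -> lexicographic rank
    join_bits: List[int]     # assumed: 1 if same value as the previous rank


def tokenize(tables: Dict[str, List[Dict[str, str]]]) -> List[Token]:
    """Flatten {table: [row, ...]} into tokens tagged with table and column."""
    tokens: List[Token] = []
    for table, rows in tables.items():
        for row in rows:
            for column, value in row.items():
                tokens.append((table, column, str(value)))
    return tokens


def build_indexes(tokens: List[Token]) -> Indexes:
    """Build the sort vector, lexicographic permutation and join bits."""
    # Temporary sort vector: source positions ordered by value, with
    # table and column as tie-breakers (an assumed ordering).
    temp_sort = sorted(
        range(len(tokens)),
        key=lambda p: (tokens[p][2], tokens[p][0], tokens[p][1]),
    )

    # Both persistent structures are derived from the temporary vector.
    sort_vector = list(temp_sort)
    lex_perm = [0] * len(tokens)
    for rank, pos in enumerate(temp_sort):
        lex_perm[pos] = rank

    # Join bits mark runs of equal values in lexicographic order.
    join_bits = [0] * len(tokens)
    for rank in range(1, len(tokens)):
        prev_value = tokens[sort_vector[rank - 1]][2]
        if tokens[sort_vector[rank]][2] == prev_value:
            join_bits[rank] = 1
    return Indexes(tokens, sort_vector, lex_perm, join_bits)
```

In this sketch a single sort over all tokens yields every structure at once, which is one way of reading the statement that the sort vector index and the lexicographic permutation data structure are both generated from a temporary sort vector.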
According to a further aspect of the present invention, there is provided a method for accessing the indexing system to carry out relational queries involving comparisons of data in the source data, the method comprising the steps of accessing the sort vector index for tokens corresponding to source data to be compared, determining, by following the associated join bit index, whether the source data to be compared, as indexed in the sort vector index, matches, and signalling whether the source data matches or does not match. According to a further aspect of the present invention, the method comprises the further step of utilizing the positional indexes to return source data when a match is signalled.
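By way of illustration only, the comparison steps of this method may be sketched in Python over a small, hand-built set of structures. The sketch is not the claimed method. It assumes that a join bit is set when an entry in lexicographic order has the same value as the entry before it, so that two entries match exactly when every join bit between their ranks is set. The names SOURCE, SORT_VECTOR, LEX_PERM, JOIN_BITS, values_match and matching_positions are hypothetical.

```python
from typing import List

# Hypothetical prebuilt structures for three token values in source order.
SOURCE = ["10", "20", "10"]   # positional view: value at each source position
SORT_VECTOR = [0, 2, 1]       # lexicographic rank -> source position
LEX_PERM = [0, 2, 1]          # source position -> lexicographic rank
JOIN_BITS = [0, 1, 0]         # assumed: 1 if same value as the previous rank


def values_match(pos_a: int, pos_b: int) -> bool:
    """Signal whether two source positions hold equal values.

    Only the index structures are consulted; SOURCE is never read.
    """
    lo, hi = sorted((LEX_PERM[pos_a], LEX_PERM[pos_b]))
    # Equal values form an unbroken run of set join bits after rank lo.
    return all(JOIN_BITS[rank] for rank in range(lo + 1, hi + 1))


def matching_positions(pos: int) -> List[int]:
    """Return the source positions of every token equal to the one at pos."""
    start = LEX_PERM[pos]
    while JOIN_BITS[start]:           # walk back to the head of the run
        start -= 1                    # JOIN_BITS[0] is always 0, so this stops
    end = start + 1
    while end < len(JOIN_BITS) and JOIN_BITS[end]:
        end += 1                      # walk forward to the end of the run
    return sorted(SORT_VECTOR[start:end])


def fetch(positions: List[int]) -> List[str]:
    """Use the positional view to return source data for matched positions."""
    return [SOURCE[p] for p in positions]
```

Under these assumptions, the match decision is made entirely from the sort vector and join bits, and the source data is touched only in the final step, after a match has been signalled.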
According to a further aspect of the present invention, there is provided a method for indexing structured or semi-structured source data comprising the steps of: accepting source data and generating tokens representing the source data, the tokens from the tokenization representing the source data in a relational view, where, for tokens representing a subset of the source data, the system generates tokens identifying the table and column of the subset of the data in the relational view of the source data; and building index structures based on the tokens generated by the tokenizer, the step of building index structures further comprising the steps of building a set of positional indexes for indicating the position of token data in the source data; building a set of lexicographical indexes for indicating the lexicographical ordering of all tokens, the set of lexicographical indexes comprising a sort vector index and a join bit index; and building a set of data structures mapping between the lexicographical indexes and the positional indexes, comprising a lexicographic permutation data structure, the sort vector index and the lexicographic permutation data structure being built from a temporary sort vector data structure.
According to a further aspect of the present invention, there is provided a computer program product tangibly embodying a program of instructions executable by a computer to perform the above method.
Advantages of the present invention include the provision of indexes for columns of tables in relational databases which require relatively small amounts of storage and which are capable of being accessed efficiently. A further advantage is that disk access is minimized, which allows queries to be processed faster than with traditional SQL products.