This invention relates generally to databases, and more particularly to improving query performance in large databases.
Enterprises commonly store enterprise data in large, frequently distributed, database systems. Database systems generally employ indexes to improve query performance. An index provides pointers to the data in a database that falls within the index definition, and speeds up retrieval by allowing a query to quickly locate the data that satisfies it without having to scan the entire dataset. Creating an index typically involves traversing the content of the database and building an index structure that represents the entire dataset. For large databases, this can be very costly and time consuming, frequently requiring many hours.
For certain types of data that are infrequently accessed, for example, data accessed only through ad-hoc queries for electronic discovery in litigation, building an index of the database may be too expensive to be worth the effort. In other instances, the database may be so large that indexing is impractical or impossible. For example, e-mail archives of large enterprises are frequently so large that it may be impossible to index all of the existing content. Moreover, an index based upon one ad-hoc query may be of little value for a future ad-hoc query. Thus, in the absence of a relevant index, an ad-hoc query must scan the entire dataset of a database and may require several hours to complete, and the next query will have to scan the database again and will similarly take several hours to complete. This results in inefficient query performance.
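The difference between an indexed lookup and a full scan described above may be illustrated by the following minimal sketch. It is an assumption-laden illustration only and does not represent any particular database system: the records, the field names, and the functions `full_scan`, `build_index`, and `indexed_lookup` are hypothetical and do not appear elsewhere in this description.

```python
from collections import defaultdict

# Hypothetical e-mail records; the field names are illustrative assumptions.
records = [
    {"id": 1, "sender": "alice", "subject": "contract draft"},
    {"id": 2, "sender": "bob", "subject": "lunch"},
    {"id": 3, "sender": "alice", "subject": "invoice"},
]


def full_scan(rows, predicate):
    """Answer a query by examining every record in the dataset."""
    return [row["id"] for row in rows if predicate(row)]


def build_index(rows, field):
    """Traverse the whole dataset once, mapping each field value to row positions."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[field]].append(position)
    return index


def indexed_lookup(rows, index, value):
    """Follow the index's pointers instead of examining every record."""
    return [rows[position]["id"] for position in index.get(value, [])]


# Building the index costs one full traversal of the dataset.
sender_index = build_index(records, "sender")

# A query covered by the index definition avoids the scan.
by_index = indexed_lookup(records, sender_index, "alice")
by_scan = full_scan(records, lambda row: row["sender"] == "alice")

# An ad-hoc query on a different field gets no help from sender_index
# and must fall back to scanning every record.
ad_hoc = full_scan(records, lambda row: "invoice" in row["subject"])
```

In this sketch the index on `sender` answers only queries on that field, whereas the ad-hoc query on `subject` must still examine every record, which corresponds to the limitation noted above for ad-hoc queries lacking a relevant index.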
It is desirable to provide systems and methods that address the foregoing and other problems of ad-hoc query performance in large databases, and it is to these ends that the present invention is directed.