By way of background concerning conventional data query systems, when a large amount of data is stored in a database, such as when a server computer collects large numbers of records, or transactions, of data over long periods of time, other computers sometimes desire access to that data or a targeted subset of that data. In such a case, the other computers can query for the desired data via one or more query operators. In this regard, relational databases have historically evolved for this purpose and have been used for such large-scale data collection, and various query languages have developed which instruct database management software to retrieve data from a relational database, or a set of distributed databases, on behalf of a querying client.
Traditionally, relational databases have been organized according to rows, which correspond to records, having fields. For instance, a first row might include a variety of information for its fields corresponding to columns (name1, age1, address1, sex1, etc.), which define the record of the first row, and a second row might include a variety of different information for fields of the second row (name2, age2, address2, sex2, etc.). However, conventional techniques for querying over enormous amounts of data, or for retrieving enormous amounts of data for local querying or local business intelligence by a client, have been limited in that they have not been able to meet real-time or near real-time requirements. Particularly in the case in which the client wishes to have a local copy of up-to-date data from the server, the transfer of such large amounts of data from the server, given limited network bandwidth and limited client cache storage, has been impractical to date for many applications.
By way of further background, due to the convenience of conceptualizing differing rows as differing records with relational databases as part of the architecture, techniques for reducing data set size have thus far focused on the rows due to the nature of how relational databases are organized. In other words, the row information preserves each record by keeping all of the fields of the record together on one row, and traditional techniques for reducing the size of the aggregate data have kept the fields together as part of the encoding itself.
It would thus be desirable to provide a solution that achieves simultaneous gains in data size reduction and query processing speed. In addition to applying compression in a way that yields highly efficient querying over large amounts of data, it would be further desirable to gain further insight into complex operations, such as filter and sort operations, over large amounts of data, particularly where a client application or user may only wish to see, or can only see, a small window of data at a time (e.g., as limited by actual display real estate). In such circumstances, performing the filter and sort operations on the back end over the entire data set prior to sending results on to the user or client application can be time-consuming, and thus inefficient or inadequate in the case of real-time applications.
For instance, when a client application requests to display a window over a large amount of data kept in a storage, today, at a logical level, the client application can request it through a query, such as the following pseudo-SQL query:
SELECT SKIP <start_of_window> TOP <window_size> <list_of_columns> FROM <table> [JOIN <list_of_tables_to_joins>] [WHERE <list_of_predicates>] [ORDER BY <list_of_columns>]
To resolve this request, conventionally, the storage layer first sorts and filters the data, and finally uses this ordered and filtered result to return only the rows in the specified window. However, where the amount of data in the ordered and filtered result vastly surpasses the window size, one can see why this approach is inefficient from the perspective of a user who wishes to see only the given window as fast as possible.
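By way of non-limiting illustration, the conventional resolution described above can be sketched as follows. This is a minimal sketch rather than the implementation of any particular storage layer: the function name `conventional_window`, the tuple row layout, and the sample data are assumptions introduced only for this example.

```python
from typing import Any, Callable, List, Sequence, Tuple

Row = Tuple[Any, ...]  # hypothetical row layout: (name, age)


def conventional_window(
    rows: Sequence[Row],
    predicate: Callable[[Row], bool],
    sort_key: Callable[[Row], Any],
    start_of_window: int,
    window_size: int,
) -> List[Row]:
    """Sort and filter the whole data set, then return one window of it."""
    # Every row is sorted, regardless of how small the requested window is.
    ordered = sorted(rows, key=sort_key)
    # The filter runs over the full ordered result.
    ordered_and_filtered = [row for row in ordered if predicate(row)]
    # Only now is the requested window cut out of the result.
    return ordered_and_filtered[start_of_window:start_of_window + window_size]


# Illustrative data: 1000 rows with ages spread evenly over 0..99.
rows = [("name%d" % i, (i * 19) % 100) for i in range(1000)]

window = conventional_window(
    rows,
    predicate=lambda row: row[1] >= 18,  # WHERE age >= 18
    sort_key=lambda row: row[1],         # ORDER BY age
    start_of_window=10,                  # SKIP 10
    window_size=5,                       # TOP 5
)
```

In this sketch, all 1000 rows are sorted in order to return 5 of them, which illustrates why the cost of the approach is governed by the size of the data set rather than by the size of the window.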
In this regard, one problem is that sorting a large amount of data is a very expensive operation, affecting performance of the component that requested the window of data.
One conventional way of solving this problem is to have the storage component ensure that it first applies the filter, and then orders only the results that pass the filter. This ensures that less data needs to be sorted, and helps in general proportion to how selective the filter is, i.e., how much the filter narrows down the target data set to be sorted. However, one can see even this plan does not help if a lot of rows match the filter predicates, since a large number of rows still need sorting, which returns to the original problem.
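The dependence of this filter-first plan on filter selectivity can be illustrated with a short sketch. The function name `filter_then_sort_window`, the row layout, and the sample predicates are hypothetical and chosen only for illustration; the function also returns the number of rows that had to be sorted so that the two cases can be compared.

```python
from typing import Any, Callable, List, Sequence, Tuple

Row = Tuple[Any, ...]  # hypothetical row layout: (id, age)


def filter_then_sort_window(
    rows: Sequence[Row],
    predicate: Callable[[Row], bool],
    sort_key: Callable[[Row], Any],
    start_of_window: int,
    window_size: int,
) -> Tuple[List[Row], int]:
    """Apply the filter first, sort only the passing rows, return a window.

    Also returns how many rows were sorted, for illustration only.
    """
    passing = [row for row in rows if predicate(row)]
    ordered = sorted(passing, key=sort_key)
    window = ordered[start_of_window:start_of_window + window_size]
    return window, len(passing)


# Illustrative data: 1000 rows with ages spread evenly over 0..99.
rows = [(i, (i * 19) % 100) for i in range(1000)]
by_age = lambda row: row[1]

# A selective filter leaves few rows to sort.
_, selective_sorted = filter_then_sort_window(
    rows, lambda row: row[1] >= 99, by_age, 0, 5)

# A filter that most rows match leaves nearly the whole data set to sort.
_, broad_sorted = filter_then_sort_window(
    rows, lambda row: row[1] >= 1, by_age, 0, 5)

print(selective_sorted, broad_sorted)
```

With the selective predicate only 10 rows are sorted, whereas with the broad predicate 990 of the 1000 rows are still sorted to produce a window of 5.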
Another conventional solution is to use a caching mechanism, so that the cost of sorting all of the rows is paid only when the user requests the first window of data. Subsequent queries after the first window then have an amortized cost, as the query processor/handler can use the cached result to return different windows over the data when the filter and order conditions are unchanged. This approach, however, has a relatively high cost in terms of memory, since the cached results have to be preserved. While the cache can be evicted or invalidated for various reasons, including memory pressure, usage patterns, etc., the problem remains that at least the initial cost of sorting all rows that pass the filter has to be paid. For time-critical queries, this may be unacceptable.
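The caching approach can be sketched as follows, again only as a hedged illustration. The class name `WindowCache`, the use of caller-supplied strings to identify filter and order conditions, and the `full_sorts` counter are assumptions made for this example; eviction and invalidation are deliberately not modeled.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Row = Tuple[Any, ...]  # hypothetical row layout: (id, age)


class WindowCache:
    """Caches the fully filtered and sorted result per (filter, order) pair."""

    def __init__(self, rows: Sequence[Row]) -> None:
        self._rows = rows
        # Each entry holds a complete sorted result, hence the memory cost.
        self._cache: Dict[Tuple[str, str], List[Row]] = {}
        self.full_sorts = 0  # counts how often the full cost was paid

    def window(
        self,
        filter_id: str,
        order_id: str,
        predicate: Callable[[Row], bool],
        sort_key: Callable[[Row], Any],
        start_of_window: int,
        window_size: int,
    ) -> List[Row]:
        key = (filter_id, order_id)
        if key not in self._cache:
            # First window for these conditions: filter and sort everything.
            self.full_sorts += 1
            passing = [row for row in self._rows if predicate(row)]
            self._cache[key] = sorted(passing, key=sort_key)
        # Later windows with unchanged conditions are a slice of the cache.
        cached = self._cache[key]
        return cached[start_of_window:start_of_window + window_size]


rows = [(i, (i * 19) % 100) for i in range(1000)]
cache = WindowCache(rows)
adult = lambda row: row[1] >= 18
by_age = lambda row: row[1]

first = cache.window("age>=18", "age", adult, by_age, 0, 5)
second = cache.window("age>=18", "age", adult, by_age, 5, 5)
sorts_after_two_windows = cache.full_sorts

# Changing the filter condition forces a new full filter and sort.
third = cache.window("age>=65", "age", lambda row: row[1] >= 65, by_age, 0, 5)
sorts_after_new_filter = cache.full_sorts
```

In this sketch the second window is served from the cached result, but the first window for each distinct combination of filter and order conditions still pays the full sorting cost, and every cached result remains in memory until it is removed.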
Thus, a fast and scalable algorithm is desired for querying over large amounts of data in a data intensive application environment, particularly queries that implicate expensive filter and/or sort operations over data on a large scale, e.g., a billion rows or more in the target store.
The above-described deficiencies of today's relational databases and corresponding query techniques are merely intended to provide an overview of some of the problems of conventional systems, and are not intended to be exhaustive. Other problems with conventional systems and corresponding benefits of the various non-limiting embodiments described herein may become further apparent upon review of the following description.