This invention relates to the field of information retrieval, and more particularly to the retrieval of information from a database system for purposes such as data analysis.
Data analysis applications typically use a client system to display condensed or summarized views of large amounts of data stored in a centralized database system (such as a server). In retrieving the information to form such views, there is typically a trade-off between the following two variables: (i) the amount of data (e.g. the number of records) that will be returned in response to a query/request for a particular view (which is typically unknown at the point of making the query/request); and (ii) the amount of resources the client system has available to display the data (which is typically unknown to the server system at the time of retrieving the data).
When using existing database query languages, such as SQL or MDX, it is known to specify hard limits on the amount of data returned by a query (e.g. the number of data items or records). Some existing query constructs (such as “SET ROWCOUNT”, “TOP N” or “FETCH FIRST N ONLY”, for example) enable a client system to specify that only a subset of an entire result set should be returned to the client system. For data analysis purposes, such methods of limiting data are very crude, and they can potentially exclude large (and in some cases arbitrary) sections of the data from a query result. This can misrepresent the overall structure of a data set and can potentially lead to wrong data analysis conclusions being drawn.
For example, consider the following instance of a SALARIES table (shown as Table 1 below) stored in a database system, together with a limited client system that can only handle five (5) records at a time. The total number of records in the table may not be known at the time of generating a database query. Further, the total number of unique values in the Person column may not be known.
TABLE 1
Person     Department      Salary
Anna       Management      7
Bob        Management      5
Claire     Sales           5
Dave       Sales           6
Edward     Office Staff    6
Francis    Production      7
Greg       Production      5
Henry      Production      6
Irene      Production      5
Joe        Sales           5
To obtain an overview of the salaries, but at the same time meet the requirement to limit the size of a result set to five records, one can create a pseudo query such as “SELECT PERSON FROM SALARIES LIMIT 5”. Such a query would return the first five records from the table stored in the database, but arbitrarily leave out almost 50% of the total salary paid. Alternatively, one can arrange the query to return the five largest salaries from the table stored in the database, but this would still leave out approximately 44% of the total salary paid.
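The proportions discussed above can be checked directly from Table 1. The following is a minimal sketch only: the list name `SALARIES` and the helper `excluded_share` are hypothetical names introduced for illustration, and the "first five" case assumes the database returns rows in the order in which Table 1 lists them.

```python
# Rows copied from Table 1: (person, department, salary).
SALARIES = [
    ("Anna", "Management", 7),
    ("Bob", "Management", 5),
    ("Claire", "Sales", 5),
    ("Dave", "Sales", 6),
    ("Edward", "Office Staff", 6),
    ("Francis", "Production", 7),
    ("Greg", "Production", 5),
    ("Henry", "Production", 6),
    ("Irene", "Production", 5),
    ("Joe", "Sales", 5),
]


def excluded_share(all_rows, kept_rows):
    """Fraction of the total salary that is missing from kept_rows."""
    total = sum(salary for _, _, salary in all_rows)
    kept = sum(salary for _, _, salary in kept_rows)
    return (total - kept) / total


# Hard limit of five records, in table order (assumed storage order).
first_five = SALARIES[:5]

# Hard limit of five records, keeping the largest salaries.
top_five = sorted(SALARIES, key=lambda row: row[2], reverse=True)[:5]

print(f"first five rows exclude {excluded_share(SALARIES, first_five):.1%}")
print(f"top five salaries exclude {excluded_share(SALARIES, top_five):.1%}")
```

With the Table 1 data, the total salary is 57; the first five rows account for 29 of it (49.1% excluded) and the five largest salaries account for 32 (43.9% excluded), in line with the approximate figures given above.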
Further, for the example above, if the client system can display at most seven (7) values simultaneously (because of screen size limits, for example), a preferred level of detail would involve seven (7) items or fewer. In a case where more than seven (7) items are returned from the database system in response to a query, the additional information may not be desirable since it cannot be displayed by the client system.
Existing systems attempt to address such limitations in one of two ways:
(i) A query is formulated and the result set (i.e. retrieved information) is interpreted at the client system to see if it meets a predetermined level of detail. Based on the results of this interpretation, additional (altered) queries are formulated in an effort to obtain the desired level of detail in incremental steps.
(ii) Separate meta-data and data queries are sent to the database system. Based on their results, a final query/request is formulated which is predicted (but not always guaranteed) to provide information with the desired level of detail.
Both of these approaches (and their combinations) result in an undesirable processing overhead in terms of the number of queries transmitted and the amount of data that needs to be transferred between the server and the client system. Also, both require knowledge of data stored by the database system and metadata outside of the query engine. They are also limited in what they can achieve or predict.
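The overhead of the two approaches can be illustrated with a small sketch using Python's built-in sqlite3 module and the Table 1 data. Everything here is an assumption made for illustration rather than a description of any particular existing system: the `Client` class, the functions `incremental` and `metadata_first`, the seven-item limit taken from the example above, and in particular the idea that an "altered" query simply moves from grouping by person to grouping by department.

```python
import sqlite3

ROWS = [
    ("Anna", "Management", 7), ("Bob", "Management", 5),
    ("Claire", "Sales", 5), ("Dave", "Sales", 6),
    ("Edward", "Office Staff", 6), ("Francis", "Production", 7),
    ("Greg", "Production", 5), ("Henry", "Production", 6),
    ("Irene", "Production", 5), ("Joe", "Sales", 5),
]
MAX_ITEMS = 7                      # assumed client display limit
LEVELS = ["person", "department"]  # assumed detail levels, finest first


def make_db():
    """Build an in-memory stand-in for the server-side SALARIES table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE salaries (person TEXT, department TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO salaries VALUES (?, ?, ?)", ROWS)
    return conn


class Client:
    """Hypothetical client that counts queries sent and rows received."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0
        self.rows_transferred = 0

    def run(self, sql):
        rows = self.conn.execute(sql).fetchall()
        self.queries += 1
        self.rows_transferred += len(rows)
        return rows


def incremental(client):
    """Approach (i): fetch a result, inspect it, and re-query if too large."""
    rows = []
    for level in LEVELS:  # column names come from the fixed list above
        rows = client.run(
            f"SELECT {level}, SUM(salary) FROM salaries GROUP BY {level}"
        )
        if len(rows) <= MAX_ITEMS:
            break
    return rows


def metadata_first(client):
    """Approach (ii): ask for group counts first, then send one data query."""
    chosen = LEVELS[-1]  # fall back to the coarsest level
    for level in LEVELS:
        count = client.run(f"SELECT COUNT(DISTINCT {level}) FROM salaries")[0][0]
        if count <= MAX_ITEMS:
            chosen = level
            break
    return client.run(
        f"SELECT {chosen}, SUM(salary) FROM salaries GROUP BY {chosen}"
    )


for approach in (incremental, metadata_first):
    client = Client(make_db())
    result = approach(client)
    print(approach.__name__, client.queries, client.rows_transferred, len(result))
```

Under these assumptions, both approaches end with the same four department totals, but approach (i) needs two queries and transfers 14 rows (ten of which are discarded), while approach (ii) needs three queries and transfers 6 rows. Neither obtains the desired level of detail in a single request.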