The present invention relates generally to querying electronic documents and, more particularly, to approximate query processing of electronic documents.
Presently, there are vast amounts of electronically stored data, with additional electronic data available daily. As the amount of electronically stored data swells, it will become increasingly difficult to search the electronic data for desired information. Accordingly, faster querying methods for examining electronic data are desirable.
Approximate query processing has recently emerged as a viable solution for dealing with large amounts of electronic data, high query complexities, and stringent response time requirements. Typically, users pose complex queries to a Database Management System (DBMS) which require complex operations to be performed over Gigabytes or Terabytes of disk-resident data and, thus, take a long time to execute to completion. Often an exact answer may not be required, and a user may prefer a fast, approximate answer. For example, during a drill-down query sequence in data mining, initial queries frequently have the sole purpose of determining the truly interesting queries and regions of the database. Providing (reasonably accurate) approximate answers to these initial queries gives users the ability to focus their examinations quickly and effectively, without consuming large amounts of system resources. An approximate answer can also provide useful feedback on the quality of a query, allowing users to make informed decisions on whether they would like to spend more time and resources to execute their queries to completion. For queries requesting a numerical answer (e.g., total revenues or annual percentage), often the full precision of the exact answer is not needed and the first few digits of precision will suffice (e.g., the leading few digits of a total in the millions or the nearest percentile of a percentage.)
One prior art approximate querying method involves sampling. Sampling is based on the use of uniform, random samples to create synopses of the electronic data. The synopses are then queried to produce approximate query answers. Known methods based on sampling involve querying synopses in an attempt to extract desired information from the electronic data. Random samples of a data collection typically provide accurate estimates for aggregate operators (i.e., count, sum, average, etc.) However, random samples for non-aggregate operators (i.e., select, project, join, etc.) may provide undesirable results. For example, sampling techniques suffer from two inherent limitations which restrict their application as a general-purpose approximate query processing tool. First, it is known that a join operator applied on two uniform random samples results in a non-uniform sample of the join result. Thus, join operations lead to degradations in the quality of the approximate answer. Second, for a non-aggregate query, execution over random samples of the data will produce very small subsets of the exact query answer. For example, since sampling produces small subsets of the original data, the desired results may be discarded in the process of generating the samples. Therefore, since sampling is performed prior to querying, a non-aggregate query operator such as select applied on the small subsets of data will produce limited results, if any.
Another known approximate querying method involves the use of histograms. In a histogram method, data is grouped by ranges. By using histograms, synopses of the data can be created which leads to improved querying speeds. However, it is known that histogram-based approaches become problematic when dealing with high-dimensional data sets that are typical of modern systems. The reason is that as the dimensionality of the data increases, both the storage overhead (i.e., number of memory locations) and the construction cost of the histograms that can achieve reasonable error rates increase exponentially.
Accordingly, approximate querying processing methods which efficiently provide fast and accurate query results for aggregate and non-aggregate queries would be useful.
The present invention relates to approximate query methods for querying electronic data. The approximate query method of the present invention comprises: generating wavelet-coefficient synopses of the electronic data; querying the wavelet-coefficient synopses using modified standard (aggregate and non-aggregate) SQL operators implemented using novel query processing algorithms to obtain a wavelet-coefficient result; and rendering the wavelet-coefficient result to obtain an approximate result.
In the present invention, the querying occurs entirely in the wavelet-coefficient domain; that is, both the input(s) and the output of the query operators are compact collections of wavelet coefficients which capture the underlying data. This delays the expansion of the wavelet-coefficient synopses to the end of the querying process, thus allowing for extremely fast approximate querying.
The wavelet-coefficient synopses is generated by decomposing the electronic data into wavelet coefficients. A small set of the wavelet coefficients is then retained (using a threshold scheme) to create a compact, wavelet-coefficient synopses of the original data.