Many enterprises collect large numbers of facts on which they can base business decisions. For example, the facts may be contained in records that are “cookies,” created by a browser as a result of particular actions of users with respect to web pages being processed by the browser. The facts may be characteristics of the particular actions such as, for example, which pages of a particular web site a user has visited. While these facts provide much information about the users' behavior, it can be difficult to process so many facts to glean the information that is useful for making a particular business decision.
An “aggregation-type operation” may be performed to distill a large number of facts (such as the facts contained in cookies) into an aggregate value that represents those facts, such that a business decision may be made based on the aggregate value. However, it can be computationally prohibitive to process all of the available facts to accomplish the aggregation-type operation. On the other hand, if the aggregation-type operation is performed on a sampling of the facts, it is (conventionally) difficult to know if the outcome is the same (or the same “enough” to be reliable) as would result from performing the aggregation-type operation on all of the available facts.
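The trade-off described above can be illustrated with a minimal sketch. It assumes a hypothetical fact set (`pages_visited`, one page count per user), a sum as the aggregation-type operation, and independent per-fact sampling at a fixed rate; none of these names or choices come from the original text.

```python
import random


def estimate_total(facts, rate, rng):
    """Estimate sum(facts) from a sample in which each fact is kept
    independently with probability `rate`; scale the sampled sum by 1/rate."""
    sample = [x for x in facts if rng.random() < rate]
    return sum(sample) / rate, len(sample)


# Hypothetical facts: number of pages visited by each of 10,000 users.
pages_visited = [i % 7 for i in range(10_000)]

# Aggregation over all available facts (the expensive path).
exact_total = sum(pages_visited)

# Aggregation over roughly 10% of the facts (the cheap path).
rng = random.Random(0)  # fixed seed so the run is repeatable
estimated_total, sample_size = estimate_total(pages_visited, 0.1, rng)

relative_error = abs(estimated_total - exact_total) / exact_total
print(exact_total, round(estimated_total), sample_size, f"{relative_error:.1%}")
```

The sampled aggregate touches about a tenth of the facts, but a single run like this does not by itself indicate how reliable the estimate is, which is the difficulty the paragraph above describes.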
Furthermore, where the facts on which the aggregation-type operation is to be performed are a result of joining a plurality of fact sets, prior work suggests that the fact sets must be joined prior to any sampling, rather than joining samplings of the fact sets, in order to obtain correct results. For example, the prior work (see, e.g., Chaudhuri et al., “Overcoming Limitations of Sampling for Aggregation Queries,” ICDE 2001) suggests that the sample and join operations are not commutative. In other words, the prior work suggests that, because the join of sampled fact sets (sample before join) generally does not produce the same outcome (i.e., the same fact records) as the sample of joined fact sets (join before sample), it is undesirable to use the join of sampled fact sets as a basis for making business decisions. That is, according to the prior work, it is undesirable to perform aggregation-type operations on the join of sampled fact sets and to base business decisions on the outcome of such aggregation-type operations, even though to do so would be more computationally efficient.
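The non-commutativity that the prior work points to can be seen in a small sketch. It assumes two hypothetical fact sets (`page_views` and `purchases`) keyed by a user identifier, an inner join on that key, and independent per-row sampling; these are illustrative assumptions, not details from the original text or from the cited paper.

```python
import random


def bernoulli_sample(rows, rate, rng):
    """Keep each row independently with probability `rate`."""
    return [row for row in rows if rng.random() < rate]


def join_on_user(left, right):
    """Inner-join two lists of (user_id, value) rows on user_id."""
    by_user = {}
    for user_id, value in right:
        by_user.setdefault(user_id, []).append(value)
    return [
        (user_id, left_value, right_value)
        for user_id, left_value in left
        for right_value in by_user.get(user_id, [])
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
rate = 0.1

# Hypothetical fact sets: one row per user in each set.
page_views = [(u, f"page{u % 5}") for u in range(1000)]
purchases = [(u, u % 3) for u in range(1000)]

# Join before sample.
full_join = join_on_user(page_views, purchases)
sample_of_join = bernoulli_sample(full_join, rate, rng)

# Sample before join.
join_of_samples = join_on_user(
    bernoulli_sample(page_views, rate, rng),
    bernoulli_sample(purchases, rate, rng),
)

print(len(full_join), len(sample_of_join), len(join_of_samples))
```

Under these assumptions, a joined record survives sample-before-join only if both of its source rows are sampled (probability `rate * rate`, here 0.01), whereas it survives join-before-sample with probability `rate` (here 0.1). The two orders therefore yield different record sets, although sample-before-join processes far fewer rows in the join.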