Conventional database management systems (DBMSs) use histograms and other statistics to make informed internal decisions (e.g., to determine on which attributes to build indices, and to plan and execute queries) and to provide approximate query answers for interactive data exploration and visualization. In fact, histograms are a common summarization mechanism for the deterministic data stored in conventional DBMSs, and are often provided as a synopsis tool in conventional database query engines. Assuming a one-dimensional data distribution (e.g., capturing tuple frequencies over the domain of an attribute), a histogram synopsis partitions the data domain into a small number of contiguous ranges, referred to as buckets, and stores concise statistics to summarize the tuple frequencies (or probabilities) in each bucket. An example of such a concise statistic is the average bucket frequency (or probability). Typically, bucket boundaries are chosen to minimize a given error function that measures within-bucket dissimilarities and aggregates errors across buckets (e.g., using summation or maximum).
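For illustration only, the bucket selection described above can be sketched as a small dynamic program. The sketch assumes one particular error function, namely the sum of squared deviations from the average bucket frequency, aggregated across buckets by summation; the text does not prescribe this choice. The names `bucket_sse` and `build_histogram` are hypothetical and do not refer to any particular DBMS.

```python
from typing import List, Tuple


def bucket_sse(freqs: List[float], start: int, end: int) -> float:
    """Within-bucket error for freqs[start:end]: the sum of squared
    deviations from the average bucket frequency (an assumed error function)."""
    segment = freqs[start:end]
    mean = sum(segment) / len(segment)
    return sum((f - mean) ** 2 for f in segment)


def build_histogram(freqs: List[float], num_buckets: int) -> List[Tuple[int, int, float]]:
    """Partition the domain 0..len(freqs)-1 into num_buckets contiguous buckets
    that minimize the summed within-bucket error.

    Returns (first_index, last_index, average_frequency) for each bucket.
    Assumes 1 <= num_buckets <= len(freqs).
    """
    n = len(freqs)
    inf = float("inf")
    # cost[b][i]: least total error when freqs[:i] is covered by b buckets.
    cost = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    # split[b][i]: start index of the last bucket in that best partition.
    split = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for i in range(b, n + 1):
            for j in range(b - 1, i):
                candidate = cost[b - 1][j] + bucket_sse(freqs, j, i)
                if candidate < cost[b][i]:
                    cost[b][i] = candidate
                    split[b][i] = j

    # Walk the split table backwards to recover the bucket boundaries.
    buckets = []
    end = n
    for b in range(num_buckets, 0, -1):
        start = split[b][end]
        segment = freqs[start:end]
        buckets.append((start, end - 1, sum(segment) / len(segment)))
        end = start
    buckets.reverse()
    return buckets
```

With the frequencies `[1, 1, 1, 10, 10, 10]` and two buckets, the sketch places the boundary between the third and fourth domain values and stores the averages 1.0 and 10.0. Replacing the summation in the inner loop with a maximum would give the other aggregation mentioned above.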
Unlike conventional DBMSs, a probabilistic DBMS stores and manages probabilistic, or uncertain, data rather than deterministic data. Whereas deterministic data has fixed attribute values, probabilistic data has at least one attribute that can take on one of many possible attribute values according to some probabilistic relation. As such, a probabilistic DBMS typically specifies the attribute values for a data tuple using a probability distribution over different, mutually exclusive alternative attribute values, and assumes independence across tuples. Thus, a probabilistic database can be a concise representation for a set of probabilistic data over an exponentially large collection of possible worlds, with each possible world representing a possible deterministic, or grounded, instance of the database (e.g., determined by randomly selecting an instantiation for each probabilistic data tuple according to the data tuple's probability distribution). Because the probabilistic data has at least one uncertain (random) attribute, conventional histogram synopses, which expect data with deterministic attributes, are generally not applicable in a probabilistic DBMS setting.
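As a minimal sketch of the possible-worlds view described above, the following enumerates every grounded instance of a tiny probabilistic relation. It assumes each tuple is given as a mapping from mutually exclusive alternative values to probabilities that sum to one, and that tuples are independent, so the probability of a world is the product of the chosen alternatives' probabilities. The function name `enumerate_worlds` and the data layout are hypothetical.

```python
from itertools import product
from math import prod
from typing import Dict, List, Tuple


def enumerate_worlds(tuples: List[Dict[int, float]]) -> List[Tuple[Tuple[int, ...], float]]:
    """Return every possible world as (attribute values, probability).

    Each element of `tuples` maps an alternative attribute value to its
    probability. Independence across tuples is assumed, so a world's
    probability is the product of its per-tuple probabilities.
    """
    worlds = []
    # One alternative is picked per tuple; product() walks all combinations.
    for choice in product(*(t.items() for t in tuples)):
        values = tuple(value for value, _ in choice)
        probability = prod(p for _, p in choice)
        worlds.append((values, probability))
    return worlds


# Two uncertain tuples yield 2 * 2 = 4 possible worlds.
relation = [{1: 0.5, 2: 0.5}, {2: 0.25, 3: 0.75}]
for values, probability in enumerate_worlds(relation):
    print(values, probability)
```

The number of worlds is the product of the number of alternatives per tuple, which is why the collection grows exponentially with the number of tuples. It also shows why a frequency such as "how many tuples take the value 2" is itself a random quantity that differs from world to world.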