Many graphical user interfaces have been developed for presenting the contents of a database. Such systems include (a) spreadsheet packages, such as Microsoft Excel and Lotus 1-2-3; (b) database systems, such as Microsoft Access and Microsoft SQL Server; (c) statistical packages, such as SAS and SPSS; (d) statistical crosstab analysis packages, such as Quantum; (e) business intelligence systems, such as MicroStrategy and Hyperion; and (f) data mining systems, such as SAS Enterprise Miner.
All of these packages provide graphical user interfaces by which users can query the contents of the database and summarize the results in various forms. One common interface is based on the Structured Query Language (SQL): a user writes queries in a text interface and views the results in a text or tabular view.
Another common approach to presenting the contents of the database is to present data and metadata in tabular or spreadsheet-like views. In the case of data, each row represents a record and each column represents a field; each cell contains the value of that field for that record. In the case of metadata, each row represents a field, and each column represents a property of the field; each cell contains the value of the property for that field. Most database packages offer spreadsheet or tabular views of the data itself.
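The two tabular views described above can be sketched in a few lines of Python. This is only an illustration: the sample records, the field names, and the two metadata properties shown (type and number of distinct values) are assumptions chosen for the example and do not come from any particular database package.

```python
# Hypothetical in-memory table: each dict is one record.
records = [
    {"customer_id": 1, "region": "East", "spend": 120.5},
    {"customer_id": 2, "region": "West", "spend": 80.0},
    {"customer_id": 3, "region": "East", "spend": 42.25},
]

def data_view(rows):
    """Data view: one row per record, one column per field."""
    fields = list(rows[0].keys())
    return [fields] + [[row[f] for f in fields] for row in rows]

def metadata_view(rows):
    """Metadata view: one row per field, one column per field property."""
    header = ["field", "type", "distinct_values"]
    body = []
    for f in rows[0].keys():
        values = [row[f] for row in rows]
        body.append([f, type(values[0]).__name__, len(set(values))])
    return [header] + body

for line in data_view(records):
    print(line)
for line in metadata_view(records):
    print(line)
```

In the data view, the cell at a given row and column holds that record's value for that field; in the metadata view, it holds that field's value for that property.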
However, a database can easily contain far too much information for a human analyst to explore and interpret. As a result, graphical interfaces generally display not only the database contents directly, but also summaries of the data, such as cross-tabulations, or crosstabs, that summarize the relative frequency with which particular values of one or more fields occur.
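As a minimal sketch of such a summary, the following Python computes a two-field crosstab of relative frequencies. The records and the field names `region` and `product` are hypothetical, and the function is a simplified stand-in for what the packages above compute, not the method of any of them.

```python
from collections import Counter

# Hypothetical records with two categorical fields.
records = [
    {"region": "East", "product": "A"},
    {"region": "East", "product": "B"},
    {"region": "West", "product": "A"},
    {"region": "East", "product": "A"},
]

def crosstab(rows, row_field, col_field):
    """Relative frequency of each (row value, column value) combination."""
    counts = Counter((r[row_field], r[col_field]) for r in rows)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

table = crosstab(records, "region", "product")
for pair, freq in sorted(table.items()):
    print(pair, freq)
```

Each cell of the resulting table is the share of all records that carry that particular pair of values, so the cells sum to one.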
There are also other graphical approaches to representing database contents. These include bar charts, line charts, scatter charts, histograms, and time series. Most database packages offer these features directly or support interoperation with other database software packages.
Although these applications allow the user to specify a set of inclusion criteria and formatting of the graphical representation, the graphical summaries are essentially static depictions, and they generally do not allow the user to query the data itself via the graphical representation. For instance, in traditional database reporting applications, a user can choose to view a bar chart of a particular data series. However, clicking on a particular bar in the chart does not allow the user to query other data in the database that is associated with the data represented by the bar. Thus, current applications offer an inefficient means of analyzing data because a user must repeat the steps of creating a particular graphical representation of data many times over in order to organize data in a variety of ways.
While many database interfaces provide some mechanisms for the user to interactively specify what data is to be included in the graphical summaries (for example, Microsoft Excel provides pivot tables that display an interactive crosstab summary of data), such mechanisms are separate interfaces from the graphical views themselves. For instance, in Excel pivot tables the “wizard” used to specify the pivot table appears as a separate interface from the crosstab itself.
Some database applications also provide graphical user interfaces to the metadata. A common graphical approach to representing metadata, rather than the data themselves, is the Entity Relationship Model (ERM). An ERM diagram consists of nodes and arcs: each node represents a table, and each arc represents a relationship between tables, based on primary and foreign keys. However, these applications do not provide a graphical model in which nodes represent fields rather than tables, and arcs represent statistical relationships rather than foreign-key relationships.
Apart from these typical database applications, there are Bayesian networks and Probabilistic Relational Networks. Bayesian networks can be used for modeling the statistical relationships among variables, and some software packages provide facilities for estimating these models from data in relational databases.
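The node-and-arc structure of such a model can be sketched as a small graph in Python. The table names, key names, and the `Arc` and `related_tables` helpers are hypothetical and serve only to illustrate that nodes stand for tables and arcs stand for key relationships.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    """A foreign-key relationship from a child table to a parent table."""
    child_table: str
    foreign_key: str
    parent_table: str
    primary_key: str

# Nodes are tables; arcs are primary/foreign-key relationships.
nodes = {"customers", "orders", "order_items"}
arcs = [
    Arc("orders", "customer_id", "customers", "id"),
    Arc("order_items", "order_id", "orders", "id"),
]

def related_tables(table):
    """Tables directly linked to `table` by a key relationship."""
    linked = set()
    for arc in arcs:
        if arc.child_table == table:
            linked.add(arc.parent_table)
        elif arc.parent_table == table:
            linked.add(arc.child_table)
    return linked

print(sorted(related_tables("orders")))
```

Nothing in this structure describes individual fields or how their values co-vary, which is the gap noted above.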
In a Bayesian network, variables are represented as nodes. Each variable can take one of a discrete set of states, although each state can map to a range of continuous values in an underlying database. The node display shows a statistical distribution illustrating the probability of each state, and possibly other statistics such as the mean and standard deviation. These distributions represent marginal probability distributions over a probability space defined by all the nodes in the network.
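The mapping from continuous values to discrete states can be illustrated as follows. The `age` values, the state boundaries, and the state labels are assumptions made for this sketch. The distribution shown is a plain relative-frequency estimate for a single isolated node, not the output of a network inference engine.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical state boundaries for an "age" node:
# below 30, 30 up to but not including 60, and 60 or above.
edges = [30, 60]
labels = ["young", "middle", "senior"]

def to_state(value):
    """Map a continuous value to the discrete state whose range contains it."""
    return labels[bisect_right(edges, value)]

def marginal(values):
    """Relative frequency of each state among the given values."""
    counts = Counter(to_state(v) for v in values)
    return {state: counts[state] / len(values) for state in labels}

ages = [22, 35, 41, 67, 29, 58, 73, 30]
print(marginal(ages))
```

A node display would typically render this distribution as a bar per state, optionally alongside statistics such as the mean and standard deviation of the underlying values.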
Some software applications for Bayesian networks provide a graphical user interface for interacting with the model. Typically, each node displays a graphical representation of the distribution of values underlying the node, for instance as a bar chart or pie chart. In contrast to traditional database applications, the user can click directly on the nodes in the graphical interface to enter “findings” that specify constraints on the values of one or more nodes. In other words, the user can click on a state in a node, thus selecting the subset of the probability space corresponding to that state. A mathematical inference engine calculates the implications of those constraints and updates the distributions of all affected nodes. As a result, every other node is automatically updated to reflect the marginal probability distribution of its states over the newly defined subset of the probability space.
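The effect of entering a finding can be sketched with a hypothetical two-node network, Rain -> WetGrass. The probabilities are invented for illustration, and inference is done by brute-force enumeration of the joint distribution, which is a simplification rather than the algorithm used by any particular software package.

```python
# Hypothetical two-node network: Rain -> WetGrass.
p_rain = {"yes": 0.2, "no": 0.8}
p_wet_given_rain = {
    "yes": {"wet": 0.9, "dry": 0.1},
    "no": {"wet": 0.25, "dry": 0.75},
}

def joint():
    """Joint probability of every (rain, grass) combination."""
    return {(r, w): p_rain[r] * p_wet_given_rain[r][w]
            for r in p_rain for w in ("wet", "dry")}

def marginal_rain(finding_wet=None):
    """Marginal of Rain, optionally restricted by a finding on WetGrass."""
    # Keep only the part of the probability space consistent with the finding.
    kept = {k: p for k, p in joint().items()
            if finding_wet is None or k[1] == finding_wet}
    total = sum(kept.values())
    # Renormalize over the kept subset and sum out WetGrass.
    out = {}
    for (r, _), p in kept.items():
        out[r] = out.get(r, 0.0) + p / total
    return out

print(marginal_rain())        # distribution before any finding
print(marginal_rain("wet"))   # distribution after the finding WetGrass = wet
```

Entering the finding restricts the probability space to the selected state and renormalizes, so the Rain node shifts from its prior of 0.2 toward a higher probability of rain.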
However, these graphical Bayesian networks do not directly display the contents of the database. Rather, they display models of the database that are estimated from the data, and an inference engine synthesizes the results to calculate the distributions. For any arbitrary set of findings, the distribution of values calculated by the Bayesian network will generally not equal the distribution of values in the database. For large or complex networks, the approximation error due to modeling can be substantial, particularly when the analysis drills down into subsets of the probability space associated with the model. It is possible to develop a Bayesian network model in which, for all possible queries, the model results almost exactly represent the distribution of the data used to estimate the model. However, such a Bayesian network would require a number of parameters that increases exponentially with the number of nodes and states in the network and, as a result, is not practical.
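The gap between a model's answer and the data can be shown with a small constructed example. The eight records and the chain structure A -> B -> C are assumptions chosen so that the effect is visible. In a chain, C is conditionally independent of A given B, so the model's answer to the findings A=a, B=b reduces to the estimated P(C | B=b).

```python
# Hypothetical records (A, B, C); here C happens to equal A XOR B.
rows = [
    (0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 1),
    (1, 0, 1), (1, 0, 1), (1, 1, 0), (1, 1, 0),
]

def exact_p_c1(a, b):
    """Relative frequency of C=1 among the records with A=a and B=b."""
    subset = [r for r in rows if r[0] == a and r[1] == b]
    return sum(r[2] for r in subset) / len(subset)

def model_p_c1(a, b):
    """Chain model A -> B -> C estimated from `rows`.

    C depends only on B in this structure, so the finding on A is ignored.
    """
    subset = [r for r in rows if r[1] == b]
    return sum(r[2] for r in subset) / len(subset)

print(exact_p_c1(1, 1))  # what the database contains
print(model_p_c1(1, 1))  # what the chain model reports
```

For the findings A=1 and B=1, the data contain no record with C=1, yet the chain model reports a probability of one half. Adding an arc from A to C would repair this particular case, at the cost of a conditional table for C with one distribution per combination of parent states, which is the growth in parameters described above.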
The user interfaces for interacting with Bayesian networks provide a convenient means for selecting a subset of possible values and displaying the impact on the distributions of related nodes. Through such graphical interaction, a human analyst is able to explore the interrelationships and gain a clearer understanding of the model. However, such interactive interfaces are lacking in database and data reporting packages. Consequently, there is a need to provide such an interactive interface that enables a user to quickly explore the contents of a database, without the need for estimating models or viewing results that do not exactly match the data.