1. Field of the Invention
The invention concerns graphical user interfaces generally and more particularly concerns graphical user interfaces for data exploration systems.
2. Description of the Prior Art
The computer has permitted organizations to acquire, store, and access vast amounts of data about their operations, their suppliers, their customers, and their employees. The existence of this data has in turn led to the development of techniques for exploring and analyzing the data and the emergence of a new information specialist: the business data analyst or BDA.
The business data analyst is not without tools. There are dozens of commercial data exploration and analysis tools available under the overlapping categories of "decision support systems", "executive information systems", analysis environments, and OLAP (on-line analytic processing) tools. See for example the survey of these tools in "A data miner's tools", Byte Magazine, 2(10):91, October 1995. Other tools include ad-hoc query tools and report writers. New commercial tools continue to appear almost weekly.
The academic database research community has also addressed data exploration and analysis. It has done so first with research in difficult technical areas including query optimization, database structure, and advancing the underlying relational data model to handle new types of data. One product of this research has been improved algorithms for dealing with the problems raised in these areas. Additionally, "knowledge discovery in databases" has become an active research area. Knowledge discovery is similar to data mining, but is primarily concerned with using machine learning and statistical approaches to derive new knowledge from preexisting large corporate and scientific databases.
In spite of all of this activity, the business data analyst still does not have a set of tools that is really well suited to what he or she does. The academic tools, with their emphasis on machine learning, do not take into account the central role of the human data analyst in discovering useful patterns in the information, while the commercial tools are useful for finding information once the analyst knows what he or she is looking for, but do not help the analyst to figure out what part of the data is relevant to the task at hand. Moreover, existing tools cannot be easily combined to form an easy-to-use environment for data exploration and analysis. The following example shows the problems faced by a business analyst who employs the tools presently available:
AT&T Corp. markets a variety of telecommunications services. The marketing activities include promotions, ongoing advertising, new service offerings, new equipment offerings, bundled offerings, etc. Of course, AT&T's competitors are engaged in the same kinds of activities. AT&T is vitally interested in understanding the general market reaction to these efforts; doing so is surprisingly difficult. While AT&T has many large databases containing billing and customer premises equipment information, it is still difficult to find the right data and interpret it in the right context to glean the appropriate business insight. It is the task of the business data analyst to use this data to answer various business questions.
The task is made more difficult at AT&T by the sheer volume of the data. A data file, which combines data from many sources, might have 15 million records and take up 1/2 a gigabyte of storage. AT&T has many hundreds of such data files. For this reason, the data is not read into a relational data base. Instead, AT&T keeps most of the data files on 8 mm tape until they are needed, at which point they are read into flat files of the type used in the UNIX operating system for processing (UNIX is a trademark of the X Open Foundation).
The tools presently used in AT&T to explore and analyze this data are the following:
a small set of utilities provided with the UNIX operating system, including "grep", "sort", "uniq";
programs written in programming languages like C or AWK;
statistical packages like S; and
tree induction routines.
These tools are used under the X window system. The main reason these tools are used instead of a data base system is the quantity of data to be analyzed. With really large amounts of data, it is typically much faster to do analysis on a flat file than to use a data base system. That is particularly the case if the calculations involved in the analysis are well-understood and can be done on one pass through the data. The price paid for this speed is a lack of the "meta-data" support which is typically provided by a data base system: a flat file has no inherent structure, no information on the semantics or types of the data in the fields, and no integrity checking.
A typical one- to two-hour exploration and analysis session at AT&T involves operations like the following:
Run custom AWK script to divide base file by credit history into 4 segment files.
Pick smallest segment file for initial exploration.
Visually scan data to get a feeling for number of nulls in the revenue field.
If it seems high, run a small script to actually count them. If still high, note down.
Decide to examine revenue by region--run a small script to translate data file into files that S can read.
Drop into S to do the graphing, potentially customize the graph using the S language.
Note that one region has an "interesting" value (perhaps much higher than expected).
Extract the records with that region (by running a small script) into a new file.
Examine some other attribute of that file, using S, and create a graph "really" worth saving.
Try to go back and "do the same thing" to all of the categories created, or some combinations of the categories (which, in this example, is credit history by region by revenue, with several other attributes).
As is apparent from the foregoing, the work involves the use of many different tools. This in turn necessitates (1) manual bookkeeping, and (2) data translation. What the data analyst needs, and what the current tools do not provide, is support for flexible data segmentation, support for keeping track of a sequence of operations, support for reuse of work, and enforced semantics between operations and data (and thus between sequences of operations). The analyst further needs support for translation of data between file formats, support for capturing relationships between files, support for recovery from errors made earlier in a session, and support for window management. It is an object of the techniques described in the following to overcome these and other problems of the environments presently available for doing data exploration and analysis and thereby to provide an improved system for doing that work.
use of a data base to store not only the data being investigated, but also persistent representations of the directed graphs;
a client-server architecture in which the data base operations are performed in the server and the client displays the directed graphs, provides data base queries derived from the directed graphs to the server, and receives the tables resulting from those queries; and
lazy evaluation of the operations specified in the directed graph, with evaluation being done only when the user specifies execution of a branch of the graph and with the results of operations being encached in the representation of the graph, so that a branch need be evaluated only from the point at which an encached result is available.
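By way of illustration only, the lazy evaluation mentioned earlier in this section, in which a branch of a directed graph of operations is evaluated only on request and intermediate results are encached for reuse, might be sketched as follows. The sketch is a minimal in-memory approximation and not the system described here: the class Node, the field names "credit", "region", and "revenue", and the sample records are all hypothetical, and no data base or client-server division is modeled.

```python
# Hypothetical sketch: lazy evaluation of a directed graph of data operations,
# with each node's result encached so a branch is recomputed only from the
# point at which no encached result is available.

class Node:
    """One operation in the graph; `op` is applied to the parents' results."""

    def __init__(self, name, op, parents=()):
        self.name = name
        self.op = op
        self.parents = list(parents)
        self.cache = None
        self.has_cache = False
        self.runs = 0  # number of times `op` has actually been executed

    def evaluate(self):
        """Evaluate this branch on demand, reusing encached results."""
        if not self.has_cache:
            inputs = [parent.evaluate() for parent in self.parents]
            self.cache = self.op(*inputs)
            self.has_cache = True
            self.runs += 1
        return self.cache


# Illustrative stand-in for the records of a flat file (assumed fields).
RECORDS = [
    {"credit": "good", "region": "east", "revenue": 120.0},
    {"credit": "good", "region": "west", "revenue": None},
    {"credit": "poor", "region": "east", "revenue": 40.0},
    {"credit": "good", "region": "east", "revenue": 80.0},
]


def count_null_revenue(rows):
    """Count records whose revenue field is null."""
    return sum(1 for row in rows if row["revenue"] is None)


def revenue_by_region(rows):
    """Total the non-null revenue per region in one pass over the rows."""
    totals = {}
    for row in rows:
        if row["revenue"] is not None:
            totals[row["region"]] = totals.get(row["region"], 0.0) + row["revenue"]
    return totals


# Building the graph executes nothing; operations run only on evaluate().
source = Node("source", lambda: RECORDS)
good = Node(
    "good_credit",
    lambda rows: [row for row in rows if row["credit"] == "good"],
    [source],
)
nulls = Node("null_revenue_count", count_null_revenue, [good])
by_region = Node("revenue_by_region", revenue_by_region, [good])

null_count = nulls.evaluate()   # runs source, good_credit, then the count
totals = by_region.evaluate()   # reuses the encached good_credit result
```

In this sketch the segmentation step "good_credit" executes once even though two branches depend on it, which corresponds to the reuse of work and the tracking of a sequence of operations that the analyst is said above to lack. A persistent version would store the graph and its encached results rather than hold them in memory.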