The present invention relates to the field of computing. More particularly, it relates to a method for performing analysis of data.
The field of database management systems is well known and understood. The increasing size and usage of databases has led to new problems. In possibly large, possibly sparse cross-tabulations, commonly referred to as cross-tabs, computed from data contained within a database, it is always a challenge to present the end-user with that subset of data with the highest semantic content obtainable. This is done by omitting or consolidating redundant or unimportant information, that is, by deleting or aggregating specified rows and/or columns in the cross-tabulation so that only the more useful data is shown.
In the past, various techniques have been used to reduce the amount of information presented to the end-user. These techniques include simple ploys such as removing complete rows or columns of data containing only zero data or no data at all. Other more complex techniques have also been used to aggregate data so that the end-user is not overwhelmed by the sheer quantity of values, many of them sufficiently small that they can be discounted when assessing the overall picture. An example of this approach is the use of the so-called “Pareto rule”, which, in summary, postulates that in many sets of data, a large proportion of the data values are small and uninteresting, and thus may with advantage be grouped or ‘rolled up’ into a pseudo-category named, typically, “Other”. (This ‘rule’ is named for Vilfredo Pareto, an Italian economist and sociologist of the early 1900s. It is based on the unequal distribution of things in the universe, and paraphrased states that “80% of wealth is in the hands of 20% of the population”.)
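By way of illustration only, the roll-up of small values into an “Other” pseudo-category might be sketched as follows. The function name `roll_up_small`, the `keep_percent` parameter, and the sample figures are hypothetical and chosen for this example; they are not taken from any particular prior system.

```python
def roll_up_small(values, keep_percent=80, other_label="Other"):
    """Keep the largest categories until they cover keep_percent of the
    total, then aggregate every remaining category under other_label."""
    total = sum(values.values())
    # Rank categories from largest to smallest value.
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    kept = {}
    running = 0
    other = 0
    for name, value in ranked:
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 < keep_percent * total:
            kept[name] = value
            running += value
        else:
            other += value
    if other:
        kept[other_label] = other
    return kept


# Hypothetical sample data: five categories summing to 100.
sales = {"A": 50, "B": 30, "C": 8, "D": 7, "E": 5}
print(roll_up_small(sales))  # {'A': 50, 'B': 30, 'Other': 20}
```

In this sketch the two largest categories account for 80% of the total, so the remaining three are consolidated into a single entry.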
Other similar functions, herein referred to as filter expressions, have been adopted to assist the end-user in interpreting data. Each such function generally requires significant ad hoc development, and frequently not all of them have been available to the end-users of any one system. Because of the ad hoc nature of previous implementations, there have been few opportunities to take advantage of the processing savings available if several functions are evaluated at once.
Typical end-users of large databases are not always familiar with, or even aware of, these functions and, in consequence, are not able to manipulate the data effectively.
What is needed is an improvement in the usability of databases and in efficiency of processing the data.
According to the invention, there is provided a computer-based method of evaluating data by selecting the highest semantic content of a table of data, comprising the steps of: constructing a cross-tabulation of data from one or more databases; ascertaining filter expressions to be applied to said cross-tabulation of data; evaluating said filter expressions using said cross-tabulation of data; and storing the results of said evaluation in a status table. In this context, semantic content is defined as the informational content that is most meaningful or significant within the table, particularly for the current user.
The present invention introduces a mechanism to overcome the limitations of the existing methods for analyzing large amounts of data, which improves their usability as well as efficiency of processing. Examination of the problem led to the realization that there is indeed a relatively small number of possible filter expressions, or rather, filter expression types, which are useful in the context of manipulating cross-tabulations, particularly, but by no means limited to, large sparse ones. Typically, the end-user is presented with a list from which to select the various filter expression types used to condense the data, and this allows interactive selection of the most interesting ones.
The invention is a computer-based method of evaluating data by selecting the highest semantic content of a table of data. This is achieved, in one embodiment, by constructing a cross-tabulation of data from one or more databases, then ascertaining which filter expressions are to be applied to the data. The results of the evaluations of the filtering expressions using the data are then stored in a status table for later use.
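The sequence of steps just described might be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the record layout (row key, column key, value), the two sample filter expressions, and the layout of the status table as one set of flags per row are all hypothetical.

```python
from collections import defaultdict


def build_crosstab(records):
    """Construct a cross-tabulation from (row_key, col_key, value) records."""
    table = defaultdict(lambda: defaultdict(int))
    for row_key, col_key, value in records:
        table[row_key][col_key] += value
    return {row: dict(cols) for row, cols in table.items()}


def evaluate_filters(crosstab, filters):
    """Evaluate every filter expression against every row and return the
    results as a status table: {row_key: {filter_name: bool}}."""
    status = {}
    for row_key, cells in crosstab.items():
        status[row_key] = {name: test(cells) for name, test in filters.items()}
    return status


# Hypothetical filter expressions, each taking the cells of one row.
filters = {
    "all_zero": lambda cells: all(v == 0 for v in cells.values()),
    "row_total_below_10": lambda cells: sum(cells.values()) < 10,
}

# Hypothetical source data standing in for one or more databases.
records = [
    ("North", "Q1", 40), ("North", "Q2", 35),
    ("South", "Q1", 3), ("South", "Q2", 4),
    ("West", "Q1", 0),
]

crosstab = build_crosstab(records)
status_table = evaluate_filters(crosstab, filters)

# Later use of the stored results: show only rows not flagged as small.
visible = [r for r, flags in status_table.items()
           if not flags["row_total_below_10"]]
print(visible)  # ['North']
```

Because the results are held in the status table, a different choice of filter can be applied to the display without re-evaluating the expressions.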
In addition to permitting the end-user to select the filter expressions to be applied to the cross-tabulations, the invention allows the end-user to choose parameters or arguments for those filter expressions requiring them. All of the selected filter expressions can then be used as ‘filters’ of the data in the cross-tabulations, so that the end-user gains a better grasp of the significant attributes of the data by having less important data either omitted, or aggregated into arbitrary groups.
An additional benefit of the invention is that faster presentation (or evaluation and display) of the results occurs because all of the permitted filter expressions can be pre-computed. This faster presentation leads to significant improvements in usability and effectiveness.
It can also be seen that in geographically dispersed systems, any reduction of the quantity of data presented to the end-user provides the additional benefit of lessening the system resources required to transfer that data between locations.
In some instances, the invention involves the pre-computation of quite complex functions. Although this can be expensive in processing time, the improvements in end-user results and presentation speed, as well as the resultant savings in data transfer volumes, often outweigh this cost. Pre-computation is especially beneficial when it is anticipated that the table will be used multiple times, which is more likely where the end-user is analyzing the data interactively. Further, when performed concurrently, the computation cost for several parameters used in selected functions does not increase linearly with the number of parameters; rather, each additional parameter adds a relatively small incremental cost. Overall, concurrent pre-computing of multiple filters has the potential for significant savings in processing resources.
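One way in which several parameters of the same function might share most of their cost is sketched below, using a hypothetical “top N rows” filter as the example. The function name `precompute_top_n` and the sample totals are assumptions made for illustration. The ranking of the rows, which is the expensive step, is performed once; each additional value of N then requires only a comparison per row.

```python
def precompute_top_n(row_totals, n_values):
    """For each row, record whether it falls within the top N rows by total,
    for every N in n_values, using a single shared ranking."""
    # Shared cost: one sort, regardless of how many N values are requested.
    ranked = sorted(row_totals, key=row_totals.get, reverse=True)
    rank_of = {key: position for position, key in enumerate(ranked)}
    # Incremental cost: one integer comparison per row per N.
    return {key: {n: rank_of[key] < n for n in n_values}
            for key in row_totals}


# Hypothetical row totals from a cross-tabulation.
totals = {"a": 5, "b": 9, "c": 1}
flags = precompute_top_n(totals, (1, 2))
print(flags["a"])  # {1: False, 2: True}
```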
Although the primary benefit of the invention is to improve computational efficiency and provide enhanced end-user functionality, there are further benefits for client/server and similar network-based environments. The invention permits, indeed encourages, the end-user to make decisions which ultimately reduce the amount of data required to be transmitted across the network.
The environment in which the present invention is used is that of a general purpose computing facility connected with a number of databases. It is typically used by a number of simultaneous end-users, although that aspect is not relevant to the operation of the invention. The computing facility may comprise a number of interconnected computers, and the databases and users may be co-located or remotely located. Interconnection of these elements, whether or not co-located, might be over a network such as the Internet.
Other aspects of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying drawings.