The present invention relates to data analysis and, more particularly, to situations in which a “conservation law” exists between related quantities.
Given a pair of numerical data sequences (call them a and b) and an ordered attribute t such as time, the quantities obey a conservation law when the sum of the values in a up to T equals the sum of the values in b up to T, for every value T of t. Thus, current flowing into and out of a circuit node obeys a conservation law because the amount of current flowing into the node equals the amount of current flowing out.
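By way of illustration only, the exact conservation law just defined can be checked with running sums. The following Python sketch is not part of the described method; the function names conservation_gaps and obeys_conservation_law, the tolerance parameter, and the sample values are assumptions introduced here.

```python
from itertools import accumulate


def conservation_gaps(a, b):
    """Return (sum of a up to T) - (sum of b up to T) for every position T."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [x - y for x, y in zip(accumulate(a), accumulate(b))]


def obeys_conservation_law(a, b, tolerance=0.0):
    """True when every running-sum gap is within the (assumed) tolerance."""
    return all(abs(gap) <= tolerance for gap in conservation_gaps(a, b))


# Invented sample values: the second pair conserves the total but with delay.
inflow = [3, 1, 4]
print(conservation_gaps(inflow, [3, 1, 4]))  # [0, 0, 0]
print(conservation_gaps(inflow, [0, 3, 5]))  # [3, 1, 0]
```

A nonzero gap at some position T indicates that the law is not met exactly at T, even if the totals agree by the end of the sequence, as in the second pair.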
Thus in accordance with the present invention, we have recognized that even when two numerical sequences do not strictly obey a conservation law, it can be useful to carry out a conservation-law-based analysis of such data and, more specifically, to generate data indicating the degree to which the conservation law is or is not obeyed or satisfied. Examples of such pairs of sequences are a) the number of inbound packets at an IP router versus the number of outbound packets; b) the number of persons entering a building versus the number of persons leaving; and c) the dollar value of the charges run up by a group of credit card holders versus the payments made by those credit card holders.
We have thus recognized that a conservation law is often an abstract ideal against which real data may be calibrated. Short-term deviations naturally occur from delays and measurement inaccuracies (credit card holders carry a balance; two people who enter a building at the same time may leave at different times; packets entering an IP router may be buffered therein and may thus encounter delays between the router input and output ports; etc.). Major violations may be caused by unusual phenomena or data quality problems; e.g., the reported number of people entering a building may be consistently higher than the number of people leaving (via the front door) if there exists an unmonitored side exit.
We have thus further recognized that it would be useful to discover for which portions, or “subsets,” of the data a conservation law approximately holds (or fails) and to summarize them in a semantically meaningful way. Such a summarization allows one to, for example, quantify the extent of missing or delayed events that the data represents.
To this end, and in accordance with an aspect of the invention, we define what we call a conservation dependency. A conservation dependency is, in essence, an underlying conservation law coupled with a tableau. The tableau provides information about the degree to which particular subsets of the data do or do not satisfy the underlying conservation law, the former being referred to as a “hold tableau” and the latter being referred to as a “fail tableau.” Specifically, a hold tableau identifies subsets of the data (e.g., ranges of time for a time-ordered sequence of data) that satisfy the underlying conservation law to at least a specified degree, or “confidence.” By contrast, a “fail” tableau identifies subsets of the data that do not satisfy the underlying conservation law to at least a specified degree. Which type of tableau is the more useful for understanding properties of the data will depend on the nature of the data.
Consider, for example, sequences of credit card charges and payments over time for a given bank. In a general sense, the aggregate amount of charges over time tends to be equal to the aggregate amount of payments because most people pay their bills. Thus the conservation law “charges=payments” is an appropriate model to consider when analyzing such data. However, the aggregate of charges up to any given point in time is going to exceed the aggregate of payments up to that same point in time because of payment grace periods and other factors such as spending habits. Thus in this setting the conservation law “charges=payments” holds only approximately.
That being said, seeing, through a tableau such as that shown in FIG. 5, the degree (confidence) to which the conservation law “charges=payments” is or is not satisfied can help the bank understand customer patterns. Specifically, the appearance in a fail tableau of particular time periods (corresponding to subsets of the data) for which the data's confidence is below some level ĉ means that the conservation law is relatively unsatisfied during those time periods, meaning, in turn, that during those time periods the total outstanding balance owed to the bank is relatively high. Thus seeing from the fail tableau that the conservation law “charges=payments” is relatively unsatisfied during the holiday shopping season in November and December would be a confirmation that that is a time when people regularly fall behind on payments. Or seeing from a fail tableau that, for a given confidence level ĉ, there are more time periods in recent years than in less recent years where outstanding balances are high would suggest that people are having an increasingly difficult time keeping up with their debts.
More rigorously, assume a data set comprising a pair of numerical sequences a={a1, a2, . . . ai, . . . an} and b={b1, b2, . . . bi . . . bn} with ai, bi≧0 and for which the ith pair of values, (ai, bi), is associated with the ith value, ti, of an ordered attribute t={t1, t2, . . . ti, . . . tn}.
Given such a data set, the invention provides a tableau which comprises one or more subsets of values of the ordered attribute t that meet at least a first specified criterion, that criterion being that, for at least a specified fraction ŝ of the data set, a confidence measure for the pairs of values associated with each subset in the tableau is a) at least equal to a confidence value ĉ (when we want to provide a hold tableau) or b) no more than a confidence value ĉ (when we want to provide a fail tableau). The confidence measure for the pairs of values associated with each subset in the tableau is a measure of the degree to which those pairs of values deviate from an exact conservation law for the data in question.
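By way of illustration only, the criterion just stated can be sketched in Python as follows. This is a brute-force, greedy sketch under stated assumptions and is not necessarily the procedure used by the invention: the function build_tableau, its parameters, and the stand-in confidence function ratio_confidence (the smaller interval total divided by the larger) are hypothetical names and choices introduced here, and the sample values are invented.

```python
import math


def build_tableau(n, conf, c_hat, s_hat, kind="hold"):
    """Greedily pick intervals (i, j), 1-based and inclusive, that meet the
    confidence criterion until they cover a fraction s_hat of n positions.
    Returns None when the qualifying intervals cannot reach that coverage."""
    if kind == "hold":
        meets = lambda c: c >= c_hat
    elif kind == "fail":
        meets = lambda c: c <= c_hat
    else:
        raise ValueError("kind must be 'hold' or 'fail'")
    # Enumerate every interval and keep those satisfying the criterion.
    candidates = [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)
                  if meets(conf(i, j))]
    needed = math.ceil(s_hat * n - 1e-9)  # positions that must be covered
    covered, tableau = set(), []
    while len(covered) < needed:
        gain = lambda iv: len(set(range(iv[0], iv[1] + 1)) - covered)
        best = max(candidates, key=gain, default=None)
        if best is None or gain(best) == 0:
            return None
        tableau.append(best)
        covered.update(range(best[0], best[1] + 1))
    return sorted(tableau)


def ratio_confidence(a, b):
    """Hypothetical stand-in confidence: min/max of the interval totals."""
    def conf(i, j):
        total_a, total_b = sum(a[i - 1:j]), sum(b[i - 1:j])
        larger = max(total_a, total_b)
        return 1.0 if larger == 0 else min(total_a, total_b) / larger
    return conf


a = [5, 5, 0, 5, 5]  # invented values
b = [5, 5, 5, 5, 5]
conf = ratio_confidence(a, b)
print(build_tableau(5, conf, c_hat=0.9, s_hat=0.8, kind="hold"))  # [(1, 2), (4, 5)]
print(build_tableau(5, conf, c_hat=0.1, s_hat=0.2, kind="fail"))  # [(3, 3)]
```

In this invented example the hold tableau reports the two ranges in which the sequences agree, while the fail tableau isolates the single position at which a contributes nothing.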
In illustrative embodiments of the invention, the confidence measure for the pairs of values {(ai, bi), (ai+1, bi+1), . . . , (aj, bj)} for any interval {i, i+1, . . . , j} is a function of an area between two curves: A={A1, A2, . . . Ai, . . . An}, derived from a, where A0=0 and Ai is the sum of ak over all k≦i; and B={B1, B2, . . . Bi, . . . Bn}, derived from b, where B0=0 and Bi is the sum of bk over all k≦i. That area, in particular, is the area between a segment of curve A between Ai and Aj and a segment of curve B between Bi and Bj.
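By way of illustration only, the cumulative curves and the area between them can be sketched in Python as follows. The passage above states that the confidence is a function of that area but does not give the function; the choice made here, namely one minus the ratio of the area between the curves to the area under the upper curve, is an assumption, as are the function names and the approximation of each area by summing curve values at the positions of the interval.

```python
from itertools import accumulate


def cumulative_curves(a, b):
    """Return the running-sum curves A and B built from sequences a and b."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return list(accumulate(a)), list(accumulate(b))


def area_between(a, b, i, j):
    """Discrete area between curves A and B over positions i..j
    (1-based, inclusive), taken as the sum of pointwise absolute gaps."""
    A, B = cumulative_curves(a, b)
    return sum(abs(A[k - 1] - B[k - 1]) for k in range(i, j + 1))


def confidence(a, b, i, j):
    """Assumed confidence: 1 - (area between curves) / (area under the
    upper curve), so that 1.0 means the curves coincide on the interval."""
    A, B = cumulative_curves(a, b)
    upper = sum(max(A[k - 1], B[k - 1]) for k in range(i, j + 1))
    if upper == 0:
        return 1.0
    return 1.0 - area_between(a, b, i, j) / upper


a = [2, 0, 3]  # invented values; A = [2, 2, 5]
b = [1, 1, 3]  # invented values; B = [1, 2, 5]
print(area_between(a, b, 1, 3))  # 1
print(confidence(a, b, 2, 3))    # 1.0
```

With non-negative inputs this assumed measure lies between 0 and 1, and it equals 1 on any interval over which the two cumulative curves coincide.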