The term “big data,” as used today, refers to voluminous and complex data sets as well as the challenges associated with analyzing and processing them. The notion of “big data” is a relatively recent phenomenon that results in part from the reduced cost of storage and in part from the proliferation of data-gathering technologies and techniques in virtually every sector of business and technology. “Big data” may be structured, unstructured, or partially structured and may be distributed across a network.
Because of the vastness and relatively loose boundaries of what defines “big data,” the possession of “big data” is problematic: the data tends to be difficult to parse for the purposes of identifying valuable information embedded therein and extracting that information while distinguishing it from noise or other misleading information in the data set.
Big data is also problematic because it does not necessarily contain information or knowledge. In other words, given a large data set, it is often difficult to ascertain whether an apparent input and an apparent output are related in any way. Big data can also be noisy and misleading simply due to the vastness of the data set, meaning that it is often difficult to filter noise and misleading data to extract usable knowledge. Such large data sets can also prove to be a barrier to swift decision making given their cumbersome nature. Big data can also contain confirmations of existing knowledge (e.g., that adults are living longer on average in 2014 than they did in 1914) rather than providing new knowledge and models. Finally, given that searching for information in big data sets can be like looking for a needle in a haystack, the value of finding such “needles” may be outweighed by the cost of locating them.
Despite these identified problems, data is fundamentally the only source of knowledge in the world. Humans learn from observing the occurrence of actual events, experimentation, and the like. Most, if not all, scientific fields are based on knowledge obtained from collecting and analyzing data. Even in situations where functions or models are developed to describe a phenomenon, the scientific process usually involves collecting data and fitting a function or model to the data until an adequate degree of accuracy is achieved.
Traditional modeling and machine learning paradigms have proven to be insufficient and ineffective for handling big data. Specifically, traditional modeling paradigms are highly dependent on the erroneous assumption that input-output or functional relationships must exist to extract meaningful knowledge from big data sets. Accordingly, traditional modeling and machine learning paradigms have sought to develop functions or models that describe the relationship between outputs and inputs, with “success” being judged based on how much of the actually observed data is described by the function or model. As such, much of the research related to big data heretofore has been focused on understanding and attempting to derive functions or other easily-expressible input/output relationships based on already-stored data to predict behavior or to otherwise analyze the data set to extract useful relationships.
One problem with this traditional modeling approach is that there may be hidden layers or layers of abstraction to explain why a particular input layer leads to a particular output layer. In a simplified example, a function to describe the time to heat water by a certain number of degrees may be derived from data obtained at sea level, and because of hidden layers (i.e., the impact of thinner atmosphere at higher elevations), the model may be ineffective to describe observed data at mountainous elevations. While these hidden layers are readily seen in this example, they can be much more numerous and difficult to identify in more complex input/output relationships, such as relationships involving the financial industry.
As alluded to above, the traditional approach also typically assumes a functional form for the relationship between inputs and outputs. This functional form is usually assumed to be a linear model, such as ARMA, GARCH, and the like, although in more exotic situations, nonlinear models including logistic models and neural networks are assumed. One problem with this functional assumption is that searched-for functions tend to be time invariant, taking the same form at different times. Similarly, these functions tend to be event invariant in that they take the same form regardless of the underlying events that are occurring. This byproduct of the assumption that relationships are functional fails to take into account that the world is often discrete and discontinuous: changes in time and events mean that the outputs corresponding to a set of inputs can change discontinuously depending on the time and/or underlying events. In other words, a fixed continuous relationship may not exist in many real-world systems, such as financial markets and economic systems. Assuming that functions define the relationship between inputs and outputs, and then determining models or functions on that basis, is a necessary tool for small data sets because significant interpolations and extrapolations are necessary with small data. This approach also works well for a classic physical system where fixed continuous relationships do exist (e.g., a set of 1000 drops of an object from varying heights used to generate the equation Force=mass*acceleration). However, “big data” makes determining a functional description unnecessary because fewer interpolations and extrapolations are necessary. Moreover, observed data samples themselves can be dense enough to represent the input and output relationship as described in this disclosure.
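The limitation of a time-invariant and event-invariant function can be illustrated with a minimal sketch. The data below is hypothetical and constructed so that the input/output relationship changes abruptly at an assumed event at t = 10; the helper names fit_line and max_abs_error are illustrative and not part of this disclosure. A single least-squares line fitted to all of the data is compared with separate lines fitted before and after the event.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def max_abs_error(xs, ys, a, b):
    """Largest absolute residual of the line y = a*x + b on the data."""
    return max(abs(y - (a * x + b)) for x, y in zip(xs, ys))


# Hypothetical data: the relationship flips at an assumed "event" at t = 10.
ts = list(range(20))
xs = [float(t) for t in ts]
ys = [2.0 * x if t < 10 else -2.0 * x + 50.0 for t, x in zip(ts, xs)]

# One time-invariant function fitted to all of the data.
a_all, b_all = fit_line(xs, ys)
err_single = max_abs_error(xs, ys, a_all, b_all)

# Separate fits before and after the event.
a_pre, b_pre = fit_line(xs[:10], ys[:10])
a_post, b_post = fit_line(xs[10:], ys[10:])
err_split = max(
    max_abs_error(xs[:10], ys[:10], a_pre, b_pre),
    max_abs_error(xs[10:], ys[10:], a_post, b_post),
)

print(f"single fit max error: {err_single:.2f}")
print(f"per-regime max error: {err_split:.2f}")
```

In this constructed case the single function leaves a large residual near the event, while the per-regime fits reproduce the data almost exactly. The per-regime fits, however, presuppose that the time of the event is already known, which is rarely the case in practice.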
Also, for many applications other than simplified physical systems, it is, at best, very difficult to find an accurate functional relationship between inputs and outputs, and more often, such simple functional relationships just do not exist (e.g., where thousands of different objects from extremely small to extremely large are dropped in different fluids, ranging from air to water to a vacuum, with drag playing a varying role depending on the object and the fluid density).
For the reasons described above, traditional modeling approaches have been most successful in determining physical laws, such as Newton's laws, Kepler's laws of planetary motion, and the like. Given that objects have continuous motion, interpolation and extrapolation can be readily performed without the need for substantial amounts of data. In more complex settings, such as with financial data, it is possible that different functions can describe the same set of data. It is difficult or impossible to determine which function is the “right” function to describe the data.
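As a minimal illustration of how different functions can describe the same set of data, the sketch below uses two hypothetical candidate functions, f and g, and four assumed sample points. Both candidates match every sample exactly, so the samples alone give no basis for preferring one over the other, yet the two diverge immediately outside the sampled range.

```python
def f(x):
    """Candidate model 1: a straight line."""
    return x


def g(x):
    """Candidate model 2: the same line plus a term that vanishes at x = 0, 1, 2, 3."""
    return x + x * (x - 1) * (x - 2) * (x - 3)


observed_inputs = [0, 1, 2, 3]  # hypothetical sample points

# Both candidates reproduce every observed sample exactly...
agree = all(f(x) == g(x) for x in observed_inputs)

# ...yet they extrapolate differently just outside the sample.
gap_at_4 = g(4) - f(4)

print(agree, gap_at_4)  # prints: True 24
```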
Moreover, the uncertainty resulting from trying to match a function to a data set in traditional approaches takes the form of a probability function, such as a normal distribution, making the derived function less useful. In the past this notion has been described as an “error term,” whereby the relationship between an input and an output is defined by some function plus a random variable representing uncertainty, such as Gaussian random noise. However, the problems discussed above in generating the relationship function are exacerbated by attempting to determine an uncertainty using a random variable with a predefined form of distribution, making traditional approaches even more unwieldy with large sets of data.
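The “error term” formulation described above is conventionally written as follows. The symbols are assumptions introduced here for illustration: x denotes the input, y the output, f the fitted function, and σ² the variance of the assumed Gaussian noise.

```latex
y = f(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2})
```

Under this formulation, both the form of f and the form of the distribution of ε are fixed in advance, which is the predefined-distribution assumption criticized above.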
Traditional approaches also generally fail to appreciate that the most useful output to users of systems that provide big data analysis capabilities is often a distribution of possible outcomes, as opposed to a single “best” outcome. That is, existing systems seek to define a function or group of functions to calculate a “best” outcome rather than providing a user with a distribution of outcomes that indicates several possible outcomes and a likelihood or incidence of each of the possible outcomes. This effort ignores the fact that users are often better served by seeing the range of possible outcomes together with an indication of how likely each outcome is to occur.
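The difference between a single “best” outcome and a distribution of outcomes can be sketched as follows. This is an illustration only and is not presented as the method of this disclosure; the observations, the input labels, and the function names outcome_distribution and single_best are all hypothetical.

```python
from collections import Counter

# Hypothetical observations: (input, outcome) pairs as might be stored in a data set.
observations = [
    ("rate_cut", "up"), ("rate_cut", "up"), ("rate_cut", "down"),
    ("rate_cut", "flat"), ("rate_cut", "up"), ("rate_hike", "down"),
    ("rate_hike", "down"), ("rate_hike", "up"),
]


def outcome_distribution(data, query_input):
    """Relative frequency of each outcome observed for the given input."""
    counts = Counter(outcome for inp, outcome in data if inp == query_input)
    total = sum(counts.values())
    if total == 0:
        return {}  # no observations for this input
    return {outcome: n / total for outcome, n in counts.items()}


def single_best(data, query_input):
    """The single most frequent outcome, discarding the rest of the distribution."""
    counts = Counter(outcome for inp, outcome in data if inp == query_input)
    return counts.most_common(1)[0][0] if counts else None


best = single_best(observations, "rate_cut")
dist = outcome_distribution(observations, "rate_cut")
print(best)  # prints: up
print(dist)  # prints: {'up': 0.6, 'down': 0.2, 'flat': 0.2}
```

The single answer “up” hides the fact that, in this assumed data, two of every five observations for the same input ended differently, which is the information a distribution of outcomes preserves.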
The emphasis on functional or relational forms is the result of the erroneous assumption that there is necessarily a single or a small number of functional relationships between inputs and outputs. Furthermore, with little idea of which function is proper under different circumstances, the traditional modeling of big data produces far too many outliers or exceptions to the rule defined by the function. The system disclosed herein recognizes that this approach may be ill-suited to process “big data” sets that are becoming more and more readily available, and presents a paradigm shift in terms of the way big data analysis is approached.