A data explosion has taken place over the past few decades. In almost every industry and field of endeavor, data and metadata have been generated, organized and stored to a degree never before seen. Satellite images are now publicly available from multiple sources for the entire globe. Weather and climatological data are likewise available on a worldwide basis. Other information, such as population density and demographics, electrical grid capacities, water resources, geopolitical records and the like is available via both free and paid services, whether from the public sector or the private sector.
While the availability of such data permits creation of services and systems never before possible, it also poses a tremendous technical challenge. With so much available data, analysis is now often limited by the ability of computing systems to quickly access and process such data from any desired location. This is particularly true where there is an interest in using data from a variety of sources for a variety of applications. Each data source may be organized in its own way, coded uniquely and accessible in a manner different from other sources. Each application may expect to receive information in a particular way that may well not match the data sources that might be useful for the application.
To provide just one example, consider the data that may be available regarding a particular piece of real property, such as a suburban shopping center. Municipal data identifying the property based on its tax parcel identifier and perhaps on corresponding utility counts may be available. Street addresses for the businesses in the shopping center are also available and those are typically geocoded for various cartographic uses. The property is likewise identified by its geographical coordinates, what zoning district it lies in, whether it is inside or outside a particular flood zone boundary, its distance from utilities such as fiber optic distribution facilities, and the like. Each of these items of data is provided in a form determined for its typical use; for instance, tax maps are designed for use in administering municipal tax systems, while flood maps are designed for use in federal emergency planning. Notably, some types of properties are not well identified using approaches that work well for other types of properties. For example, while it makes sense to provide an address of a commercial building, such an approach cannot be used very well for a rail facility that measures only tens of feet in one dimension but hundreds of miles in another dimension. Likewise, utility transmission lines, gas and oil pipelines and such are continuously distributed throughout their geographic range, and in any event often do not have conventional physical addresses corresponding to the locations of their component parts. Thus, in many instances there is no particular compatibility or correlation among numerous data sources that all characterize a particular piece of real property or other type of asset.
Numerous attempts have been made to address such challenges. For example, relational databases provide a level of flexibility that has made them quite popular, but sometimes suffer from performance bottlenecks. Columnar database systems have enhanced performance but do not always lend themselves to common system requirements (where data is typically row-based) and thus often require ancillary processing to make them most helpful. In-memory database systems can improve on performance compared to both relational and columnar databases, but are costly to implement when used for large data sets. Data grid architectures have also been developed over the past two decades or so to handle extremely large data sets, such as those used in particle physics research (e.g., at CERN in Switzerland) and in climate modeling. While such data grids scale extremely well without performance losses seen in older architectures, to date they lack flexibility in managing widely disparate types of data without significant preprocessing.
In one application, the field of probabilistic modeling can involve simulating various periods, events or scenarios against a given set of items to be affected by the events, generating a series of potential outcomes for each event. Such models can be probability-based (i.e., generating a probability factor of the likelihood of each event's effects on a given item) or period-based (i.e., generating a set of simulated time periods, each containing a series of events and a likelihood of their impacts within a given event period simulation). Further discussion of related issues is provided in copending, commonly owned U.S. patent application Ser. No. 13/799,120 filed Mar. 13, 2013, published as US 2014/0278306, the contents of which are hereby incorporated by reference as if fully set forth herein. Such modeling can be applied to various fields and industry sectors, including agriculture (risks to farmland and agricultural products); supply chain (possible interruptions impacting certain vendors or procurement items); insurance (relating for instance to real property, people, or contents of buildings/containers); protection of governmental/municipal facilities (airports against weather or terrorism, dams and other flood defenses against storm water); energy (oil platforms, tankers and pipelines subject to leaks/spills); healthcare (ranging from disease prediction to analysis of possible pandemic/epidemic threats to personalized medicine procedure outcome prediction); and heavy industry (factory disruption based on events ranging from work stoppages to worker health to catastrophic events).
For each of these scenarios, probabilistic modeling provides a mechanism to generate large series of potential outcomes. The corresponding data generated is very large, with potentially billions or trillions of results from a single model run. The potential data sizes for such operations can be readily understood by considering that data cardinality in such situations may be driven by the product of the number of items of interest, the number of simulation periods, the number of events per period, and the average hit rate per item.
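The product described above can be illustrated with a minimal sketch. The function name, parameter names and input figures below are hypothetical and chosen only to show how quickly the result count grows; they are not taken from any particular model.

```python
def estimated_result_count(num_items, num_periods, events_per_period, hit_rate_per_item):
    """Estimate output cardinality as the product of the four driving factors.

    All four parameters are illustrative assumptions; hit_rate_per_item is the
    average fraction of events that affect a given item.
    """
    return num_items * num_periods * events_per_period * hit_rate_per_item


# Hypothetical inputs: one million items, ten thousand simulated periods,
# ten events per period, and each event affecting 1% of items on average.
count = estimated_result_count(
    num_items=1_000_000,
    num_periods=10_000,
    events_per_period=10,
    hit_rate_per_item=0.01,
)
print(f"Estimated results from one model run: {count:,.0f}")
```

Even with these modest assumed figures the estimate is on the order of a billion results, and increasing any single factor scales the total proportionally.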
Such processing has typically been performed in a “map/reduce” manner. In many types of processing, the task is to process a large amount of input data and reduce it to a small number of “answers” (often only one). With probabilistic modeling, however, the output size is not typically reduced, since the model outputs are intended to be thereafter available for further analysis to address a wide range of questions that may be posed by those interested in one particular aspect of the larger field being modeled.
Still further, in many applications various items being considered may themselves carry their own sets of attributes. For example, a machine or other heavy asset may have its own metadata regarding its age, time until next service and the like; a facility may likewise have attributes or constraints (an airport may be subject to a noise curfew of 11 p.m. local time for outbound flights and midnight for inbound flights). Thus, metadata for items considered, in order to be available for future analysis, adds complexity to such modeling. Data structures for an item may be expressed as a nested hierarchy of a large number of attributes, which may make a traditional relational database structure difficult or nearly impossible to use. Further discussion of such complexity is provided, for example, in copending commonly owned U.S. patent application Ser. No. 13/914,774, filed Jun. 11, 2013 and published as US 2013/0332474 and PCT application PCT/US2015-022776, filed Mar. 26, 2015, the contents of which are hereby incorporated by reference as if fully set forth herein.
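As a rough illustration of such nesting, the sketch below represents the airport curfew example as a nested structure and then flattens it into dotted column names, as a flat relational table might require. The attribute names and the `flatten` helper are hypothetical and are not drawn from the referenced applications.

```python
def flatten(attributes, prefix=""):
    """Flatten a nested attribute dictionary into dotted-path keys."""
    flat = {}
    for key, value in attributes.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested attribute groups.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical item record; the field names are assumptions for illustration.
airport = {
    "type": "airport",
    "constraints": {
        "noise_curfew": {
            "outbound_local_time": "23:00",
            "inbound_local_time": "00:00",
        }
    },
}

for column, value in flatten(airport).items():
    print(column, "=", value)
```

Each additional level of nesting adds columns whose names encode the hierarchy, and different item types produce different column sets, which suggests why a fixed relational schema can become unwieldy when items carry hundreds of such attributes.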
The resulting data analysis challenges include, for example, items that are characterized along hundreds if not thousands of dimensions, many of which are nested or otherwise arranged in non-trivial relationships; such information, referred to herein as dimension data, serves both as the subject of the model and as a source of meaning and interpretation for the post-processing output of probabilistic models. It may not always be beneficial to simplify this robustly characterized information because future processing may call upon the various relationships in manners not known in advance. Further, each specific probabilistic model typically requires specialized information to perform its simulations. A model simulating a railroad line and a model simulating a catastrophe that could affect municipal infrastructure would be highly differentiated, each requiring specialized input data about the items under simulation. Therefore, in addition to the input data size being extremely large, the output data size of interest may remain large, on the terabyte scale for even a single modeling exercise. In addition to the data size, the types of calculations on the data for such modeling are often not simple, but can include many complex non-additive metrics. The combination of data structure, data size, and processing type issues makes it quite difficult to contemplate use of conventional computing architectures for such processing.
Conventional model computation constructs a given model as a single large function and is known as “monolithic model implementation.” Typically, this is implemented using a primary programmatic loop over repeating elements, represented in pseudo-code as:
for each Event
    for each Item
        model simulation computation
This approach is referred to as event/item loop processing. For each iteration of the loop, the model simulation needs to be computed, which can quickly result in data and computational “explosions” as the number of events and items increases. The model itself can be scaled across multiple computational devices; however, the complex nature of the dimension data, with its inherent data relationships, makes scaling this type of information very difficult. Some attempts have been made to use columnar databases in such applications, but these have their own scaling issues, and columnar databases are typically not well-suited for supporting complex object types.
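A minimal runnable sketch of event/item loop processing follows. The `toy_simulation` function and the `severity` and `exposure` fields are hypothetical placeholders standing in for an actual model computation, which the text does not specify.

```python
def event_item_loop(events, items, simulate):
    """Run the simulation once for every (event, item) pair."""
    results = []
    for event in events:
        for item in items:
            results.append((event["id"], item["id"], simulate(event, item)))
    return results


def toy_simulation(event, item):
    # Hypothetical stand-in for the model: scale item exposure by event severity.
    return event["severity"] * item["exposure"]


# Illustrative inputs only: 3 events and 4 items.
events = [{"id": e, "severity": 0.1 * (e + 1)} for e in range(3)]
items = [{"id": i, "exposure": 1000 * (i + 1)} for i in range(4)]

results = event_item_loop(events, items, toy_simulation)
print(len(results))  # one result per (event, item) pair
```

Because the simulation runs once per pair, the work and the output both grow with the product of the number of events and the number of items, which is the source of the "explosions" noted above.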
Rather than using computing systems with typical, known architectures, it would be desirable to have available a computing system with an architecture that optimizes processing of such large, complex and disparate data sets across a wide variety of applications.
The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings and text herein. Moreover, it should be noted that the language used in the text has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter.