Field of the Embodiments
Embodiments of the invention relate to the field of large-scale, multi-source data sharing. More particularly, embodiments of the invention relate to an infrastructure for facilitating data sharing and analysis across disparate data sources in a secure, distributed process.
Description of Existing Art and Identification of Technical Problem
Our world is a complex place. It is faced with many difficult, interdependent problems, some caused by nature, others caused by man. As individuals, families, communities, and nations, we face an ever-changing and compounding series of perplexing challenges spanning numerous domains: defense, health, climate, food, cyber, energy, transportation, education, weather, and the economy. Compounding pressures in each of these areas threaten our health, our safety, our security, our livelihood, and our sustainability. We seek improved capabilities to detect, understand, mitigate, and prevent the threats of our brave new world. To address these challenges, we invariably resort to science, our systematic enterprise for building and organizing knowledge that helps us understand, explain, and predict the world around us. At the core of science is inquiry. We formulate questions. We generate hypotheses. We predict consequences. We experiment. We analyze. We evaluate. We repeat. Our problems are complex; the process is slow.
Fueling the scientific process are the observations we make and the data we collect. With the advent of the 21st-century telecommunications explosion, data is now flowing and evolving all around us in massive volumes, with countless new streams mixing and shifting each minute. This data space is enormous and continuously changing. And by many accounts, its expansion and movement have only just begun. Analyzing and understanding this vast new ocean of data is now of paramount importance to addressing many of the complex challenges facing our world.
Today's data analytics industry is vibrant, with a continuous supply of new and innovative products, services, and techniques that thrive and prosper based on their relative merits in their respective marketplaces. Unfortunately, these components are rarely interoperable at any appreciable scale. Moreover, the rapid proliferation of analytic tools has further compounded the problem. With only loose coordination, these partial solutions are ineffective at combating the broad spectrum of problems. Attempting to impose a “one-size-fits-all” analytic solution across today's tremendous data expanse, however, poses significant scientific, technical, social, political, and economic concerns. Consequently, an enormous amount of resources must regularly be expended to address isolated issues and mitigate specific threats. Thus, the analytic community faces considerable challenges dealing with major classes of problems, particularly those at national and international levels.
Specifically, data and analytics collaborators often adopt unique trust relationships with data source owners, evolve unique analytic approaches, use a variety of visualization systems, and leverage a diversity of analytic platforms and tools. Managing a shared knowledge space that is centrally located requires all transactions between these items to flow into a single site and then flow back out, creating a bottleneck and a single point of failure. The loss or deterioration of the central point's resources implies the loss or deterioration of the entire knowledge space. Replicating the knowledge space with multiple sites serving as mirrors and/or backups leads to unnecessary duplication, complex interfaces, large data movement, and complicated synchronization, privacy, and security policies. Institutionally, such alternatives invariably require organizations to commit to a structure over which some may have little control, while placing greater operational burden, responsibility, and control on others. Balancing all these factors invariably leads to difficult negotiations involving data ownership, knowledge curation, organizational autonomy, and research independence. Accommodating the continuous flood of new and ever-changing data, theories, and interpretations also requires a dynamic knowledge space, further challenging a centralized design.
Accordingly, there is a need for a solution that addresses numerous issues standing in the way of sharing data for analytics on a global scale. These issues include, for example: the massive logistics problem of attempting to integrate thousands of government and non-government data systems at scale when the systems have different standards, models, security, infrastructure, procedures, policies, networks, access, compartments, applications, tools, protocols, and the like; the increased security risk that follows large-scale integration of data resources; the lack of analytic techniques that automatically detect data patterns and provide alerts, i.e., the means to transition from “analytic dumpster diving” to early-warning indication and real-time notification; and the privacy tensions between security and liberty.