Security and privacy of data in Cloud, Big Data, Mobile, and Internet of Things (IoT) workflows benefit from hardening and protection of data and computing resources, both online and at rest. All workflows go through a data lifecycle comprising these stages: create, store, use (compute), share, archive, and destroy.
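For illustration only, the lifecycle stages named above can be written down as an ordered enumeration. This is a minimal sketch that assumes the stages are visited strictly in the listed order; actual workflows may revisit or skip stages. The names `LifecycleStage` and `next_stage` are hypothetical and are not taken from the disclosure.

```python
from enum import IntEnum


class LifecycleStage(IntEnum):
    """Data lifecycle stages, numbered in the order listed in the text."""
    CREATE = 1
    STORE = 2
    USE = 3      # the "use (compute)" stage
    SHARE = 4
    ARCHIVE = 5
    DESTROY = 6


def next_stage(stage):
    """Return the stage after `stage`, or None once DESTROY is reached.

    Assumption: a strictly linear progression through the stages.
    """
    if stage is LifecycleStage.DESTROY:
        return None
    return LifecycleStage(stage + 1)


if __name__ == "__main__":
    stage = LifecycleStage.CREATE
    while stage is not None:
        print(stage.name.lower())
        stage = next_stage(stage)
```

An ordered enumeration makes it straightforward to attach a protection policy to each stage, for example a different control for data in the store stage than for data in the share stage.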
Big Data Analytics (BDA) is characterized by the collection and storage of very large volumes of data and by running analytics on such data stores to answer specific, time-sensitive questions. Data warehousing for Business Intelligence (BI) is a well-known precursor to BDA. For many years, Enterprises have been collecting growing volumes of data in data warehouses and running analytics for BI. High-performance computing (HPC) systems are another application, in which very large data sets are processed to model complex scientific phenomena. However, BDA applications are orders of magnitude larger in scale and differ in the type of data on which they focus, which is predominantly unstructured data. HPC and data warehousing typically deal with structured data. Their operations are run on a batch basis, often overnight for data warehousing, and over several days or even weeks for HPC. In contrast, BDA answers questions in real time or near real time, which involves storage and compute platforms that can handle very large amounts of data and that can scale appropriately to keep up with growth. Such platforms preferably also provide the input/output operations per second (IOPS) desired to meet the query speed requirements. Currently, the largest BDA practitioners address this need using hyperscale computing environments (HSEs), which are built by provisioning large numbers (hundreds of thousands) of commodity servers with direct-attached storage (DAS). Redundancy for failover is provided via mirroring. Such environments run analytics engines and typically have Peripheral Component Interconnect Express (PCIe) flash storage in the server, either alone or in addition to disk, to reduce storage latency to a minimum. These types of environments typically do not use shared storage. However, given the size, scale, and cost of HSEs, it is unlikely that the majority of Enterprises, large or small, would opt for such hyperscale platforms. Alternative innovations are desired.
Commodity Cloud platforms can provide the necessary computational power for such BDA applications, but they lack the very large amounts of DAS desired. Typical modern big data storage systems are often scale-out or clustered network-attached storage (NAS), which provides file-access shared storage that can scale out to meet capacity or increased compute requirements. Such NAS uses parallel file systems distributed across many storage nodes, which can handle billions of files without the performance degradation that ordinary file systems experience at scale. However, such NAS cannot always be collocated with the Cloud platforms, which represent the computational power. In other words, the Big Data and Big Compute (i.e., Cloud) platforms cannot always be collocated. There is a need for hybrid platform architectures that can facilitate BDA without insisting on collocation of Compute and Big Data stores.
If Cloud (compute) and Big Data stores cannot be collocated, an alternative is that they remain at separate locations, and the analytics (e.g., search) workloads are conducted as distributed processes over the Internet. There are two concerns with this model: the confidentiality, sovereignty, and privacy of the data, and the security of the data, including its protection from breaches. Data tampering, hacking, and illicit information disclosure are major security threats to storing and sharing data assets, such as files, in such environments. For example, healthcare organizations, business associates, and subcontractors that support the processing or collecting of protected health information are required to comply with the Health Insurance Portability and Accountability Act (HIPAA). However, unsecured or unencrypted data is commonly found in healthcare data breaches. Compliance is a key driver of Enterprise customers' need for advanced, complete security and privacy solutions and services on modern, large-scale distributed computer systems. Disclosed herein are systems and methods for hardening the security, confidentiality, and privacy of data both at rest, for example, on computer and data storage systems, and in transit, for example, over a network.
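As a generic illustration of the tampering threat mentioned above, the following sketch shows one well-known way to detect modification of a stored or transmitted data asset: a keyed message authentication code (HMAC-SHA256) computed with the Python standard library. It is not the disclosed method. The function names `make_tag` and `verify_tag`, the sample record, and the assumption that the two sites already share a secret key are all hypothetical. An HMAC detects tampering only; it does not provide confidentiality.

```python
import hashlib
import hmac
import secrets


def make_tag(key: bytes, data: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over `data` with the shared `key`."""
    return hmac.new(key, data, hashlib.sha256).digest()


def verify_tag(key: bytes, data: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_tag(key, data), tag)


# Assumption: the key was distributed to both sites by some out-of-band means.
key = secrets.token_bytes(32)

# Hypothetical sample record; not real data.
record = b"record-id=0001;note=example"
tag = make_tag(key, record)

if __name__ == "__main__":
    print(verify_tag(key, record, tag))            # unmodified record
    print(verify_tag(key, record + b"!", tag))     # modified record
```

The tag would be stored or sent alongside the data, and the receiving side recomputes it before trusting the contents.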