The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Computer systems in the security industry often need to process large data sets from a variety of sources. These data sets can include machine scanning results, network traffic surveillance data, logs from intrusion detection systems, or machine-generated test cases. Such systems are most valuable when they can work on these data sets without arbitrary constraints and can correlate the data to reveal previously undiscovered patterns as requirements emerge. However, as an industry grows, so do the diversity and size of the data sets, and this growth causes scaling problems for single-host or unstructured multi-host solutions. A particular problem is how to integrate a completely new kind of data set into a security system or security analytics system that was not originally programmed to process that kind of data set.
Distributed computation is often employed as a means of managing large data sets. However, as the complexity of the computation grows, the systems must become more specialized in the domain of their application. This is a direct consequence of tradeoffs made in order to arrive at a usable state. Therefore, moving generic frameworks and systems for distributed computation towards customized end-user applications is desirable.
Map-reduce (MR) is a well-known, industry-standard distributed computing paradigm. The MR model enables general-purpose computations over distributed computing nodes. MR is a two-stage distributed computing model. In the first stage (also known as the “map” stage), a transformation is applied to each item of the input data set. In the second stage (also known as the “reduce” stage), the output of the first stage is reduced to a smaller set. The second stage can be applied iteratively until a desired result is reached. This model maps naturally onto distributed systems because each application of the first-stage transformation is independent of the others, and the second stage can be distributed based on data-derived keys. Open source data-driven scripting languages such as Apache Pig use the MR model to let the user perform general-purpose computation on data sets over a set of distributed nodes.
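The two stages described above can be illustrated with a minimal single-process sketch. It is not a distributed implementation and does not reflect any particular MR framework; the function names (`map_stage`, `shuffle`, `reduce_stage`, `run_job`) and the log line format are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_stage(records, mapper):
    """Apply the mapper to each record independently; it yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group mapped values by their data-derived key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_stage(groups, reducer):
    """Reduce each key's values to a single value."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Hypothetical intrusion detection log lines: "<source_ip> <alert_type>"
LOG_LINES = [
    "10.0.0.5 port_scan",
    "10.0.0.7 brute_force",
    "10.0.0.5 brute_force",
    "10.0.0.5 port_scan",
]


def alerts_per_source(line):
    """Mapper: emit one count per alert, keyed by source address."""
    source_ip, _alert_type = line.split()
    yield source_ip, 1


def run_job(lines):
    mapped = map_stage(lines, alerts_per_source)
    return reduce_stage(shuffle(mapped), lambda a, b: a + b)


RESULT = run_job(LOG_LINES)
print(RESULT)  # {'10.0.0.5': 3, '10.0.0.7': 1}
```

Because the mapper touches one record at a time and the reducer touches one key at a time, both stages could in principle be partitioned across nodes, which is the property the paragraph above attributes to the MR model.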
The logical components that process the data using the MR model are called “analytics.” Each analytic consumes an input data set to produce an output data set. While MR provides significant operational guidance in principle, substantial intermediate processing is required to set up data delivery pipelines, ensure data consistency, and store and curate intermediate and final result data sets. Moreover, because data is typically received from a variety of sources, and different sources may structure their data differently, it can be difficult to correlate the data effectively.
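As a rough illustration of the correlation difficulty, the following sketch shows two differently structured sources being brought into a common shape before a downstream analytic can relate them. The record layouts, field names, and functions (`normalize_scan`, `normalize_ids`, `correlate_by_ip`) are assumptions made for this example and are not drawn from any specific product.

```python
def normalize_scan(record):
    """Analytic: hypothetical scanner record {"host", "open_ports"} -> common shape."""
    return {"ip": record["host"], "source": "scanner", "detail": record["open_ports"]}


def normalize_ids(record):
    """Analytic: hypothetical IDS record (ip, alert) tuple -> common shape."""
    ip, alert = record
    return {"ip": ip, "source": "ids", "detail": alert}


def correlate_by_ip(records):
    """Analytic: list which sources reported each address."""
    seen = {}
    for record in records:
        seen.setdefault(record["ip"], set()).add(record["source"])
    return {ip: sorted(sources) for ip, sources in seen.items()}


# Two input data sets with different structures (both hypothetical)
SCAN_DATA = [
    {"host": "10.0.0.5", "open_ports": [22, 80]},
    {"host": "10.0.0.7", "open_ports": [443]},
]
IDS_DATA = [("10.0.0.5", "port_scan")]

# Each analytic consumes an input data set and produces an output data set.
normalized = [normalize_scan(r) for r in SCAN_DATA] + [normalize_ids(r) for r in IDS_DATA]
CORRELATED = correlate_by_ip(normalized)
print(CORRELATED)  # {'10.0.0.5': ['ids', 'scanner'], '10.0.0.7': ['scanner']}
```

Every new source in this sketch needs its own hand-written normalizer, which hints at why integrating a previously unseen kind of data set is costly.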
Thus, effective security analysis requires collating data from a variety of sources. A single database is not necessarily sufficient to structure the data from disparate but relatable sources. One reason is that most databases are ill-suited to the unforeseen scenario of structuring data from different sources that must be correlated based on emerging requirements. Therefore, a new form of data management environment is needed that enables arbitrary structuring of data.
Furthermore, due to the MR computing model's prominence, flexibility, usability, and applicability, several commercial off-the-shelf implementations are available, such as Amazon Elastic MapReduce (EMR) or Google Cloud Services. EMR and similar implementations give end-users the ability to purchase compute time in managed offsite data centers. However, end-users must still work out for themselves how to apply these offerings to their specific problems.