As the volume of digitized information grows, methods are required for effectively managing enormous datasets. So-called “big data” has become a term referring to datasets so large or complex that traditional data processing applications are inadequate. Problems arise when attempting to manage big data, including how to store it effectively, how to retrieve it quickly, and how to manipulate it easily. Other problems include searching, transferring, analyzing, visualizing, and/or updating the data.
Often limited by hardware restrictions, data scientists and engineers are required to develop new methods for big data management. One potential solution includes using massively parallel processing (MPP). MPP refers to using a large number of processors or separate computers to perform a set of coordinated computations in parallel. MPP databases may also be used to process and store data by dividing big data into chunks manageable by each of the separate processors. An example of this distributed processing and storage is the Apache Hadoop® framework, which utilizes computer clusters formed from multiple pieces of commodity hardware. Apache Hive® and Spark® are also frameworks that run on computer clusters.
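The chunk-and-combine pattern described above can be illustrated with a minimal sketch. It is an assumption-laden toy, not the implementation of any particular MPP database: the function names are hypothetical, and worker threads on one machine stand in for the separate processors or computers of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_workers):
    """Divide records into at most num_workers roughly equal contiguous chunks."""
    chunk_size = max(1, -(-len(records) // num_workers))  # ceiling division
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def partial_sum(chunk):
    """Work performed independently by one worker on its own chunk."""
    return sum(chunk)


def parallel_sum(records, num_workers=4):
    """Scatter chunks to workers, then gather and combine the partial results.

    Threads are used here only as a stand-in (assumption) for the separate
    processors or nodes of an MPP system.
    """
    chunks = split_into_chunks(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)
```

For example, `parallel_sum(list(range(1000)))` splits the list into four chunks of 250 values, sums each chunk on its own worker, and adds the four partial sums together.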
While MPP databases have improved the handling of big data, they are not a complete solution. Processing a query often requires the query to execute completely before a result is returned. In the big data context, waiting for an entire query to complete increases the latency between query execution and result delivery. Further, visualizing and accessing stored data are often difficult when managing big data, because this latency often prevents real-time visualization.
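The latency issue can be made concrete with a small sketch that contrasts a query returning only after every chunk has been scanned with one that exposes a running result after each chunk. Both functions and their names are hypothetical illustrations; the incremental variant is shown only for contrast and is not presented as the behavior of any existing MPP database.

```python
def blocking_count(chunks, predicate):
    """Scan every chunk before returning: a single result, available only at the end."""
    total = 0
    for chunk in chunks:
        total += sum(1 for row in chunk if predicate(row))
    return total


def incremental_count(chunks, predicate):
    """Yield a running count after each chunk, so a caller sees partial results early."""
    total = 0
    for chunk in chunks:
        total += sum(1 for row in chunk if predicate(row))
        yield total
```

With the blocking form, a visualization cannot update until the last chunk is processed; with the incremental form, a caller can iterate over the generator and redraw after each partial count.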
Another problem with current systems is that queries are often sent to MPP databases as plain text, for example, as text strings. This type of query requires the MPP database to parse the query and generate a plan for fetching results. In this configuration, the latency between receipt of a query and result generation increases because multiple processors must communicate to determine how to parse the query.
As yet another problem, systems often fail to accommodate different contexts for data distribution and work for only a single context. For example, the functionality of the system may be limited by the query functions available to the processors of the MPP database. This single-context configuration limits the types of queries and the functional capabilities of the MPP database.
In the drawings, like reference numbers generally indicate identical or similar elements. Additionally, the left-most digit(s) of a reference number generally identify the drawing in which the reference number first appears.