A typical relational database is organized into structured tables. The tables have records stored in each row, and the fields associated with each record are organized into columns. Users and applications retrieve information from the tables by making queries to the database that search one or more tables for content that meets certain criteria. These queries may be drafted to look for trends in data that explain or predict the occurrence of a particular phenomenon.
For example, assume a banking executive creates a low risk lending package for car buyers, but the car buyers simply are not purchasing the lending package. A marketing analyst may consult a sales database to determine how to increase sales. The marketing analyst could query the sales database, which contains similar lending packages sold over previous years, to determine trends in packages that sell well. An example query issued for this purpose may have the following form in SQL:
SELECT sales, date, risk FROM lendingPKG_sales_table ORDER BY date;
After sorting the sales by date, the marketing analyst may notice that high risk packages sell best during the summer months and low risk packages sell best during the winter months. Based on this trend, the marketing analyst may report that lending packages should be tailored to sell to car buyers for a particular season.
Arriving at a hypothesis that correlates risk with the time of year requires creating a query to a specific table in a database having the required fields to test this correlation. In this example, the records containing “sales”, “date”, and “risk” are pulled from data already organized into columns labeled “sales”, “date”, and “risk”.
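The query described above may be exercised against a small in-memory table as a minimal sketch. The example below assumes Python's standard sqlite3 module, assumes the column names “sales”, “date”, and “risk” from the preceding paragraph, and uses sample rows that are invented purely for illustration and are not taken from any actual sales data.

```python
import sqlite3

# Hypothetical sample rows (sales, date, risk); values are invented for illustration.
rows = [
    (120, "2013-07-15", "high"),
    (40, "2013-01-10", "high"),
    (95, "2013-12-05", "low"),
    (30, "2013-06-20", "low"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lendingPKG_sales_table (sales INTEGER, date TEXT, risk TEXT)"
)
conn.executemany("INSERT INTO lendingPKG_sales_table VALUES (?, ?, ?)", rows)

# ISO-8601 date strings sort chronologically when compared as plain text.
result = conn.execute(
    "SELECT sales, date, risk FROM lendingPKG_sales_table ORDER BY date"
).fetchall()
conn.close()

for sales, date, risk in result:
    print(date, risk, sales)
```

The query succeeds only because the table already exposes the three columns by name, which is the precondition the paragraph above describes.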
Unfortunately, a single database containing these fields may not exist. When the necessary databases do not already exist, a data analyst may attempt to ascertain trends from large quantities of data, referred to as “big data”, without having the data organized into a single table that may be queried to readily show trends in the data.
Big data may comprise thousands or even hundreds of thousands of files that are organized into different data structures. Navigating this data may be difficult due to the size of the data (terabytes to petabytes) and the heterogeneous nature of the files. A dataset consisting of big data may comprise numerous files with many different key-value pairs. For example, consider the following file.
{Name:Jon, Date: Sep. 1, 2014}
{Name:Ben, Date: Sep. 2, 2014}
{Name:Erin, Date: Sep. 3, 2014, Phone: 555-1234}
Here, there are three records that do not all share the same keys. Specifically, two records contain key-value pairs for the keys “Name” and “Date”, while the third record contains key-value pairs for the keys “Name”, “Date”, and “Phone”.
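The heterogeneity of the example file may be illustrated with a minimal sketch. The example below assumes the three records have already been parsed into Python dictionaries, and the helper names unified_columns and to_rows are hypothetical. It collects the union of keys across the records and fills any missing value with None so that the records fit a single tabular layout.

```python
# The three records from the example file, assumed to be already parsed.
records = [
    {"Name": "Jon", "Date": "Sep. 1, 2014"},
    {"Name": "Ben", "Date": "Sep. 2, 2014"},
    {"Name": "Erin", "Date": "Sep. 3, 2014", "Phone": "555-1234"},
]


def unified_columns(recs):
    """Return every key seen in any record, in first-seen order."""
    cols = []
    for rec in recs:
        for key in rec:
            if key not in cols:
                cols.append(key)
    return cols


def to_rows(recs):
    """Flatten records into (columns, rows), using None for absent keys."""
    cols = unified_columns(recs)
    return cols, [tuple(rec.get(col) for col in cols) for rec in recs]


columns, table_rows = to_rows(records)
print(columns)
for row in table_rows:
    print(row)
```

With only three records the union of keys is trivial to compute, but the same scan over many files with differing structures requires reading every record before a table layout can be fixed.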
In a typical database, tables are stored on disk and portions of each table are loaded into volatile memory in order to respond to queries. The speed at which a given database server is able to answer a query is based, at least in part, on how long it takes to load the necessary rows into volatile memory. The speed of responding to a query may be improved by indexing a table first based on a column, and then reading the index to determine what rows should be loaded into volatile memory. Because fewer rows need to be read into volatile memory, the speed of loading the table is improved.
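The benefit of reading an index before loading rows may be sketched as follows. The example below is an illustration only: it models a table as a Python list, models the index as a dictionary from column value to row positions, and uses hypothetical helper names build_index and lookup along with invented sample rows. An actual database server would typically use an on-disk structure such as a B-tree rather than an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical table contents; values are invented for illustration.
table = [
    {"sales": 120, "date": "2013-07-15", "risk": "high"},
    {"sales": 40, "date": "2013-01-10", "risk": "high"},
    {"sales": 95, "date": "2013-12-05", "risk": "low"},
    {"sales": 30, "date": "2013-06-20", "risk": "low"},
]


def build_index(rows, column):
    """Map each distinct value of `column` to the positions of matching rows."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index


def lookup(rows, index, value):
    """Read only the rows whose positions the index lists for `value`."""
    return [rows[position] for position in index.get(value, [])]


risk_index = build_index(table, "risk")
low_risk_rows = lookup(table, risk_index, "low")
print(low_risk_rows)
```

The lookup touches two of the four rows rather than scanning the entire table, which corresponds to loading fewer rows into volatile memory.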
In a clustered database system, multiple “nodes” have access to the same on-disk copy of a database. The speed of responding to a query may be improved by partitioning a database object (index and table), and assigning each partition to a different server. After a particular server reads a particular partitioned index, that particular server loads the corresponding rows from the corresponding table partition. Once loaded into volatile memory, the data items may remain cached in volatile memory so that subsequent accesses to the same data items will not incur the overhead of accessing a disk.
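Partitioning a table and assigning each partition to a different server may be sketched as follows. The text above does not specify a partitioning scheme, so the example assumes hash partitioning on a single column, using zlib.crc32 so that the assignment is repeatable across runs. The helper names partition_of and partition_rows and the sample rows are hypothetical.

```python
import zlib


def partition_of(value, num_partitions):
    """Deterministically map a column value to a partition number."""
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions


def partition_rows(rows, column, num_partitions):
    """Split rows into `num_partitions` lists according to `column`."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_of(row[column], num_partitions)].append(row)
    return partitions


# Hypothetical rows; values are invented for illustration.
rows = [
    {"sales": 120, "date": "2013-07-15", "risk": "high"},
    {"sales": 40, "date": "2013-01-10", "risk": "high"},
    {"sales": 95, "date": "2013-12-05", "risk": "low"},
    {"sales": 30, "date": "2013-06-20", "risk": "low"},
]

# Each inner list stands in for the partition assigned to one server.
partitions = partition_rows(rows, "date", 3)
for server_id, part in enumerate(partitions):
    print(server_id, part)
```

Because the mapping is deterministic, a server that needs the rows for a particular value can compute which partition holds them without consulting the other servers.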
Loading records from “big data” takes a significant amount of time that varies from algorithm to algorithm due to the varying amount of useful data and the varying length of the records being loaded. Once the data is loaded into a cluster, the data may be stored in corresponding caches. However, having servers working on cached data in parallel is less likely to improve performance because the distribution of records across the cluster must change for each database operation. Redistributing the data for each operation usually involves cross-server communication for a more favorable distribution. The metaphorical concept that “data has mass” effectively communicates that transferring large amounts of heterogeneously structured data around a cluster is a slow, inefficient process.
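The cost of redistributing records between operations may be illustrated with a minimal sketch. The example below assumes, purely for illustration, that records are placed on servers by hashing one column and that a later operation requires placement by a different column. It counts how many records would have to cross servers. The helper names server_for and rows_to_move and the sample rows are hypothetical.

```python
import zlib


def server_for(value, num_servers):
    """Deterministically map a column value to a server number."""
    return zlib.crc32(str(value).encode("utf-8")) % num_servers


def rows_to_move(rows, old_column, new_column, num_servers):
    """Count rows whose server changes when the distribution column changes."""
    moved = 0
    for row in rows:
        old_server = server_for(row[old_column], num_servers)
        new_server = server_for(row[new_column], num_servers)
        if old_server != new_server:
            moved += 1
    return moved


# Hypothetical rows; values are invented for illustration.
rows = [
    {"sales": 120, "date": "2013-07-15", "risk": "high"},
    {"sales": 40, "date": "2013-01-10", "risk": "high"},
    {"sales": 95, "date": "2013-12-05", "risk": "low"},
    {"sales": 30, "date": "2013-06-20", "risk": "low"},
]

moved = rows_to_move(rows, "date", "risk", 4)
print(moved, "of", len(rows), "rows would change servers")
```

Every row counted here represents data that must be transferred over the network before the next operation can proceed, which is the overhead the “data has mass” metaphor describes.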
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.