A typical relational database is organized into structured tables. Each row of a table stores a record, and the fields associated with each record are organized into columns. Users and applications retrieve information from the tables by issuing queries to the database that search one or more tables for content meeting certain criteria. These queries may be drafted to look for trends in data that explain or predict the occurrence of a particular phenomenon.
For example, assume a banking executive creates a low-risk lending package for car buyers, but car buyers are simply not purchasing the lending package. A marketing analyst may consult a sales database to determine how to increase sales. The marketing analyst could query a sales database containing similar lending packages from previous years to determine trends in packages that sell well. An example query issued for this purpose may have the following form in SQL:
SELECT sales, date, risk FROM lendingPKG_sales_table ORDER BY date;
After sorting the sales by date, the marketing analyst may notice that high risk packages sell best during the summer months and low risk packages sell best during the winter months. Based on this trend, the marketing analyst may report that lending packages should be tailored to sell to car buyers for a particular season.
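The analysis described above can be sketched in a few lines of Python. This is a minimal illustration only: the rows, values, and the two-bucket season mapping are hypothetical stand-ins for results a query like the one above might return.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (sales, date, risk) rows, as might be returned by the
# example SQL query; the values are illustrative only.
rows = [
    (120, date(2013, 7, 15), "high"),
    (95,  date(2013, 8, 2),  "high"),
    (40,  date(2013, 7, 20), "low"),
    (30,  date(2014, 1, 10), "high"),
    (110, date(2014, 1, 25), "low"),
    (105, date(2014, 2, 5),  "low"),
]

def season(d):
    """Map a date to a coarse season bucket (simplified to two buckets)."""
    return "summer" if 6 <= d.month <= 8 else "winter"

# Total sales for each (season, risk) pair.
totals = defaultdict(int)
for sales, d, risk in rows:
    totals[(season(d), risk)] += sales

for key in sorted(totals):
    print(key, totals[key])
```

With these illustrative values, high-risk sales cluster in summer and low-risk sales in winter, mirroring the trend the analyst observes.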
Arriving at a hypothesis that correlates risk with the time of year requires creating a query against a specific table in a database that has the fields needed to test this correlation. In this example, the values for “sales”, “date”, and “risk” are pulled from data already organized into columns labeled “sales”, “date”, and “risk”.
Unfortunately, a single database containing these fields may not exist. When the necessary databases do not already exist, a data analyst may attempt to ascertain trends from large quantities of data, referred to as “big data”, without having the data organized into a single table that can be queried to readily show trends in the data.
Big data may comprise thousands or even hundreds of thousands of files that are organized into different data structures. Navigating this data may be difficult due to the size of the data (terabytes to petabytes) and the heterogeneous nature of the files. A dataset consisting of big data may comprise numerous files with many different key-value pairs. For example, consider the following file.
{{Name:Jon, Date:Sep-1-2014}{Name:Ben, Date:Sep-2-2014}{Name:Erin, Date:Sep-3-2014, Phone:555-1234}}
Here, there are three records. Two of the records contain key-value pairs for the keys “Name” and “Date”, while the third record contains key-value pairs for the keys “Name”, “Date”, and “Phone”.
Typically, data scientists take a shot-in-the-dark approach to analyzing big data in hopes of finding useful trends. In the shot-in-the-dark approach, an analyst asks a data scientist or a programmer to scan for a particular type of data. The data scientist or programmer then writes a program to search the big data dataset for that particular type of data. When the results are not useful, the analyst asks for some other type of data, and the cycle repeats until something useful is found. This time-consuming process may easily go back and forth for weeks, with the data analyst trying to distinguish useful data from non-useful data. Additionally, each iteration of programming requires a new efficient algorithm that can be performed against the big data. The abilities of the data analyst may also limit the data scientist in his or her approach to solving the problem.
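A single iteration of this approach can be sketched as a one-off scan over heterogeneous records for a requested field. The dataset, key name, and helper function below are hypothetical; a real iteration would operate over many files rather than an in-memory list.

```python
def scan_for_key(dataset, key):
    """Return every record that contains the requested key, regardless of
    how the rest of the record is structured (one 'shot in the dark')."""
    return [rec for rec in dataset if key in rec]

# Hypothetical heterogeneous dataset, echoing the earlier example.
dataset = [
    {"Name": "Jon",  "Date": "Sep-1-2014"},
    {"Name": "Ben",  "Date": "Sep-2-2014"},
    {"Name": "Erin", "Date": "Sep-3-2014", "Phone": "555-1234"},
]

# The analyst requests "Phone" data; if the hits prove unhelpful, the
# next iteration requires writing and running a new scan for another key.
hits = scan_for_key(dataset, "Phone")
print(len(hits))
```

Each such scan answers only the one question it was written for, which is why the process iterates for weeks when the first guesses miss.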
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.