Software applications generate and consume vast amounts of data, which are stored in one or more back-end systems that allow data manipulation and management. For example, data can be stored in relational database systems that contain a plurality of tables, each having a plurality of columns. Data access or retrieval can be accomplished through issuance of searches and/or queries. Queries can be issued by user(s) and/or software application(s) and can be designed to retrieve data that may be requested for various reasons, such as analysis, operations, etc. Queries can be written using a variety of computing languages and/or environments. One example of such a computing language is Structured Query Language (“SQL”). SQL includes a data definition language and a data manipulation language. SQL can further include data insert, query, update, and delete, schema creation and modification, and data access control features. SQL queries use a declarative SELECT statement, which retrieves data from one or more stored tables or expressions. SQL queries allow the user to describe the desired data, in response to which a database management system (“DBMS”) plans, optimizes, and performs the physical operations necessary to produce a resulting dataset.
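By way of a non-limiting illustration, the declarative character of a SELECT statement can be sketched as follows. The sketch assumes Python's standard `sqlite3` module as a stand-in DBMS; the `orders` table, its columns, and the `total_by_customer` helper are hypothetical names chosen only for this example. The query states which result is wanted, and the DBMS decides how to compute it.

```python
import sqlite3


def total_by_customer(rows):
    """Aggregate (customer, amount) rows with a declarative SELECT.

    The table name and columns are hypothetical; sqlite3 is only a
    stand-in for a relational DBMS.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Data definition: the schema must exist before any data is loaded.
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        # Data manipulation: insert the supplied rows.
        conn.executemany(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)", rows
        )
        # Declarative query: describes the desired result, not the
        # physical steps; planning and execution are left to the DBMS.
        cursor = conn.execute(
            "SELECT customer, SUM(amount) FROM orders "
            "GROUP BY customer ORDER BY customer"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

Note that in this sketch the table definition necessarily precedes the first insert, which is the up-front schema step that a relational system requires.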
Some databases that store data are implemented using a NoSQL or “Not Only SQL” methodology, which includes highly optimized key-value stores intended for simple retrieval and append operations, thereby providing significant performance benefits in terms of latency and throughput. NoSQL databases provide an ability to flexibly load, store, and access data without having to define a schema ahead of time. By removing this up-front data management effort, developers can more quickly get their applications up and running, without having to worry in advance about which attributes will exist in their datasets, or the domains, types, and dependencies of those attributes. The proliferation of NoSQL databases has resulted in an increasingly large amount of production data being represented as incompletely structured data, for example using key-value and JavaScript Object Notation, or “JSON,” data models, instead of traditional relational models. Existing methods of analyzing such data are suboptimal.
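As a non-limiting illustration of incompletely structured data, the sketch below accepts JSON records whose attributes and value types differ from record to record, and only afterwards reports what was observed. It assumes Python's standard `json` module and records that are JSON objects; the `observed_attributes` helper and the sample field names are hypothetical and do not correspond to any particular NoSQL product.

```python
import json


def observed_attributes(json_lines):
    """Map each attribute name to the set of value type names seen.

    No schema is declared in advance: whatever attributes and types
    appear in the records are simply recorded as they are encountered.
    Each input line is assumed to hold one JSON object.
    """
    seen = {}
    for line in json_lines:
        record = json.loads(line)
        for key, value in record.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return seen


# Hypothetical records: "id" changes type and "tags" appears only once.
docs = [
    '{"id": 1, "name": "a"}',
    '{"id": "2", "tags": ["x"]}',
]
attributes = observed_attributes(docs)
```

In this sketch the same attribute can carry different types across records, and an attribute can be absent entirely, which is what makes such data difficult to expose through a fixed relational schema.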
Some NoSQL databases support primitives that enable the stored data to be analyzed. However, these primitives are not fully compliant with the SQL standard, rendering a large number of third-party analysis and business intelligence tools incompatible and unable to help with the analysis. Other NoSQL databases connect to Hadoop, which enables Hadoop MapReduce and other execution frameworks within Hadoop to analyze data. However, connecting to Hadoop and enabling analysis via various projects in the Hadoop ecosystem has its own shortcomings. Either these projects provide non-SQL interfaces that have the same compatibility and skill-set shortcomings as the NoSQL primitives, or, if they do provide a SQL interface, they require the user to create a schema before the data can be analyzed via SQL, thereby eliminating a principal reason why the NoSQL database was used in the first place.
Accordingly, a need exists for a data analysis system that enables analytical queries implemented through structured query languages (such as SQL) to be issued over incompletely structured data (such as key-value or JSON data) without first having to define a schema.