Although companies commonly store data in relational databases, many data sources are not held in a relational format. Examples include flat files, Portable Document Format (PDF) files, mail, Lightweight Directory Access Protocol (LDAP) stores, social media posts, Hypertext Transfer Protocol (HTTP) documents, and Excel spreadsheets, among others. In addition, as systems become more efficient and need to integrate into an increasingly complex ecosystem of devices and hosting models, new non-relational storage formats are created over time, such as Extensible Markup Language (XML) and, more recently, JavaScript Object Notation (JSON), adding to an already crowded set of existing formats. Furthermore, some data sources are not available directly, but only through an application programming interface (API), a web service, a representational state transfer (REST) service, or other protocols that return unstructured data, such as weather information, Twitter feeds, financial stock market data, and so on. Still other systems, such as electromechanical devices, sensors, and instrumentation components, provide data that is not easily consumable.
It can be very cumbersome, and sometimes impossible, for data analysts, business users, and other systems to consume these disparate data sources in their native formats, because each source requires a specialized skill set and/or hardware. In addition to differences in format, the content of a document is constructed by each data source provider according to its own needs and specifications; for example, an XML document describing traffic conditions will be structurally different from a web service listing zip codes, which in turn will be very different from an API accessing thermostats in a building. As a result, because these data sources are disparate, and most of them unstructured, they cannot be queried directly and uniformly in the way that relational data sources can.
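By way of illustration only, the following minimal Python sketch shows why each source tends to need its own handling. The two payloads, their element and field names, and the helper functions `traffic_rows` and `zip_rows` are hypothetical and invented for this example; they do not correspond to any particular provider.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads, invented for illustration only.
TRAFFIC_XML = """
<traffic>
  <road name="I-95"><speed unit="mph">42</speed></road>
  <road name="US-1"><speed unit="mph">18</speed></road>
</traffic>
"""

ZIP_JSON = (
    '{"results": [{"zip": "33431", "city": "Boca Raton"},'
    ' {"zip": "10001", "city": "New York"}]}'
)


def traffic_rows(xml_text):
    """Flatten the XML traffic document into (road, speed_mph) tuples."""
    root = ET.fromstring(xml_text.strip())
    return [
        (road.get("name"), int(road.find("speed").text))
        for road in root.iter("road")
    ]


def zip_rows(json_text):
    """Flatten the JSON zip-code response into (zip, city) tuples."""
    payload = json.loads(json_text)
    return [(item["zip"], item["city"]) for item in payload["results"]]


# Each source needs a different parser and different knowledge of its layout
# before it can be treated as rows and columns.
print(traffic_rows(TRAFFIC_XML))
print(zip_rows(ZIP_JSON))
```

Although both results end up as simple rows, reaching that point requires source-specific code in each case, which is the burden described above.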
To solve this problem, developers typically use an intermediate relational database management system (RDBMS) to host the data from the disparate sources: they create middleware, run batch or near-real-time integration tools to import the various data sets into the database, and then create views or other database objects so that the data can be queried using Structured Query Language (SQL). While this solution is widely accepted today, it is inefficient because it typically requires an intermediate storage engine, and it cannot be used to query real-time sources of data, since the information often must first be stored in the intermediate database before it can be queried.
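The conventional approach can be sketched as follows, purely as an illustration. Here an in-memory SQLite database stands in for the intermediate RDBMS, and the payload, the table name `zip_staging`, the view name `zip_codes`, and the function `load_into_staging` are assumptions made for this example rather than part of any specific product.

```python
import json
import sqlite3

# Hypothetical web-service payload, invented for illustration only.
ZIP_JSON = (
    '{"results": [{"zip": "33431", "city": "Boca Raton"},'
    ' {"zip": "10001", "city": "New York"}]}'
)


def load_into_staging(conn, json_text):
    """Batch-import step: copy the source data into an intermediate table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS zip_staging (zip TEXT PRIMARY KEY, city TEXT)"
    )
    rows = [(r["zip"], r["city"]) for r in json.loads(json_text)["results"]]
    conn.executemany(
        "INSERT OR REPLACE INTO zip_staging (zip, city) VALUES (?, ?)", rows
    )
    # A view exposes the staged copy to SQL consumers.
    conn.execute(
        "CREATE VIEW IF NOT EXISTS zip_codes AS SELECT zip, city FROM zip_staging"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stands in for the intermediate RDBMS
load_into_staging(conn, ZIP_JSON)

# The SQL query runs against the stored copy, not against the live source.
result = conn.execute(
    "SELECT city FROM zip_codes WHERE zip = ?", ("10001",)
).fetchall()
print(result)
```

In this sketch the query reflects the source only as of the last import, so any change at the source remains invisible until the import step runs again.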
As such, a need exists for methods and systems by which disparate data sources can be made to appear as relational data sources, so that they can be queried uniformly in real time without the need for an intermediate RDBMS.