Some distributed storage management systems fetch data from distributed storage devices on behalf of user devices that request the data for one or more applications that require it for execution. Distributed storage management systems typically include one or more servers that user devices may use to request data, one or more servers that may fetch the data from one or more storage devices based on the request, and one or more storage devices that may store the requested data. The user devices may request data for the one or more applications from the one or more storage devices, and may receive the requested data together with other data that was not requested for the one or more applications. That is, the one or more storage devices may receive a request from the one or more servers for the requested data, but because the requested data may be stored in a location alongside additional data, the storage devices may return both. Some distributed storage management systems may include storage devices that return the additional data along with the relevant data because the storage devices are unable to differentiate between relevant and irrelevant (additional) data: they neither understand the application data storage format nor have visibility into which data is relevant to any particular application data retrieval request. The bandwidth required to transfer the relevant data together with the additional data from the storage devices to the one or more servers is greater than the bandwidth required to transfer the relevant data alone. Also, the less data the storage devices return to the requesting application, the fewer resources (e.g., CPU, RAM) the requesting application consumes, since it then processes only the relevant data sent to it by the storage devices.
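The bandwidth difference described above may be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the disclosed implementation: the stored block, its CSV layout, and the function names fetch_whole_block and fetch_relevant_rows are hypothetical and chosen only for this example.

```python
import csv
import io

# Hypothetical contents of one storage location: the requested rows
# (sensor "s1") are stored alongside rows that were not requested.
STORED_BLOCK = (
    "sensor_id,timestamp,reading\n"
    "s1,1000,20.5\n"
    "s2,1000,18.1\n"
    "s1,1060,20.7\n"
    "s3,1060,25.0\n"
)


def fetch_whole_block(block: str) -> str:
    """Format-unaware storage device: returns everything at the location."""
    return block


def fetch_relevant_rows(block: str, column: str, value: str) -> str:
    """Format-aware storage device (assumed): returns only matching rows."""
    reader = csv.DictReader(io.StringIO(block))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[column] == value:
            writer.writerow(row)
    return out.getvalue()


full = fetch_whole_block(STORED_BLOCK)
relevant = fetch_relevant_rows(STORED_BLOCK, "sensor_id", "s1")

# Bytes that would cross the link between storage device and server.
full_bytes = len(full.encode("utf-8"))
relevant_bytes = len(relevant.encode("utf-8"))
print(f"whole block: {full_bytes} bytes; relevant rows only: {relevant_bytes} bytes")
```

In this sketch the filtering happens where the data is stored, so only the matching rows would need to be transferred and later processed by the requesting application.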
Data stores in enterprise architectures may be implemented using vendor-specific databases and storage infrastructure. Increasing volumes of data and data retention requirements are stretching the costs and capabilities of data stores. For example, financial services companies may be required by law (e.g., the Sarbanes-Oxley Act or the Basel Accords) to retain customer data (e.g., financial transactions) for up to a predetermined period of time, often multiple years. Such customers may include institutional customers, such as Goldman Sachs, and retail customers, such as an individual banking at a retail branch office of Bank of America. In other examples, telecommunications companies may also be required by law to retain customer data for up to a predetermined period of time for investigative purposes. For example, if an attack were to take place on a Federal building, law enforcement may be able to use telephone records to determine who may have committed the act.
Other reasons for increased data retention requirements are general business growth and an increased need to understand customer behavior in order to stay competitive. For example, because of “360-degree view of a customer” requirements, companies are ingesting and retaining records from an ever-increasing number of channels and never deleting data. Manufacturing companies are keeping all records, quality control images, and measurements throughout the entire manufacturing process. This allows them to perform more efficient fault detection and provide detailed feedback for quality control, thereby increasing product quality and manufacturing efficiency and reducing maintenance and customer support costs.
Companies look for ways to move some of their data and/or workloads from expensive, proprietary relational database management system (RDBMS) and storage area network (SAN) systems to cheaper, open systems such as those made possible by Big Data and Cloud ecosystems. There are several SQL engines for these new “backend” ecosystems (such as Cloudera's Impala and HortonWorks' Hive), although none is currently as mature or as sophisticated as those provided by traditional RDBMSs. As a result, when a company's volume of data grows, the company may either re-engineer its existing applications, reporting, and archival systems to exploit new Big Data/Cloud technologies or continue paying fees for RDBMS licenses and SAN storage hardware. Most companies will not re-architect their enterprise systems and all of their applications, data flows, reporting, and archiving solutions in favor of using only Big Data or Cloud solutions. As a result, RDBMSs will continue to be a critical part of company enterprise systems. However, some RDBMSs may incorporate new data architectures as more enterprise data is collected, stored, analyzed, and archived centrally in large processing clusters.
As the amount of data collected increases, enterprise systems may begin to offload volumes of data from their application databases to backend Big Data clusters (e.g., data lakes), where the data may be consumed by applications that analyze, post-process, and distribute the data to other applications or to Cloud storage for online archival. The enterprise systems will simultaneously need to provide the same data to the applications generating the data. As a result, user devices, and more specifically application databases executing application queries on behalf of the user devices, need to be able to query data from any source database, from within any application and any database (e.g., a consumer database), with a minimum of latency and processing overhead in the application and the consumer database. For example, an application on a user device must be able to request and read data from a source database on a storage device with a minimum of latency in requesting, receiving, storing, and processing the data.
Although RDBMS, Big Data, and Cloud ecosystems are viable options, companies are increasingly investing in other distributed storage systems. Companies are selecting these other options because they provide more flexibility in file types, which a user (e.g., an application developer or data scientist) may find easier to work with when developing applications. For example, a data scientist interested in analyzing data collected by one or more sensors may find it easier to work with data in a comma-separated values (CSV) file format, JavaScript Object Notation (JSON) file format, Extensible Markup Language (XML) file format, and/or tab-separated values (TSV) file format. Another example is cloud-first data collection agents that collect Internet-of-Things data or weblogs directly into simple Cloud object store platforms such as Amazon S3. A data scientist may want to read and process subsets or aggregates of these logs directly from the object store, without first having to load the data into an RDBMS or Big Data processing backend. Some RDBMSs and Cloud ecosystems require a user to use a proprietary file format when developing applications that require access to data stored on those systems. Accordingly, the user is tasked with developing an interface between the application being developed and the proprietary file formats or applications running on these systems.
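As an illustration of computing an aggregate directly from plain files, the following minimal sketch reads the same sensor log from CSV, TSV, or JSON text without first loading it into a database. The sample logs, the field names sensor_id and reading, and the functions read_records and mean_reading are assumptions made for this example only; XML is omitted for brevity, and a real object store would be accessed through its own client rather than through in-memory strings.

```python
import csv
import io
import json
from typing import Dict, Iterator

# Hypothetical sensor logs holding the same records in three formats.
CSV_LOG = "sensor_id,reading\ns1,20.0\ns2,18.0\ns1,22.0\n"
TSV_LOG = "sensor_id\treading\ns1\t20.0\ns2\t18.0\ns1\t22.0\n"
JSON_LOG = json.dumps([
    {"sensor_id": "s1", "reading": 20.0},
    {"sensor_id": "s2", "reading": 18.0},
    {"sensor_id": "s1", "reading": 22.0},
])


def read_records(text: str, fmt: str) -> Iterator[Dict[str, str]]:
    """Yield one dict of string fields per record, whatever the file format."""
    if fmt in ("csv", "tsv"):
        delimiter = "," if fmt == "csv" else "\t"
        yield from csv.DictReader(io.StringIO(text), delimiter=delimiter)
    elif fmt == "json":
        for record in json.loads(text):
            yield {key: str(value) for key, value in record.items()}
    else:
        raise ValueError(f"unsupported format: {fmt}")


def mean_reading(text: str, fmt: str, sensor_id: str) -> float:
    """Aggregate a subset of the log: mean reading for one sensor."""
    readings = [
        float(record["reading"])
        for record in read_records(text, fmt)
        if record["sensor_id"] == sensor_id
    ]
    if not readings:
        raise ValueError(f"no readings for sensor {sensor_id}")
    return sum(readings) / len(readings)


for fmt, log in (("csv", CSV_LOG), ("tsv", TSV_LOG), ("json", JSON_LOG)):
    print(fmt, mean_reading(log, fmt, "s1"))
```

The application code in this sketch sees the same record shape regardless of the underlying file format, which is the kind of flexibility the paragraph above describes.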
The systems, methods, and devices disclosed herein are platform (e.g., processor and/or operating system) agnostic and therefore are not limited to any specific platforms or architectures of RDBMSs or Cloud ecosystems. The systems, methods, and devices in the instant application provide an interface between user devices that require access to data stored on alternative storage devices (e.g., distributed file systems or distributed storage devices) other than RDBMSs and Cloud ecosystems, and the alternative storage devices themselves. Accordingly, the systems, methods, and devices disclosed herein enable the user devices to execute one or more applications requiring data stored in any file format (e.g., CSV, TSV, XML).