A data processing system, such as one or more computer processors running an application program, has input data and output data (“I/O”). The inputs are often received from or returned to a physical storage device referred to as a memory. For example, the memory may be RAM (a random access memory), a disk (e.g., a hard drive), or a memristor (a memory-resistor). A repository of information stored as bits of data on one or more of the memories may be referred to as a datastore. The datastore may be used to store, manage, and distribute electronic data, and may store any of the data resources produced and/or utilized by an individual and/or an organization. The efficiency with which data can be addressed, retrieved, and stored on the memory, and the security of the information in the datastore, may be important to an organization such as an enterprise, a cloud computing platform or a social network.
To facilitate the I/O of a computer, data is stored in the memory according to a particular arrangement. A data model provides the framework for organizing and representing data. The data model may be viewed as a set of rules and/or constraints for the arrangement of the data that defines permissible data structures that are built within the physical storage device. A data structure may be seen as a physical implementation of a data model's logical organization, and the data structure occurs as physical interrelationships among stored bits of data in the memory. The data structure may determine how the data processing system can interact with the datastore, including how pieces of data can be deposited, queried, addressed, and/or retrieved.
The way in which data is added to or retrieved from the memory may directly relate to the efficiency of the data processing system. For example, the data structure may determine how much time and energy a computer must expend to write the piece of data to the memory (e.g., add information in an organized way to the datastore) or to find, read, and retrieve a specific set of data from the memory. In some cases, the storage and retrieval may be the slowest aspect of a computer process. As one of the elements working with the I/O, the data structure may therefore be a factor in increasing efficiency of the computer. For example, this may be true for computer servers that power the back-end of e-commerce websites, social networks, enterprise computer applications, content delivery networks, “big data” analytics, and cloud computing services. Sometimes spread over wide geographic areas, networks of computer servers holding a distributed datastore may have to handle thousands, millions, or even billions of users, each requiring unique data to be stored and retrieved at all times and in unpredictable spikes of activity.
Related to normal operational efficiency is analytical efficiency. The data structure may also determine what types of analysis can be conducted on information in the datastore without excessively taxing computing resources. The data structure may therefore determine what insight can be practically derived from the datastore. Such analysis may crucial to make key strategic business decisions, to maintain and expand the data processing system itself, or even as the primary product or service offered by an enterprise.
The data structure placed in the memory may also directly determine which controls can be defined over access to and/or use of information in the datastore. Attempting to control access or use of data may require additional read and write operations that can add significant overhead to the data processing system. The data structure may therefore practically determine what security features can be implemented in a system having a large number of users.
A file system stores data according to a data model in discrete chunks referred to as “files.” Some file systems may be organized according to a hierarchical data model that may limit how a file can be stored, retrieved and controlled. For example, an entire leg of a hierarchical data structure, often represented by cascading series of directories, may need to be analyzed by a computer in order to find a specific file. “Walking” these legs each time the specific file is requested may represent significant time and/or energy. The way in which a file is addressed within the hierarchical data structure, which may be through use of a “file path,” may also cause application programs dependent on the file to break if the file is moved to a new leg of the hierarchy. Similar breakage may occur when an intermediate directory along the path between the root of the hierarchy and the file is deleted. Thus, addressability of a piece of data within the hierarchy may be associated with a taxonomy (e.g., how directories are organized) and a namespace (e.g., how directories are named or labeled). This dependence on locality may lead to repeated copying of a file to ensure it remains addressable, potentially leading to inconstancies between the original and the copy. Additionally, it may be difficult to represent non-hierarchical relationships between files and/or directories. A hierarchical file system may also attempt to control files with an external security system called an access control list. Under such an access control system, once a file is removed from the datastore its use and further dissemination may be beyond control of the file system and the organization that administers it.
A database may also be used to store data of a datastore. A particular database is comprised of a set of technologies that includes an application program that utilizes a particular data model to store data with and retrieve data from a datastore. For example, a database may include modules for processing a query language, buffering queries, and handling data recovery and security of the datastore. Databases may commonly store data at a finer granularity than a file system. A database may also be referred to as a database management system.
A relational database uses a relational data model comprising tables having sets of rows and columns. The resulting data structure within the memory may be relatively rigid, and may require foresight as to what information will be stored so that expectant tables can be set up ahead of data entry. Relationships between pieces of data within a relational database may need to be defined programmatically, that is, within the software code of the application program making use of the data. Defining relationships programmatically may increase time and energy required to store, retrieve and process information, and may make it difficult to determine relationships between data resources when examining the datastore. Relational databases may have similar access controls to a file system. Relational databases may also have difficulty when operating as part of a distributed datastore, for example by intolerance of network partitions that temporarily segregate two parts of the datastore.
Another class of databases are “NoSQL” databases, which may have been developed as an alternative to traditional relational databases and designed for use over distributed networks. The data models employed by NoSQL databases vary, and may include a column model, a key-value store, a document model, and/or an entity-attribute-value model. Some of these models are implemented “on top of” others. For example, some commercial databases implement a document model on top of a key-value store.
However, data models employed by NoSQL databases may be relatively under-defined. Some NoSQL databases are even referred to as “schemaless.” For example, a dearth of organizational definition may mean that data structures within the memory may have no consistent format and/or organization, which may result in extra time and energy to address, retrieve and/or analyze data. Additionally, two distinct datastores utilizing the same data model (e.g., a document model) may arrange data within the limited rules of the model differently and may prevent the two datastores from being readily compared. “Big data” tools (e.g., MapReduce) may be used in an attempt to resolve data structure discrepancies but may require significant storage and computing resources.
Current data structures may result in a relatively large expenditure of time and energy by the data processing system, unnecessary duplication of data, limited analysis, and limits on the form of security and control that can be placed on information within a datastore. These inefficiencies can impact an organization, especially as the organization attempts to scale a network of computers to meet rising storage and computational demand. Not only may the costs of electricity, bandwidth and hardware increase faster than revenue, but inefficiencies may cause a file system, a database, or an application program to slow or crash, including those that power internet-scale businesses. The commercial transactions that the data processing system would otherwise have conducted may represent thousands or millions of dollars in lost revenue. Experiencing sluggish or intermittent service, users may leave for competing services.
To lower transactional costs organizations may be forced to adopt multiple databases or file systems for different aspects of their organization, each of which may employ a different data model. Translating data between databases and/or data models may impart its own set of inefficiencies (e.g., impedance mismatch) and such systems may constrain future development due to increased complexity.