Field
The present disclosure generally relates to methods and apparatus for organizing data in a database, and more particularly to organizing storage of sparse data in a database.
Background
Relational databases have greatly enhanced the ability to efficiently store and manage structured data. Such databases can manage multiple terabytes of data, for example, and support distributed computing and applications. Most commercial relational databases require users to create data models to represent the relationships that govern the domain from which the stored data is extracted. Data models can be normalized or broken into manageable logical entities based upon relationships observed in the data model. This optimizes performance for applications using the database. In turn, normalization necessitates creation of various tables in the database, with each table containing columns that represent attributes relevant to the domain being modeled. The records in these tables represent observed values for these attributes from the domain. Further, multiple tables can be joined using relationships between the columns and the values stored therein. Structured Query Language (SQL) is the standard computer language used to manage data stored in relational databases.
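By way of illustration only, the following minimal sketch shows a normalized model in which two tables are joined on a shared column. It uses Python's standard sqlite3 module with an in-memory database; the table names, column names, and values (products, stock, product_id, and so on) are hypothetical and are not taken from the disclosure.

```python
import sqlite3

# In-memory database; all table names, column names, and values below are
# hypothetical and serve only to illustrate normalization and joins.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        price      REAL
    );
    CREATE TABLE stock (
        product_id INTEGER REFERENCES products(product_id),
        quantity   INTEGER
    );
    INSERT INTO products VALUES (1, 'widget', 2.50), (2, 'gadget', 9.99);
    INSERT INTO stock VALUES (1, 40), (2, 5);
    """
)

# Join the two tables using the relationship between their product_id columns.
rows = conn.execute(
    """
    SELECT p.name, p.price, s.quantity
    FROM products AS p
    JOIN stock AS s ON s.product_id = p.product_id
    ORDER BY p.product_id
    """
).fetchall()
conn.close()
```

Each attribute in this sketch must be declared in a table definition before any value for it can be stored, which is the modeling step the passage above describes.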
The advent of the Internet has created increasing demand for database systems that can model domains where the relationships governing the database may not be known a priori or may change frequently. As an example, an Internet-based business might initially store data such as lists of products, prices, and quantity of each product in stock. As the business grows, however, it may become important to know and store other information, such as who buys what products, where the purchases originate, or other relevant information about the transactions. Known relational databases, by design, make it difficult to continuously change the underlying model representing such a business.
A solution to this problem may be to not normalize the data model, but rather store all information in flat databases. A flat database can consist of one large table where the columns represent attributes of interest, and the rows are records of observed values for these attributes. This approach is similar to using a spreadsheet for storing data, something routinely done in small businesses, for example. The resulting flat databases will contain sparse datasets, where most records have observed values for a few key attributes and the values for remaining attributes are left blank.
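As a rough illustration of how such a flat table becomes sparse, the following sketch expands a few records into full-width rows and measures the fraction of cells left blank. The attribute names and record values are hypothetical, chosen only to echo the business example above, and the helper functions to_flat_rows and blank_fraction are introduced here solely for illustration.

```python
# Hypothetical attributes (columns) of a single flat table.
COLUMNS = ["product", "price", "quantity", "buyer", "origin"]

# Hypothetical records; each one observes values for only a few attributes.
records = [
    {"product": "widget", "price": 2.50, "quantity": 40},
    {"product": "gadget", "price": 9.99, "buyer": "alice"},
    {"product": "gizmo", "origin": "US"},
]


def to_flat_rows(records, columns):
    """Expand each record into a full-width row, using None for blank cells."""
    return [[rec.get(col) for col in columns] for rec in records]


def blank_fraction(rows):
    """Return the fraction of cells in the flat table that hold no value."""
    total = sum(len(row) for row in rows)
    blanks = sum(1 for row in rows for cell in row if cell is None)
    return blanks / total if total else 0.0


flat_rows = to_flat_rows(records, COLUMNS)
sparsity = blank_fraction(flat_rows)
```

In this small example, 7 of the 15 cells are blank; adding further attributes that only a few records populate would be expected to increase that fraction.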
Accordingly, using relational databases to store sparse data, or data whose underlying model or schema changes frequently, works against an optimized database and requires significant maintenance. A known approach for storage of sparse data involves the use of a column-oriented database system, which has been shown to be a substantial improvement over existing relational databases for storing sparse data. Such database systems, however, still require some modeling of the domain. This requirement can therefore present a limitation when storing data for which the underlying schema changes frequently.