1. Field of the Invention
Embodiments of the present invention generally relate to operation of a NoSQL database, and, in particular, to a system and method for efficiently compacting a large NoSQL database in which a majority of the data is substantially static.
2. Description of Related Art
Conventional Structured Query Language (“SQL”) refers to a special-purpose programming language designed for managing data in a relational database management system (“RDMS”), which is usually a table-based database. When an existing row is updated, an SQL database will read in the existing row, update it, and rewrite it. Similarly, on a row delete, SQL would delete the row from the table.
NoSQL refers to a broad class of database management systems identified by non-adherence to conventional SQL. A NoSQL database does not work in the same way as a typical SQL database. NoSQL databases are not built primarily on tables, and generally do not use SQL for data manipulation.
NoSQL database systems are often highly optimized for retrieval and appending operations and often offer little functionality beyond record storage (e.g. key-value stores). When an existing row (i.e., a record) is updated, a NoSQL database will write a new copy of the row. Upon reading a record, the NoSQL database will find the newest version of the row and chooses that as the correct version. Upon deleting a record, the NoSQL database will write a new copy of the row, with the new copy being marked as a deleted record. The reduced run-time flexibility of NoSQL compared to conventional SQL systems is compensated by marked gains in scalability and performance for certain data models. NoSQL database management systems may be useful when working with a huge quantity of data when the nature of the data does not require a relational model.
An example of a NoSQL database is Cassandra. Cassandra is known as an open source distributed database management system that is designed to handle large amounts of distributed data. Cassandra provides a structured key-value store. Keys map to multiple values, which are grouped into column families. The column families are fixed when a Cassandra database is created, but columns can be added to a family at any time. Furthermore, columns are added only to specified keys, so different keys can have different numbers of columns in any given family. The values from a column family for each key are stored together.
Database operation typically includes adding, deleting, and modifying records. Over time, storage space may be wasted where obsolete or now-deleted records were stored, causing the database size to be larger than necessary. Compaction is a process by which such wasted space within a database may be reclaimed. Compaction serves two main purposes: iteratively reordering the data so they can be accessed faster, and getting rid of obsolete data, i.e. overwritten or deleted, in order to reclaim space.
Periodically, the NoSQL database will gather up groups of data files for a particular column family (similar to a table in a conventional SQL database) and compact the group of data files. This involves writing into new files all the current versions of the rows that exist in that group of data files. This eliminates the duplicate versions, and allows deletions to occur.
Compaction processes of the known art require copying of data over and over. For a large database whose data is also substantially static (i.e., most of the data does not change in between instances of the compaction process), compaction processes of the known art are relatively inefficient. The inefficient processes for this type of database result in excessive resource usage such as excessive CPU usage, excessive disk writes, and excessive temporary disk usage for overhead purposes. Furthermore, the excessive resource usage tends to discourage a more frequent compaction schedule.
Therefore, a need exists to provide a more efficient compaction process tailored to large databases whose data is also substantially static, in order to overcome the aforementioned shortcomings of the known art, and ultimately to provide improved customer satisfaction.