US 12,169,507 B2
Database compaction in distributed data system
Todd Lipcon, San Francisco, CA (US)
Assigned to CLOUDERA, INC., Palo Alto, CA (US)
Filed by Cloudera, Inc., Palo Alto, CA (US)
Filed on May 28, 2019, as Appl. No. 16/424,083.
Application 16/424,083 is a continuation of application No. 15/073,509, filed on Mar. 17, 2016, granted, now Pat. No. 10,346,432.
Claims priority of provisional application 62/134,370, filed on Mar. 17, 2015.
Prior Publication US 2019/0278783 A1, Sep. 12, 2019
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 16/27 (2019.01)
CPC G06F 16/278 (2019.01)
20 Claims
1. A computer-implemented method for organizing data into segments to improve read performance in a distributed database system having at least a master node and a slave node, the slave node including a data node that hosts a plurality of tablets, the plurality of tablets being a part of a large distributed table managed by the master node, the method comprising:
instantiating corresponding components of a compaction engine respectively on the master node and on the slave node, wherein the master and the slave nodes are physically separated but communicatively coupled together via a network;
determining, by the compaction engine, a height of a tablet within a keyspace of the tablet based on a number of rowsets, in a plurality of rowsets included in the tablet, that have key ranges that overlap;
determining, by the compaction engine, a rowset width of each rowset in the keyspace of the tablet based on a percentage of the keyspace to which the rowset corresponds;
until a minimum operational cost is reached, iteratively calculating, by the compaction engine, an operational cost associated with compaction of two or more rowsets in the keyspace based on the height of the tablet and the rowset widths of the two or more rowsets;
selecting, by the compaction engine, two or more particular rowsets for compaction based on the two or more particular rowsets resulting in the minimum operational cost; and
performing, by the compaction engine through communicating instructions between the master node and the slave node over the network, a compaction of the two or more particular rowsets, the performed compaction resulting in a merger of the two or more particular rowsets,
wherein the compaction results in one or more of: reduction in the height of the tablet, removal of overlapping rowsets, and/or creation of smaller sized rowsets.
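The selection steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `RowSet`, `rowset_width`, `tablet_height`, `merge`, `compaction_cost`, `select_for_compaction`, and the `max_inputs` limit are all hypothetical. The sketch assumes that tablet height is the maximum number of rowsets whose key ranges overlap at any key, that a compaction yields a single rowset spanning the inputs' combined key range, and that the operational cost is the tablet height after compaction plus the summed widths of the inputs (a stand-in for data rewritten). The claim fixes none of these formulas. The master/slave communication is omitted.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class RowSet:
    name: str
    min_key: float  # inclusive lower bound of the rowset's key range
    max_key: float  # inclusive upper bound of the rowset's key range


def rowset_width(rs, lo, hi):
    """Fraction of the tablet keyspace [lo, hi] that the rowset covers."""
    return (rs.max_key - rs.min_key) / (hi - lo)


def tablet_height(rowsets):
    """Assumed definition: largest number of rowsets overlapping at any key."""
    # Sweep over range endpoints; opens (0) sort before closes (1) at equal
    # keys, so ranges that merely touch are counted as overlapping.
    events = sorted(
        [(rs.min_key, 0) for rs in rowsets] + [(rs.max_key, 1) for rs in rowsets]
    )
    depth = best = 0
    for _, kind in events:
        depth += 1 if kind == 0 else -1
        best = max(best, depth)
    return best


def merge(rowsets, chosen):
    """Replace the chosen rowsets with one rowset spanning their key ranges."""
    merged = RowSet(
        "+".join(rs.name for rs in chosen),
        min(rs.min_key for rs in chosen),
        max(rs.max_key for rs in chosen),
    )
    return [rs for rs in rowsets if rs not in chosen] + [merged]


def compaction_cost(rowsets, chosen, lo, hi):
    """Assumed cost: height after compaction plus total width rewritten."""
    height_after = tablet_height(merge(rowsets, chosen))
    rewritten = sum(rowset_width(rs, lo, hi) for rs in chosen)
    return height_after + rewritten


def select_for_compaction(rowsets, lo, hi, max_inputs=3):
    """Iterate over candidate groups of 2..max_inputs rowsets, keep the cheapest."""
    best, best_cost = None, float("inf")
    for k in range(2, min(max_inputs, len(rowsets)) + 1):
        for chosen in combinations(rowsets, k):
            cost = compaction_cost(rowsets, chosen, lo, hi)
            if cost < best_cost:
                best, best_cost = chosen, cost
    return best, best_cost


# Example: four rowsets in a keyspace of [0, 100].
rowsets = [
    RowSet("A", 0, 40),
    RowSet("B", 30, 70),
    RowSet("C", 60, 100),
    RowSet("D", 35, 65),
]
chosen, cost = select_for_compaction(rowsets, 0, 100)
print(tablet_height(rowsets), [rs.name for rs in chosen],
      tablet_height(merge(rowsets, chosen)))
```

In this example, rowsets B and D are selected, and merging them lowers the tablet height from 3 to 2. The exhaustive search is exponential in `max_inputs` and serves only to make the cost comparison explicit; a practical system would need a cheaper search strategy.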