A distributed system generally includes many loosely coupled computers, each of which typically includes a computing resource (e.g., one or more processors) and/or storage resources (e.g., memory, flash memory, and/or disks). A distributed storage system overlays a storage abstraction (e.g., key/value store or file system) on the storage resources of a distributed system. In the distributed storage system, a server process running on one computer can export that computer's storage resources to client processes running on other computers.
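By way of illustration only, the overlay of a key/value abstraction on the storage resources of several computers may be sketched as follows. The class names (StorageNode, DistributedKeyValueStore) and the hash-based placement of keys are assumptions made for this sketch; the description above does not specify how data is placed among the computers.

```python
import hashlib


class StorageNode:
    """Stands in for one computer's storage resources (memory, flash, or disk)."""

    def __init__(self, name):
        self.name = name
        self._blocks = {}  # local storage exported by this node's server process

    def put(self, key, value):
        self._blocks[key] = value

    def get(self, key):
        # Returns None when the key is not stored on this node.
        return self._blocks.get(key)


class DistributedKeyValueStore:
    """Key/value abstraction overlaid on the storage of several nodes."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one storage node is required")
        self._nodes = list(nodes)

    def _node_for(self, key):
        # Assumption: simple hash partitioning chooses the node for a key.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._nodes)
        return self._nodes[index]

    def put(self, key, value):
        self._node_for(key).put(key, value)

    def get(self, key):
        return self._node_for(key).get(key)
```

In this sketch, a client process may construct a DistributedKeyValueStore over a list of StorageNode objects and call put and get without knowing which node holds a given key. A practical system would additionally address network transport, replication, and node failure, none of which is modeled here.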
The distributed storage system may store “Big Data,” a term generally used for a collection of data sets so large and complex that on-hand database management tools or traditional data processing applications have difficulty handling them (e.g., capturing, storing, searching, transferring, analyzing, and visualizing the data, or maintaining information privacy). The size of Big Data is constantly expanding, ranging from a few dozen terabytes to many petabytes (1 PB = 10¹⁵ bytes = 10³ terabytes) of data.
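The unit relationship may be checked with a short calculation. The following is a minimal sketch assuming decimal (SI) units, in which one terabyte is 10¹² bytes and one petabyte is 10¹⁵ bytes; the constant and function names are illustrative only.

```python
BYTES_PER_TERABYTE = 10**12  # decimal (SI) terabyte
BYTES_PER_PETABYTE = 10**15  # decimal (SI) petabyte


def petabytes_to_bytes(petabytes):
    """Convert a whole number of petabytes to bytes."""
    return petabytes * BYTES_PER_PETABYTE


def petabytes_to_terabytes(petabytes):
    """Convert a whole number of petabytes to terabytes (1 PB = 1000 TB)."""
    return petabytes * (BYTES_PER_PETABYTE // BYTES_PER_TERABYTE)
```

Binary units (e.g., a pebibyte of 2⁵⁰ bytes) yield slightly different figures and are not used in this sketch.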
Big Data may be characterized by what is often referred to as the five V's: volume, variety, velocity, variability, and veracity. Volume relates to the quantity of data, which determines whether the data may actually be considered Big Data. Variety refers to the categories to which the data belongs; knowing these categories allows data analysts to analyze the data. Velocity relates to how fast the data is generated and processed to meet the demands and challenges of data growth and development. Variability refers to the inconsistency of the data, which ultimately affects the efficiency of handling the data. Finally, veracity refers to the accuracy of the data source, which in turn determines the accuracy of any analysis of the data.