Every day, several quintillion bytes of data may be created around the world. This data comes from everywhere: posts to social media sites, digital pictures and videos, purchase transaction records, bank transactions, sensors used to gather data and intelligence, like climate information, cell phone GPS signal, and many others. This type of data and its vast accumulation is often referred to as “big data.” This vast amount of data eventually is stored and maintained in storage nodes, such like hard disk drives (HDD), solid-state storage drives (SSD), and the like, and these may reside on networks or on storage accessible via the Internet, which may be referred to as the “cloud.” This stored data may also require processing, or be subject to operations, such as during a search, query, encryption/decryption, compression, decompression, and other processes. Typically, a processing device, such as a central processing unit (CPU), in a server performs operations on the data. The data is read from the storage node, processed by the CPU and the processed data is sent to the source of a request and/or stored back on the storage node. Standard storage nodes generally do not include computational resources to perform such operations on data stored in the storage node.
Moreover, standard storage node interfaces, such as Serial Advanced Technology Attachment (SATA), Fibre Channel, and Serial Attached SCSI (SAS), do not define commands to trigger the storage node to perform data operations in the storage node. Accordingly, operations are performed outside of the storage node, e.g., in a server CPU. To perform such an operation, a server uses standard read and write commands supported by existing storage node interfaces to move data from and to the storage node. Specifically, the server sends a standard read command to the storage node via a bus. The storage node then sends the stored data over the bus to the server, which typically holds the data in its main memory. The CPU of the server then performs operations on the data to produce a result. Depending on the type of operation, the server provides the result to a requesting source and/or stores the result on the storage node.
There are several disadvantages associated with this process of reading the data from the storage node, and processing the data within the server, and potentially storing the processed data back on the storage node. Because of these disadvantages, the process of performing data operations on the server is referred to as “costly” or “expensive” in terms of device performance and power consumption. Because the server is involved in every step of the process, this process occupies the CPU of the server, consumes power, blocks other user operations that otherwise could have been performed, and requires that the server contain a buffer, or a larger buffer than would otherwise be needed. The buffer is typically the main memory of the CPU, or double data rate (DDR) random access memory. This process also ties up the communication bus between the server and the storage node since data is sent from the storage node to the server and then back to the storage node. In other words, existing processes for searching and analyzing large distributed unstructured databases are time-consuming and use large amounts of resources such as CPU utilization, memory footprint, and energy. Finally, these processes prevent the storage node management system from performing more sophisticated optimizations.
In summary, typical operations like search (query), decryption, and data analysis are, in existing systems, performed on the local server's CPU. Search and processing are performed over the entire data residing in storage nodes (e.g., solid state drives (SSDs), hard disk drives (HDDs), etc.) within the server. Data needs to be moved from the storage node into the CPU memory where it can then be processed. This is inefficient, e.g., slow, because a single server CPU, which may control a large collection of storage nodes, has relatively little processing power with which to process the large volume of data stored on the collection of storage nodes. Moreover, requiring the server's CPU to do this work makes inefficient use of energy as well, in part because a general-purpose CPU like a server CPU generally is not designed to perform operations such as searching efficiently, and in part because transferring data over a data bus and across the interface to the storage node requires a significant amount of power. Thus, there is a need for a system and method for more efficiently processing data stored on storage nodes.