More and more customers are adopting cluster computing technology, such as an Apache® Spark® cluster, and leveraging its power and ease of use through a consistent set of Application Programming Interfaces (APIs) for batch, interactive, and stream data processing. (Apache and Spark are registered trademarks of the Apache Software Foundation in the United States and/or other countries.)
Data that is stored in a cluster at a customer site may be referred to as data in an on-premise cluster. That is, an on-premise cluster is typically at the customer's physical location. Processing the data may require transferring it from the customer site to a cloud cluster, performing the processing there, and returning the data to the customer site. However, to protect data entering and leaving the cloud cluster, a secure gateway must first be configured so that the data is transferred with proper security protection.
This approach causes multiple issues. For example, data transfer is inefficient for a large dataset, because the data must be moved to the cloud cluster, processed there, and then sent back to the on-premise cluster. In addition, there is a potential security issue. To mitigate the risk, some systems have introduced data masking technology to mask key columns of the data, such as Personally Identifiable Information (PII) (e.g., a social security number), and this introduces additional computational complexity.
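As an illustration of the column masking described above, the following is a minimal sketch in plain Python rather than any particular cluster API. The function names `mask_value` and `mask_columns`, the salted SHA-256 scheme, and the sample record are all assumptions made for this example; the systems referred to above may mask data in entirely different ways.

```python
import hashlib


def mask_value(value: str, salt: str) -> str:
    """Replace a sensitive value with a salted SHA-256 hex digest.

    Hypothetical masking scheme chosen for illustration only.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_columns(rows, sensitive_columns, salt):
    """Return new rows in which the named columns are masked.

    Columns not listed in sensitive_columns pass through unchanged,
    and the input rows are not modified.
    """
    return [
        {
            key: mask_value(str(val), salt) if key in sensitive_columns else val
            for key, val in row.items()
        }
        for row in rows
    ]


if __name__ == "__main__":
    # Made-up record; the social security number is a placeholder value.
    rows = [{"name": "Alice", "ssn": "000-00-0000", "balance": 120.5}]
    masked = mask_columns(rows, {"ssn"}, salt="example-salt")
    print(masked)
```

In this sketch every row must be visited and every sensitive value hashed before the data leaves the on-premise cluster, which suggests where the additional computation mentioned above could come from.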