The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
“Big data” describes a collection of data sets so large and complex that it becomes difficult to process using traditional database management tools or traditional data processing applications. Today, enterprise and data center applications demand that increasingly large data sets be processed with minimal latency.
One consideration for data processing application deployments is configuring a database system for optimal query performance. A large data set may comprise hundreds of columns across billions of rows, and queries targeting the data set may include predicates on any of those columns. For example, insurance data may include hundreds of insurance portfolios, each of which comprises numerous insurance contracts, which in total cover hundreds of thousands of properties. Each property may include hundreds of attributes, such as address, regional information, soil type, structure type, etc.
An example query may be a request to retrieve all properties located within five miles of a coastline that have a wooden structure and sit on top of sandy soil. A typical database system may not scale to store this amount of data. Furthermore, such a request would take a significant amount of time to compute and provide results. Key-value data stores, such as Cassandra or HBase, may have better query processing times but cannot process queries that may include predicates on any data column.
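For illustration only, the example query can be read as a conjunction of predicates, each on a different column. The following minimal sketch evaluates such a query by scanning every row, which is the kind of work that becomes slow at billions of rows. The column names (`miles_to_coast`, `structure_type`, `soil_type`), the sample rows, and the `scan` helper are hypothetical and are not taken from any particular system; the sketch also assumes that "five miles from a coastline" means within five miles.

```python
from typing import Callable, Dict, List

# Hypothetical property records; all field names and values are illustrative.
PROPERTIES: List[Dict[str, object]] = [
    {"id": 1, "miles_to_coast": 2.5, "structure_type": "wood", "soil_type": "sand"},
    {"id": 2, "miles_to_coast": 12.0, "structure_type": "wood", "soil_type": "sand"},
    {"id": 3, "miles_to_coast": 4.0, "structure_type": "brick", "soil_type": "sand"},
    {"id": 4, "miles_to_coast": 1.0, "structure_type": "wood", "soil_type": "clay"},
    {"id": 5, "miles_to_coast": 5.0, "structure_type": "wood", "soil_type": "sand"},
]


def scan(
    rows: List[Dict[str, object]],
    predicates: Dict[str, Callable[[object], bool]],
) -> List[Dict[str, object]]:
    """Return the rows that satisfy every per-column predicate.

    Every row is visited, so the cost grows linearly with the number of rows.
    """
    return [
        row
        for row in rows
        if all(col in row and test(row[col]) for col, test in predicates.items())
    ]


# The example query: one predicate per column, on columns chosen at query time.
QUERY: Dict[str, Callable[[object], bool]] = {
    "miles_to_coast": lambda miles: miles <= 5.0,
    "structure_type": lambda kind: kind == "wood",
    "soil_type": lambda soil: soil == "sand",
}

matching_ids = [row["id"] for row in scan(PROPERTIES, QUERY)]
```

Because the predicate columns are not known in advance, a store organized around a single key cannot narrow the search for this query, and the scan above must touch every row.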
Thus, a data processing system that accepts queries on any column of a large data set and provides search results without significant delay is desired.