The present invention relates to large database systems, such as data warehouses on Hadoop, and more specifically to predictively placing columns of a database in a multi-tier storage system during ingestion of data into the database.
As used herein, a large database is a columnar or hybrid columnar database that stores data tables as sections of columns of data rather than as rows of data and that includes at least one petabyte of data. Such large databases are currently used for a variety of applications in which very large amounts of data are stored, queried, aggregated, analyzed, and searched, such as business data analytics applications. Modern business data analytics applications compute aggregates over a large amount of data to roll up information along an increasing number of dimensions such as geographic regions, demography, users, products, etc. Traditionally, online business analytics databases executed such queries by performing sequential scans over a significant portion of the database. As a result of the increasing sizes and dimensions of today's large analytics databases, and the increasing importance of interactive query response times, querying by scanning the entire database is not feasible. In addition to the size of large databases, other factors make low latency querying of large databases with known techniques difficult. For example, in online analytics databases the percentage of database queries that are ad hoc in nature is high, which makes the creation of an index for the large database difficult. The large number of dimensions renders techniques such as pre-computed cubes exorbitant in both space and computation. Simple caching solutions may not work either: the data may be too large to be cached (a large number of the analytics queries access both historical and recent data), cached data may have been overwritten during a lengthy ingest, or ad hoc queries may end up accessing data that was not accessed earlier. As a result, the ability to quickly access un-cached data from storage plays an important role in interactive processing of ad hoc queries.
Currently, these large analytics databases are stored on traditional data storage devices such as hard disk drives and the like. Recently, in an attempt to improve the performance of large columnar databases, some large databases have been stored on high performance storage devices such as solid state devices or flash memory and the like. Flash offers a 40-1000× improvement in random access performance and allows a much higher degree of internal IO parallelism than disks, as flash devices are built on an array of flash memory packages. RAM's volatility mandates an additional durable copy of the data in a non-volatile medium, resulting in significant data duplication. Flash, by contrast, is non-volatile and hence obviates the need for data duplication; it can serve as the primary and only tier for the data. We make a case for leveraging high performance, non-volatile, small footprint, and low power storage media such as flash in the Big Data storage hierarchy as a durable storage tier.
While storing large databases on high performance storage devices increases the speed of some queries on the large databases, the increased performance comes at a high cost, as high performance storage devices are much more expensive than traditional data storage devices. A majority of the known database storage tiering techniques aim to place popular, frequently accessed data on a high performance, randomly accessible tier such as a flash tier. Data popularity is determined in a reactive manner, whereby a module records the access patterns of the workload and, if a data set starts to be accessed frequently, the data set is deemed popular and is moved to the high performance tier. Such reactive techniques need a reaction time to determine the access patterns and are unable to provide upfront performance guarantees to ad hoc queries. Furthermore, a majority of these techniques operate at the much lower block level and do not have semantic knowledge about the data at the columnar level.
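By way of illustration only, the reactive approach described above can be sketched in Python as follows. The class name `ReactiveTierer`, the `promote_after` threshold, and the block identifiers are hypothetical and are not taken from any particular known system; the sketch merely shows why the first several accesses to a data set are still served from the slow tier.

```python
from collections import Counter


class ReactiveTierer:
    """Toy model of reactive tiering: promote a block only after it proves popular."""

    def __init__(self, promote_after: int):
        # Hypothetical popularity threshold: accesses needed before promotion.
        self.promote_after = promote_after
        self.access_counts = Counter()
        self.flash = set()  # identifiers of blocks currently on the fast tier

    def read(self, block_id: str) -> str:
        """Record one access and return the tier that served it."""
        tier = "flash" if block_id in self.flash else "hdd"
        self.access_counts[block_id] += 1
        # Promotion happens after the access is served, so it only helps later reads.
        if self.access_counts[block_id] >= self.promote_after:
            self.flash.add(block_id)
        return tier


tierer = ReactiveTierer(promote_after=3)
served_from = [tierer.read("block-A") for _ in range(4)]
print(served_from)
```

In this sketch the first three reads are served from the hard disk tier and only the fourth benefits from flash, which is the reaction time noted above. An ad hoc query touching a block that has never been read receives no benefit at all.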
Flash is more expensive than HDD and offers less capacity. It is important to place in flash the right subset of columns, namely the subset that yields the highest performance per dollar, in order to justify the higher cost of flash. Not all columns are alike, and they will not all yield the same performance increase from being placed in flash. Flash yields the highest performance when data is accessed randomly, as flash is 40-1000× faster than HDD for random IOPS and only 2-7× faster for sequential accesses.
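The trade-off above can be made concrete with a rough, illustrative calculation in Python. This is a minimal sketch under stated assumptions and is not the placement method of the invention: the `Column` fields, the example columns, and the greedy ranking are hypothetical, and the speedup constants are simply the conservative ends of the ranges quoted above.

```python
from dataclasses import dataclass

# Conservative ends of the ranges quoted in the text
# (40-1000x for random accesses, 2-7x for sequential accesses).
RANDOM_SPEEDUP = 40.0
SEQUENTIAL_SPEEDUP = 2.0


@dataclass
class Column:
    name: str               # hypothetical column name
    size_gb: float          # flash space the column would occupy
    random_secs: float      # assumed HDD time spent on random reads of the column
    sequential_secs: float  # assumed HDD time spent on sequential reads of the column


def seconds_saved(col: Column) -> float:
    """Estimated time saved by serving the column from flash instead of HDD."""
    return (col.random_secs * (1 - 1 / RANDOM_SPEEDUP)
            + col.sequential_secs * (1 - 1 / SEQUENTIAL_SPEEDUP))


def pick_columns_for_flash(columns, flash_capacity_gb):
    """Greedy heuristic: take columns by seconds saved per GB until flash is full."""
    ranked = sorted(columns, key=lambda c: seconds_saved(c) / c.size_gb, reverse=True)
    chosen, used_gb = [], 0.0
    for col in ranked:
        if used_gb + col.size_gb <= flash_capacity_gb:
            chosen.append(col.name)
            used_gb += col.size_gb
    return chosen


columns = [
    Column("user_id", size_gb=10, random_secs=400, sequential_secs=0),
    Column("event_log", size_gb=10, random_secs=0, sequential_secs=400),
    Column("region", size_gb=5, random_secs=100, sequential_secs=100),
]
print(pick_columns_for_flash(columns, flash_capacity_gb=15))
```

With these made-up numbers, two columns of equal size and equal HDD read time differ sharply in benefit: the randomly accessed column saves an estimated 390 seconds while the sequentially scanned column saves only 200, so the former earns its flash space first.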