With the rapid development and widespread use of computer technologies in the last few decades, large volumes of digital data are generated on a daily basis. The need to organize and manage such a huge amount of data has driven the development of database technologies. Relational database management systems (“RDBMS”), such as Oracle Database Management System, Microsoft SQL Database Management System and MySQL Database Management System, have thus been proposed and have gained broad acceptance for data management. Such database management systems store data by rows of tables. Querying and retrieving data from such conventional databases oftentimes involves retrieving a list of records that contain information that was not requested. For example, the following illustrative SQL query causes a conventional database management system to read all fifty matching rows in full from the disk drive storing the rows:
select column1 from table1 where key>100 and key<151
In the illustrative SQL query, column1 is a column of table1, and key is another column (such as a primary key) of table1. While only data in column1 is requested, data in the other columns of table1 is also read from the storage disk drive. Furthermore, conventional database management systems do not store data in an ordered manner on physical disk drives. However, many types of data (such as network logs, network access data, financial transaction data, weather data, etc.) are of extremely high volume and are ordered by time. Accordingly, there is a need for a highly parallel and efficient database system that is optimized for managing large volumes of time-based data. There is a further need for a highly parallel and efficient database system that stores data by columns for faster and more efficient data retrieval.
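The difference between the two layouts for the illustrative query above can be sketched as follows. This is a minimal in-memory model rather than a description of any particular database management system: the three-column shape of table1, the integer keys 1 through 200, the sorted key order, and the helper names (matching_positions, read_row_store, read_column_store) are all assumptions made for illustration. The sketch simply counts how many stored values each layout touches to answer the query.

```python
from bisect import bisect_left, bisect_right

# Hypothetical table1 with three columns; integer keys 1..200 in sorted order.
KEYS = list(range(1, 201))
COLUMN1 = [f"a{k}" for k in KEYS]
COLUMN2 = [f"b{k}" for k in KEYS]


def matching_positions(keys, low, high):
    """Positions of keys with low < key < high, assuming keys are sorted."""
    return range(bisect_right(keys, low), bisect_left(keys, high))


def read_row_store(low, high):
    """Row layout: each matching row is read whole, so every column is touched."""
    rows = list(zip(KEYS, COLUMN1, COLUMN2))
    values_read = 0
    result = []
    for pos in matching_positions(KEYS, low, high):
        row = rows[pos]
        values_read += len(row)  # key, column1 and column2 all come off disk
        result.append(row[1])    # only column1 was actually requested
    return result, values_read


def read_column_store(low, high):
    """Column layout: only the requested column is read for matching positions."""
    positions = matching_positions(KEYS, low, high)
    result = [COLUMN1[pos] for pos in positions]
    return result, len(result)


row_result, row_values = read_row_store(100, 151)
col_result, col_values = read_column_store(100, 151)
```

Under these assumptions both functions return the same fifty column1 values, but the row layout touches 150 stored values while the column layout touches 50. The gap grows with the number of columns that are not requested.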
Conventional database management systems typically generate a large number of indexes for data. Such indexes logically identify rows (also referred to herein as records). Rows of data within a table are stored on disk drives. Related rows, such as rows that follow a particular order by time, are usually not stored consecutively on disk drives. Rows can also be related by other factors. Retrieving a set of related records thus involves multiple disk reads of data dispersed at different locations on a disk drive. Accordingly, there is a need for a highly parallel and efficient database system that stores related data consecutively or nearby on a disk drive to reduce the number of disk reads in serving a data request, and that provides an efficient structure for locating such data on a disk drive. There is a further need for the new database management system to load the structure into memory for higher performance in locating data on disk drives.
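The effect of dispersed versus consecutive placement on the number of disk reads can be sketched with a simple block-counting model. The block size of ten rows, the row positions, and the helper name blocks_touched are hypothetical values chosen for illustration; an actual disk drive and file system would behave in a more complicated way.

```python
ROWS_PER_BLOCK = 10  # assumed number of rows that fit in one disk block


def blocks_touched(positions, rows_per_block=ROWS_PER_BLOCK):
    """Number of distinct disk blocks that must be read to fetch the given rows."""
    return len({pos // rows_per_block for pos in positions})


# Fifty related rows dispersed across the disk (every 37th row position).
dispersed = [i * 37 for i in range(50)]

# The same number of related rows stored consecutively.
consecutive = list(range(1000, 1050))

dispersed_reads = blocks_touched(dispersed)
consecutive_reads = blocks_touched(consecutive)
```

In this model the dispersed rows each fall in a different block, so fifty block reads are needed, while the consecutive rows occupy only five blocks.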
To improve data retrieval performance, conventional database management systems take advantage of high-end hardware platforms, such as a computer with multiple sockets and a large amount of memory. Each of the sockets includes one or more processing units (also interchangeably referred to herein as cores). A processing unit housed in one socket can access resources (such as disk drives and memory) local to another socket. Such cross-socket access incurs a performance penalty due to the latency and bandwidth limitations of the cross-socket interconnect. Accordingly, there is a need for a highly parallel and efficient database management system that improves performance by avoiding access across socket boundaries. The present disclosure incorporates novel solutions to overcome the above-mentioned shortcomings of conventional database management systems.