In a distributed database system, creating appropriate indexes allows the optimizer to speed up queries and find data more efficiently. Index creation operations are frequently complex, time-consuming, and error-prone. Some systems attempt to address these issues with a parallel index creation approach using a producer-consumer mode, where a producer thread scans data rows and passes them to various consumer threads based on a distribution map. However, because the consumer threads depend on the producer thread and on one another, such producer-consumer approaches cannot satisfy the requirements of creating indexes on large tables under certain circumstances, especially when the underlying base table contains hundreds of gigabytes of data.
For example, when the base table is compressed, the producer must decompress the rows before passing them to the consumer threads, and may therefore become a bottleneck. Furthermore, the base table, particularly the leading columns of the index key, may be nearly sorted. This leads to an uneven distribution in which the producer processes a long run of rows destined for the same consumer, so that only one consumer is busy at any given time. Moreover, the data exchange between the producer and the consumer threads uses limited resources, such as data buffers, which may cause the threads to compete for resources and wait for each other.
These problems are exacerbated by the fact that such systems frequently support only one producer for creating indexes. Therefore, conventional systems fail to provide an ideal parallel index creation mechanism with optimal performance, CPU scaling, and availability.
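The uneven distribution described above can be illustrated with a small sketch. It assumes a range-based distribution map with four consumers and a hypothetical table of 1000 integer keys. The function names, the boundaries, and the window size are illustrative assumptions, not part of any particular system. A fully sorted scan stands in as the limiting case of a nearly sorted table.

```python
from bisect import bisect_right

def route(key, boundaries):
    """Return the consumer index for a key under a range-based distribution map."""
    return bisect_right(boundaries, key)

def busy_consumers_per_window(keys, boundaries, window):
    """For each window of consecutively scanned rows, count how many
    distinct consumers receive at least one row."""
    counts = []
    for start in range(0, len(keys), window):
        targets = {route(k, boundaries) for k in keys[start:start + window]}
        counts.append(len(targets))
    return counts

# Assumed setup: four consumers owning key ranges split at 250, 500, and 750.
boundaries = [250, 500, 750]

# Scan order of a sorted base table (limiting case of "nearly sorted").
sorted_keys = list(range(1000))

# Scan order of a table whose keys are spread out; a deterministic
# stride permutation is used here instead of a random shuffle.
scattered_keys = [(i * 37) % 1000 for i in range(1000)]

sorted_busy = busy_consumers_per_window(sorted_keys, boundaries, window=50)
scattered_busy = busy_consumers_per_window(scattered_keys, boundaries, window=50)

print(max(sorted_busy))     # 1: every window feeds a single consumer
print(min(scattered_busy))  # 4: every window feeds all four consumers
```

In the sorted case, the single producer feeds one consumer at a time while the others wait, which is the behavior the passage identifies as limiting CPU scaling.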