This specification generally relates to data processing techniques, and more specifically to performing hash joins in a way optimized for parallel processing computer systems (e.g., multi-core processors).
The growth of data analytic platforms, such as Big Data Analytics, has expanded data processing into a tool for extracting information having business value from large volumes of data. Efficient data processing techniques are needed to access, process, and analyze large sets of data from differing data sources for this purpose. For example, a small business may utilize a third-party data analytics environment employing dedicated computing and human resources to gather, process, and analyze vast amounts of data from various sources, such as external data providers, internal data sources (e.g., files on local computers), Big Data stores, and cloud-based data (e.g., social media information). Processing such large data sets in a manner that extracts useful quantitative and qualitative information typically requires complex software tools implemented on powerful computing devices.
A join algorithm is a data processing technique employed when processing multiple data sets such as those described above. Existing data processing systems can utilize multiple join algorithms (e.g., hash joins, nested loop joins, sort-merge joins), each having respective performance tradeoffs, to perform logical joins between two sets of data. As an example, the hash join has expected complexity O(M+N), where M and N are the numbers of tuples in the two tables being joined. However, the hash join algorithm may have unfavorable memory access patterns (e.g., random disk access) and may also be slow to execute. Thus, existing data processing systems can suffer performance issues when executing join algorithms.
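For illustration only, the conventional hash join referred to above can be sketched as a build phase followed by a probe phase. The sketch below is a minimal, single-threaded, in-memory example and is not the parallel technique of this specification. The function name `hash_join`, its parameters, and the sample tables are hypothetical and are chosen solely for this example. Each table is scanned once, which is the source of the expected O(M+N) complexity, and each probe is a lookup at an effectively random location in the hash table, which is the kind of access pattern noted above as unfavorable.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of tuples on the given column positions.

    Hypothetical helper for illustration: build_rows is typically the
    smaller table, and build_key / probe_key are column indexes.
    """
    # Build phase: one pass over the build table (M tuples),
    # grouping rows by their join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: one pass over the probe table (N tuples).
    # Each lookup lands at an effectively random hash-table location.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], ()):
            joined.append(match + row)
    return joined


# Sample data (assumed for this example only).
customers = [(1, "Ada"), (2, "Grace")]                 # (customer_id, name)
orders = [(101, 1), (102, 2), (103, 1), (104, 3)]      # (order_id, customer_id)

result = hash_join(customers, orders, build_key=0, probe_key=1)
```

In this sketch, order 104 refers to a customer that is absent from the build table and therefore produces no output row, consistent with an inner join.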