Automated data processing often involves the joining of tables. Specifically, fields from two or more tables are combined by using common values to create a combined table. FIG. 1 is a diagram illustrating two example tables that are joined. Table 152 is referred to as a driving table (or anchor table), which includes the following columns of data records: transaction identifier, buyer name, seller name, and transaction data. Table 154 is referred to as a lookup table, which includes the following columns of data records: account identifier, user name, and zip code. The buyer name column from table 152 and the user name column from table 154 can be used as the columns to be joined, and the joined result is table 156. This particular form of join is referred to as the left join, where the result table contains records in the anchor table (also referred to as the left table) regardless of whether there is a matching record in the lookup table (also referred to as the right table).
To perform a join in an existing language or program that supports the join operation, such as Structured Query Language (SQL) or Microsoft Excel®, the operator-user typically needs to specify the columns to be joined. Take the example shown in FIG. 1, to join the tables, the operator-user typically needs to specify that the buyer name column from table 152 and the user name column from table 154 are to be used as the columns to be joined. To effectively identify which columns should be joined can be difficult for the operator-user, especially when the amount of data involved is large. For example, there are often multiple join strategies available (e.g., using buyer name and user name as the join columns, using seller name and user name as the join columns, etc.), and sometimes the data that can be joined does not match exactly (e.g., the data in one table includes punctuation while the data in the other table does not). Further, the data tables being processed can originate from different sources (e.g., a transactions table from a web service operator and a products table from a vendor) and the data formats can be different (e.g., the transactions table lists phone numbers in a single column, while the products table lists area codes and local phone numbers in two separate columns). Thus, it can be difficult to detect and evaluate the possible joins.
Moreover, for large scale web applications, it is not uncommon for tables to have several billions of records; it is therefore nearly impossible to manually review the contents of the records to determine which columns have records that can be combined. Further, computer-based comparison of the records for determining whether the records match can be computationally intensive, which can lead to slow application response time.