Some computational tasks are especially suitable for recursive processing. One such task is “frequent itemset” determination. An “itemset” is a set of items. For example, one itemset might include the items (apple, banana), another itemset might include the items (apple, orange), and yet another itemset might include the items (banana, orange). An itemset is “frequent”, relative to a set of data structures, if the number of the data structures that contain all of the items in the itemset is at least a specified fraction of the total number of the data structures in the set.
For example, in a set of three data structures, each data structure might represent a different customer's transaction at a supermarket. A first data structure might contain the items (apple, banana, milk), while a second data structure might contain the items (apple, banana, milk, orange), while a third data structure might contain the item (orange). Assuming that the specified fraction is ⅔, the itemset (apple, banana) is a frequent itemset because “apple” occurs with “banana” in two of the three data structures, but the itemsets (apple, orange) and (banana, orange) are not frequent itemsets because “apple” occurs with “orange” in only one of the three data structures and “banana” occurs with “orange” in only one of the three data structures. As the number of data structures in a set of data structures increases, the determination of whether a particular itemset is frequent relative to that set of data structures becomes more computationally intensive.
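The definition and the supermarket example can be checked with a minimal Python sketch. The names `support_count` and `is_frequent` are hypothetical and chosen only for illustration, and each data structure is assumed to be representable as a plain set of item names.

```python
from fractions import Fraction

# The three data structures (customer transactions) from the example.
transactions = [
    {"apple", "banana", "milk"},
    {"apple", "banana", "milk", "orange"},
    {"orange"},
]

def support_count(itemset, transactions):
    """Count the data structures that contain every item in the itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)

def is_frequent(itemset, transactions, min_fraction):
    """Return True if the itemset occurs in at least min_fraction of the data structures."""
    fraction = Fraction(support_count(itemset, transactions), len(transactions))
    return fraction >= min_fraction

# The specified fraction assumed in the example is two thirds.
threshold = Fraction(2, 3)

print(is_frequent(("apple", "banana"), transactions, threshold))   # True
print(is_frequent(("apple", "orange"), transactions, threshold))   # False
print(is_frequent(("banana", "orange"), transactions, threshold))  # False
```

Each call scans every data structure once, which is why the cost of a single determination grows with the size of the set.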
Frequent itemset determination lends itself especially well to recursive processing due at least in part to the observation that an N-element itemset cannot be a frequent itemset relative to a set of data structures unless all of the (N−1)-element subsets of the N-element itemset are also frequent itemsets relative to that set of data structures. For example, the 3-element itemset (apple, banana, milk) cannot be a frequent itemset relative to the set of data structures in the above example unless all of the 2-element subsets of that 3-element itemset, namely, (apple, banana), (apple, milk), and (banana, milk), are also frequent itemsets relative to the set of data structures in the above example.
This observation allows the computationally intensive determination of whether itemsets are frequent to be performed for fewer itemsets. The determination of whether a particular N-element itemset is frequent needs to be performed only if all of the (N−1)-element subsets of the particular N-element itemset are also frequent. Thus, for each successive value of N, the group of N-element itemsets for which this determination needs to be performed can be based on the determinations already performed for the (N−1)-element itemsets. Frequent itemset counting is, therefore, a task that can be performed more efficiently using a recursive approach.
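As a rough illustration of how this observation shrinks the work, the following Python sketch builds the N-element itemsets that still need to be tested from the (N-1)-element itemsets already found to be frequent. The function name `generate_candidates` is hypothetical, and the frequent itemsets listed are the ones that hold for the three-transaction supermarket example with a specified fraction of ⅔.

```python
from itertools import combinations

def generate_candidates(frequent_prev, n):
    """Return the n-element itemsets whose (n-1)-element subsets are all frequent.

    frequent_prev is a set of frozensets, each an (n-1)-element frequent itemset.
    Passing this check is necessary for an itemset to be frequent, not sufficient;
    each returned candidate must still be counted against the data.
    """
    # Only items that appear in some frequent (n-1)-element itemset can matter.
    items = sorted(set().union(*frequent_prev))
    candidates = []
    for combo in combinations(items, n):
        if all(frozenset(sub) in frequent_prev for sub in combinations(combo, n - 1)):
            candidates.append(frozenset(combo))
    return candidates

# Frequent 2-element itemsets in the supermarket example (each occurs in 2 of 3).
frequent_2 = {
    frozenset({"apple", "banana"}),
    frozenset({"apple", "milk"}),
    frozenset({"banana", "milk"}),
}

candidates_3 = generate_candidates(frequent_2, 3)
print(candidates_3 == [frozenset({"apple", "banana", "milk"})])  # True
```

In this example only one 3-element itemset has to be counted, rather than all four 3-element subsets of (apple, banana, milk, orange).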
According to one theoretical approach, frequent itemsets might be determined in the following manner. An application that is external to a database server might send a query to the database server. When executed, the query would cause the database server to select, from a set of data structures, each data structure that contains all of the items in a specified itemset. The database server would execute the query and return the selected data structures to the application. The application might count the selected data structures and determine whether the number of selected data structures meets a specified threshold. If the number of selected data structures met the specified threshold, then the application might place the specified itemset in a set of frequent itemsets. The application might perform the above steps for each 1-element itemset that is a subset of an M-element itemset, one 1-element itemset at a time, and one 1-element itemset after another.
Once the application had performed the above steps for each such 1-element itemset, the application might determine, for each particular 2-element subset of the M-element itemset, whether all of the 1-element subsets of that particular 2-element subset are contained in the set of frequent itemsets. If all of the 1-element subsets of the particular 2-element subset were contained in the set of frequent itemsets, then the application might send, to the database server, a query that would cause the database server to select, from the set of data structures, each data structure that contains all of the items in the particular 2-element itemset. The database server would execute the query and return the selected data structures to the application. The application might count the selected data structures and determine whether the number of selected data structures meets the specified threshold. If the number of selected data structures met the specified threshold, then the application might place the particular 2-element itemset in the set of frequent itemsets. The application might perform the above steps for each 2-element itemset that is a subset of the M-element itemset, one 2-element itemset at a time, and one 2-element itemset after another.
For each successive value of N, the application might perform the above steps for the N-element itemsets that are subsets of the M-element itemset until N was greater than M or there were no (N−1)-element itemsets in the set of frequent itemsets, whichever came first. Thus, by sending a multitude of queries to a database server in serial manner and counting the results of such queries, the application might determine frequent itemsets that are subsets of the M-element itemset.
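The application-side procedure described above might be sketched in Python as follows. This is an illustration under stated assumptions, not an actual implementation: `StubServer` and its `select_containing` method are hypothetical stand-ins for a database server and a query, the specified threshold is assumed to be a count of data structures, and the data is the three-transaction supermarket example.

```python
from itertools import combinations

class StubServer:
    """Hypothetical stand-in for the database server; not a real database API."""

    def __init__(self, data_structures):
        self.data_structures = [frozenset(d) for d in data_structures]
        self.queries_executed = 0

    def select_containing(self, itemset):
        """Simulate one query: return each data structure containing all the items."""
        self.queries_executed += 1
        return [d for d in self.data_structures if itemset <= d]

def find_frequent_itemsets(server, m_itemset, min_count):
    """Find the frequent subsets of m_itemset, one query per candidate itemset."""
    items = sorted(m_itemset)
    frequent = set()
    previous_level = set()
    for n in range(1, len(items) + 1):
        current_level = set()
        for combo in combinations(items, n):
            # Skip an N-element itemset unless all (N-1)-element subsets are frequent.
            if n > 1 and not all(
                frozenset(sub) in previous_level for sub in combinations(combo, n - 1)
            ):
                continue
            selected = server.select_containing(frozenset(combo))  # one round trip
            if len(selected) >= min_count:
                current_level.add(frozenset(combo))
        if not current_level:
            break  # no frequent N-element itemsets, so no larger candidates exist
        frequent |= current_level
        previous_level = current_level
    return frequent

server = StubServer([
    {"apple", "banana", "milk"},
    {"apple", "banana", "milk", "orange"},
    {"orange"},
])
result = find_frequent_itemsets(server, {"apple", "banana", "milk", "orange"}, min_count=2)
print(len(result))              # 8
print(server.queries_executed)  # 11
```

Even with pruning, this small example issues 11 separate queries (4 for the 1-element itemsets, 6 for the 2-element itemsets, and 1 for the 3-element itemset), each of which returns whole data structures to the application merely to be counted.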
Unfortunately, considerable overhead would be involved in the above approach. It would take significant time for the application to send the many queries to the database server and for the database server to send the results of the many queries back to the application.
Furthermore, because most of the operations performed in the above approach would be performed by the application (the database server would just execute queries and return the results), application programmers would be burdened with implementing the functionality required to perform most of the operations involved in the above approach.
These are some of the problems that would attend the above approach. A technique that overcomes these problems is needed.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.