There is a growing interest in discovering knowledge from complex data that is organized as trees, rather than as a single relational table. Example applications include, but are not limited to, manipulating molecular data, XML data and Web content. By way of illustration, modern web applications often include content that is automatically generated using web toolkits or templates whose content is filled from databases. Such HTML documents can be very complex. For example, a search page presents a simple form that a user perceives as a few interface objects, yet that search page may actually include a hundred or more objects. While automatically generated content tends to be complex, this type of content also tends to be consistent. Thus, the same functional components tend to have a similar Document Object Model (DOM) structure.
The importance of recognizing repeating structures in web applications has generated a large body of work on algorithms for mining the HTML Document Object Model (DOM). It has been recognized that HTML documents form trees, and that a tree “edit distance” constitutes a good similarity measure between DOM structures. Consider, however, looking for patterns that form subtrees within a web page with many elements. The edit distance operations must be computed for all subtrees, and the execution time is therefore orders of magnitude higher than for a single comparison. Considering the quantity of data in HTML, the size of the DOM for modern web applications, and the need for interactive pattern discovery, computation time remains an issue.
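By way of illustration only, the following minimal sketch shows why comparing every pair of subtrees becomes expensive. It is not taken from any of the work referred to above. The names `Node`, `top_down_distance` and `count_subtree_pairs` are hypothetical, and the distance shown is a simplified top-down variant (whole subtrees are inserted or deleted, and only roots are relabeled), assumed here in place of the general tree edit distance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a DOM element: a tag plus ordered children."""
    tag: str
    children: tuple = ()


def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def top_down_distance(a, b):
    """Simplified top-down edit distance between two ordered trees.

    Assumed unit costs: relabeling a root costs 1, and inserting or
    deleting a child costs the size of that child's whole subtree.
    """
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    # dp[i][j]: cost of turning the first i children of a
    # into the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),      # delete subtree
                dp[i][j - 1] + size(b.children[j - 1]),      # insert subtree
                dp[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + dp[m][n]


def all_subtrees(node):
    """Yield the subtree rooted at every node of the tree."""
    yield node
    for child in node.children:
        yield from all_subtrees(child)


def count_subtree_pairs(root):
    """Number of unordered subtree pairs a naive search would compare."""
    n = sum(1 for _ in all_subtrees(root))
    return n * (n - 1) // 2


# Two similar, hypothetical form fragments and a page containing both.
form_a = Node("form", (Node("input"), Node("button")))
form_b = Node("form", (Node("input"), Node("input"), Node("button")))
page = Node("body", (form_a, form_b))

distance = top_down_distance(form_a, form_b)
pairs = count_subtree_pairs(page)
```

In this sketch the number of subtree pairs grows quadratically with the number of nodes, and each pair requires its own dynamic program, which is the source of the cost described above.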