Modern enterprise applications involve a large number of components. Passing data from one component to the next requires many traversals of the data set (on the order of n traversals for n components), and this traversal time often dominates the CPU time spent on the application's actual business logic. Most enterprise application environments do nothing to eliminate unnecessary traversals, because for such a system to consolidate traversals today, the components, which are usually written in procedural languages, must be recoded to be aware of one another, reducing their reusability. The only way to keep the benefits of componentization while removing the cost of repeatedly recoding the data is to pass the components to a compiler and compile the recodings away; this process is known as deforestation. We know of no system today that has a generic mechanism for eliminating redundant and unneeded traversals.
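As a minimal illustration of the idea (hypothetical code, not any particular system), a staged pipeline builds an intermediate list per component and traverses the data once per stage, while a hand-fused version performs a single traversal with no intermediate structures:

```python
# Staged pipeline: each stage is a separate "component" with its own
# traversal and its own intermediate data structure.
def staged(xs):
    doubled = [x * 2 for x in xs]              # traversal 1, intermediate list
    positive = [x for x in doubled if x > 0]   # traversal 2, intermediate list
    return sum(positive)                       # traversal 3

# Fused pipeline: the same computation after (manual) deforestation,
# one traversal and no intermediate lists.
def fused(xs):
    return sum(x * 2 for x in xs if x * 2 > 0)
```

The two functions compute the same result; deforestation aims to obtain the fused form automatically, so that the components can remain separately written and reusable.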
A large class of real-world enterprise business applications is written as components that must reformulate all of the data flowing through them, often several times. Each reformulation costs time to perform and memory to store the reformulated data.
Deforestation is a program optimization that removes intermediate trees. Finite state automata have long been known as a general-purpose computing construct, well understood by computer scientists and easy to reason about. Techniques are also known for turning general functions into finite state automata, and for collapsing any sequence of successive finite state automata into a single automaton, which accomplishes further deforestation.
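The collapsing step can be sketched with a product construction over two tiny finite-state transducers (a simplified, hypothetical encoding: each transducer is a dict mapping a state and input symbol to a next state and output symbol). Composing them yields a single machine whose one pass over the input replaces two successive traversals:

```python
def compose(t1, t2, start1, start2):
    """Compose transducers: return a step function and start state for T2 after T1.

    The composed machine runs over product states (s1, s2), so the input
    is traversed once instead of once per transducer.
    """
    def step(state, sym):
        s1, s2 = state
        n1, out1 = t1[(s1, sym)]
        if out1 is None:                     # T1 emitted nothing; T2 stays put
            return (n1, s2), None
        n2, out2 = t2[(s2, out1)]
        return (n1, n2), out2
    return step, (start1, start2)

def run(step, start, symbols):
    """Drive a (possibly composed) transducer over an input sequence."""
    state, outputs = start, []
    for sym in symbols:
        state, out = step(state, sym)
        if out is not None:
            outputs.append(out)
    return outputs

# T1 uppercases letters; T2 relabels the uppercased letters.
t1 = {(0, 'a'): (0, 'A'), (0, 'b'): (0, 'B')}
t2 = {(0, 'A'): (0, 'x'), (0, 'B'): (0, 'y')}
step, start = compose(t1, t2, 0, 0)
result = run(step, start, "aba")   # single traversal through both stages
```

Running `t1` then `t2` separately would traverse the input twice and materialize the intermediate uppercased sequence; the composed machine produces the same output in one pass.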
Extensible Markup Language (XML) processing is one field where many disparate components, and the many resulting data recodings, lead to low performance. XML has begun to work its way into the business computing infrastructure through underlying protocols such as the Simple Object Access Protocol (SOAP) and Web services. In the performance-critical setting of business computing, however, the flexibility of XML becomes a liability because of the potentially significant performance penalty. XML processing is conceptually a multitiered task, an attribute it inherits from the multiple layers of specifications that govern its use: XML itself, XML namespaces, the XML Information Set (Infoset), and XML Schema, followed by transformation (XSLT), query (XQuery), and so on. Traditional XML processor implementations reflect these specification layers directly. Bytes are converted to some known form. Attribute values and end-of-line sequences are normalized. Namespace declarations and prefixes are resolved, and the tokens are then transformed into some representation of the document Infoset. The Infoset is optionally checked for validity against an XML Schema grammar (schema) and rendered to the user through an application programming interface (API) such as the Simple API for XML (SAX) or the Document Object Model (DOM). Finally, higher-level processing is done, such as transformation, query, or other Web services processing.
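A small sketch using Python's standard-library SAX support shows the layering in miniature: the parser handles tokenization and normalization in one pass, a handler builds a simple Infoset-like tree, and a later stage makes a second, separate traversal of that tree, mirroring the component boundaries described above (the tree encoding here is our own simplification, not a standard Infoset representation):

```python
import xml.sax

class TreeBuilder(xml.sax.ContentHandler):
    """Build a simple tree of (name, children, text_chunks) tuples
    from the SAX event stream (first traversal of the data)."""
    def __init__(self):
        super().__init__()
        self.stack = [("document", [], [])]   # synthetic root node
    def startElement(self, name, attrs):
        node = (name, [], [])
        self.stack[-1][1].append(node)
        self.stack.append(node)
    def characters(self, content):
        self.stack[-1][2].append(content)
    def endElement(self, name):
        self.stack.pop()

def parse(xml_bytes):
    handler = TreeBuilder()
    xml.sax.parseString(xml_bytes, handler)
    return handler.stack[0]

def text_of(node):
    """Higher-level stage: a second, separate traversal that
    collects all character data from the tree."""
    name, children, text = node
    return "".join(text) + "".join(text_of(c) for c in children)

doc = parse(b"<order><item>widget</item><qty>2</qty></order>")
```

Each layer makes its own pass over the document's data; a real pipeline would add further passes for validation, transformation, and query.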
With the widespread adoption of SOAP and Web services, XML-based processing, and parsing of XML documents in particular, is becoming a performance-critical aspect of business computing. In such scenarios, XML is usually processed by languages such as XSLT and XQuery. In total, this leaves processing at many independent levels: XML parsing, validation, deserialization, transformation, query, and so on. This division into separate layers of processing fits well with current software engineering practice, which encourages packaging reusable pieces of code into components. To create a complete application, a number of components, often written by different authors or under different circumstances, must be assembled. Enterprise applications typically process data in high volumes, and as such, large quantities of data pass through the components that make up the application. Most components, as part of their normal function, must make at least one traversal through this data. In addition, because of the diversity of their origins, each component often requires data to be packaged in a very specific form, and considerable time is also spent traversing the data set to convert it from one form to another as it passes through the various components.
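The packaging problem can be seen in a deliberately simplified sketch (all component interfaces here are hypothetical): two independently written components each demand their own data form, so the glue code between them adds a whole-data-set conversion traversal of its own:

```python
def parse_records(raw):
    # Component 1: decode raw bytes into a list of record strings
    # (first traversal of the data set).
    return raw.decode().split(";")

def to_dicts(rows):
    # Glue code: repackage the records into the dict form that
    # component 2 expects (an extra traversal spent purely on conversion).
    return [{"name": r} for r in rows]

def count_nonempty(records):
    # Component 2: consumes dicts and performs the actual business
    # logic (third traversal).
    return sum(1 for r in records if r["name"])

total = count_nonempty(to_dicts(parse_records(b"ann;;bob")))
```

Only one of the three traversals does business logic; the other two exist solely because of the component boundaries, which is precisely the overhead a generic deforestation mechanism would remove.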