High-throughput data flow processing is commonly implemented by representing data flow using a directed graph, in which nodes represent computation resources and edges represent data transmission paths among the nodes. In such cases, nodes can be decoupled from each other by using asynchronous data transmission. This decoupling allows each computation node to execute as efficiently as possible since it does not have to wait for downstream nodes to complete processing before it can begin processing the next message. In some cases, multiple computation nodes can be executed in parallel and together act as a single computation node, thus processing many units of work simultaneously.
A Staged Event Driven Architecture (SEDA) enhances this approach by inserting bounded queues between computation nodes. When a node A attempts to transfer work to another node B and the queue between A and B is full, A blocks until B has consumed some work from the queue. While A is blocked, it cannot consume new work, so A's own input queue fills in turn and blocks A's predecessors. In this way, back-pressure from a slow node propagates upstream through the graph. One example of a process that utilizes such a technique is search engine document ingestion, in which multiple forms of documents (emails, PDFs, multimedia, blog postings, etc.) all need to be processed and indexed by a search engine for subsequent retrieval.
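The blocking behavior of a bounded queue between two nodes can be illustrated with a minimal sketch. It assumes Python's standard `queue.Queue` as the bounded queue and a single thread for each node; the function name `run_pipeline` and its parameters are hypothetical and chosen only for illustration, not taken from any particular SEDA implementation.

```python
import queue
import threading
import time


def run_pipeline(num_items=20, capacity=2, consumer_delay=0.01):
    """Run node A -> bounded queue -> node B; return (consumed, max_depth)."""
    q = queue.Queue(maxsize=capacity)  # bounded queue between A and B
    consumed = []
    sentinel = object()  # marks the end of the stream

    def node_b():
        # Node B: slow consumer that drains the queue one item at a time.
        while True:
            item = q.get()  # blocks while the queue is empty
            if item is sentinel:
                break
            time.sleep(consumer_delay)  # simulated processing cost
            consumed.append(item)

    worker = threading.Thread(target=node_b)
    worker.start()

    # Node A: fast producer. put() blocks whenever the queue is full,
    # so A can never run more than `capacity` items ahead of B.
    max_depth = 0
    for i in range(num_items):
        q.put(i)
        max_depth = max(max_depth, q.qsize())

    q.put(sentinel)
    worker.join()
    return consumed, max_depth


if __name__ == "__main__":
    items, depth = run_pipeline()
    print(len(items), "items consumed; observed queue depth never exceeded", depth)
```

Because the producer is stalled by `put()` rather than by an explicit check, chaining several such stages would carry the stall upstream in the same way: each blocked node stops draining its own input queue.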
A scalable system that can process large amounts of data can be provided by using such asynchronous directed graph models. In some applications, documents may need to be processed in order. However, a system based on an asynchronous directed graph model generally cannot guarantee that documents are processed in order. One prior solution to this problem, described in U.S. Patent Publication 2010/0005147, is a system in which all messages are processed in order.
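The loss of ordering can be illustrated with a small deterministic simulation. The sketch assumes a single computation node backed by several parallel workers, with each document handed to whichever worker becomes free first; the function `completion_order` and the processing durations are hypothetical values invented for the example.

```python
import heapq


def completion_order(durations, workers=2):
    """Return document indices in the order in which they finish processing.

    durations[i] is the simulated processing time of document i; documents
    are dispatched in arrival order to the earliest available worker.
    """
    # Heap of (time at which the worker becomes free, worker id).
    free_at = [(0, w) for w in range(workers)]
    heapq.heapify(free_at)

    finished = []
    for doc_id, duration in enumerate(durations):
        start, worker = heapq.heappop(free_at)  # earliest free worker
        end = start + duration
        finished.append((end, doc_id))
        heapq.heappush(free_at, (end, worker))

    # Sort by completion time to obtain the order seen downstream.
    return [doc_id for _, doc_id in sorted(finished)]


if __name__ == "__main__":
    # Document 0 is slow (e.g. a large PDF); documents 1 and 2 are quick.
    print(completion_order([5, 1, 1], workers=2))
    print(completion_order([5, 1, 1], workers=1))
```

With two workers the slow first document is overtaken by the two quick ones that arrived after it, whereas a single worker preserves arrival order at the cost of parallelism.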