A common approach to modeling high-throughput data flow processing is to represent the data flow as a directed graph, in which nodes represent computation resources and edges represent data transmission paths between the nodes. In such cases, nodes can be decoupled from each other by using asynchronous data transmission. This decoupling allows each computation node to execute as efficiently as possible, since it does not have to wait for downstream nodes to complete processing before it can begin processing the next message. In some cases, multiple computation nodes can be executed in parallel and together act as a “single” computation node, thus processing many units of work simultaneously.
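The graph model above can be illustrated with a minimal Python sketch, assuming threads stand in for computation nodes and in-memory queues stand in for the asynchronous edges. The names run_node, run_pipeline, and _STOP, and the two toy stage functions, are hypothetical and chosen only for illustration. The first stage runs three workers that share one input queue and so behave as a single logical node.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the stream (illustrative)


def run_node(func, inbox, outbox, workers=1):
    """Apply func to every item from inbox and put the result on outbox.

    With workers > 1, several threads share one inbox and together act
    as a single logical node.
    """
    def loop():
        while True:
            item = inbox.get()
            if item is _STOP:
                inbox.put(_STOP)  # let sibling workers see the sentinel too
                break
            outbox.put(func(item))

    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for t in threads:
        t.start()

    def finish():
        for t in threads:
            t.join()
        outbox.put(_STOP)  # propagate end-of-stream to the next node

    closer = threading.Thread(target=finish)
    closer.start()
    return closer


def run_pipeline(items):
    # Edges of the graph: source -> q1 -> node A -> q2 -> node B -> q3
    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    node_a = run_node(lambda x: x * 2, q1, q2, workers=3)  # parallel node
    node_b = run_node(lambda x: x + 1, q2, q3, workers=1)

    for item in items:
        q1.put(item)  # returns immediately; the source never waits on A or B
    q1.put(_STOP)

    node_a.join()
    node_b.join()

    results = []
    while True:
        out = q3.get()
        if out is _STOP:
            break
        results.append(out)
    return results
```

Because the three workers of the first node run concurrently, results may arrive in a different order than the inputs; a caller that needs ordering would have to sort or tag the items. The queues here are unbounded, so a fast upstream node can accumulate an arbitrarily large backlog in front of a slow downstream node.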
A Staged Event-Driven Architecture (SEDA) enhances this approach by inserting bounded queues between computation nodes. When a node A attempts to transfer work to another node B and the queue between A and B is full, A blocks until B has consumed some work from the queue. While blocked, A cannot consume new work, which in turn causes its own input queue to fill, blocking any predecessors. One example of a process that utilizes such a technique is search engine document ingestion, in which documents in multiple forms (emails, PDFs, multimedia, blog postings, etc.) all need to be processed and indexed by a search engine for subsequent retrieval.
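The blocking behavior of a bounded queue between nodes A and B can be sketched as follows, assuming Python's standard queue.Queue with a maxsize as the bounded queue. The function backpressure_demo and its parameters are hypothetical and serve only to make the stall observable; this is not presented as how any particular SEDA implementation is built.

```python
import queue
import threading


def backpressure_demo(capacity=2, total=5):
    link = queue.Queue(maxsize=capacity)  # bounded queue between A and B

    def node_a():
        for i in range(total):
            link.put(i)  # blocks while the queue is full

    a = threading.Thread(target=node_a)
    a.start()

    # Node B has not consumed anything yet, so A fills the queue and stalls.
    a.join(timeout=0.2)
    a_was_blocked = a.is_alive()
    backlog = link.qsize()

    # Node B now drains the queue; each get() frees a slot and unblocks A.
    received = [link.get() for _ in range(total)]
    a.join()
    return a_was_blocked, backlog, received
```

With the default arguments, node A can enqueue only two items before it blocks, and it resumes only as node B removes items. In a longer chain the same effect repeats upstream: a stalled A stops draining its own input queue, which fills and stalls A's predecessor in turn.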