Businesses accumulate and archive millions of electronic items. Countless emails are sent and received daily. Workers routinely generate new documents. Paper documents are scanned for digital storage. Many pictures and flat files are converted into digital text by optical character recognition. All this activity produces electronic data that is highly unstructured. A directory may contain millions of office documents. An Exchange database file may contain millions of email messages, some of which contain attachments such as zip files or office documents. These containers can also nest: a zip file can contain office documents, an email message can contain attachments, and an Outlook PST file can contain email that itself might contain a PST file.
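The nesting described above can be pictured as a tree whose size is unknown until it is walked. The following is a minimal sketch under stated assumptions: the `Item` class, the helper functions, and every file name are hypothetical and chosen only for illustration; they do not correspond to any real archive format or library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Hypothetical electronic item; children are items embedded inside it."""
    name: str
    children: List["Item"] = field(default_factory=list)


def count_items(item: Item) -> int:
    """Count the item itself plus everything nested inside it."""
    return 1 + sum(count_items(child) for child in item.children)


def depth(item: Item) -> int:
    """Number of levels from this item down to its deepest descendant."""
    return 1 + max((depth(child) for child in item.children), default=0)


# Illustrative data only: a PST holding two messages, one with a zip
# attachment and one with a further PST embedded in it.
pst = Item("mailbox.pst", [
    Item("message-1.eml", [
        Item("reports.zip", [Item("q1.docx"), Item("q2.docx")]),
    ]),
    Item("message-2.eml", [
        Item("nested.pst", [Item("message-3.eml")]),
    ]),
])

print(count_items(pst), depth(pst))
```

From the outside the container looks like a single file. Only a full traversal shows how many items it holds and how deeply they are nested.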
Some businesses attempt to store all of this in a storage system or archive. Unfortunately, archiving systems introduce added levels of complexity. Some archiving systems break up stored electronic items into components and store those components in separate databases, files, or disks. For example, an email archive may store body text, headers, and attachments separately. Such a storage structure hides the size and extent of the electronic items that would satisfy any given search criteria until the archive is properly indexed.
Some systems for indexing archives use multiple processors. For example, U.S. Pub. 2008/0030764 to Zhu describes a system in which a primary processor divides a job into work items for secondary processors. Unfortunately, because the internal structure of the archive is not known a priori, any given work item may turn out to be trivially small or unmanageably large. As a result, an entire system can stand idle for days while waiting for a single processor to slog through one mailbox.
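The imbalance described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the scheme of the cited publication: the round-robin assignment, the function names, and the work-item sizes are all hypothetical and invented for illustration. The sizes are arbitrary cost units, not measurements.

```python
from typing import List


def static_schedule(sizes: List[int], workers: int) -> List[int]:
    """Assign work items round-robin up front, without knowing their sizes.

    Returns the total load placed on each worker.
    """
    loads = [0] * workers
    for index, size in enumerate(sizes):
        loads[index % workers] += size
    return loads


def idle_fraction(loads: List[int]) -> float:
    """Fraction of total worker time spent idle until the slowest finishes."""
    makespan = max(loads)
    if makespan == 0:
        return 0.0
    return 1.0 - sum(loads) / (makespan * len(loads))


# Illustrative sizes only: seven small work items and one very large one,
# standing in for a single oversized mailbox.
sizes = [5, 3, 4, 1000, 2, 6, 4, 3]
loads = static_schedule(sizes, workers=4)

print(loads)
print(round(idle_fraction(loads), 2))
```

With these invented numbers, three of the four workers finish almost immediately and then wait on the fourth, so most of the available processor time goes unused.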