Businesses use cloud computing services such as Microsoft's Office 365, Exchange Online and SharePoint Online, Google's Gmail and Google Drive, or others to host and store billions of electronic items. Countless emails are sent and received daily. Workers routinely generate new documents and store them on cloud storage systems. Paper documents are scanned and sent by email. Many pictures and flat files are converted into digital text by optical character recognition. All this activity produces electronic data that is highly unstructured.
Cloud computing services offer the ability to create and keep that data in a storage system that is distributed across any number of storage servers and data centers. Any business's data may be arbitrarily complex. An Exchange database file may contain millions of email messages, some of which contain attachments such as zip files or Office documents. A zip file can contain Office documents; an email message can contain attachments, which may themselves contain email messages that might even contain a PST file. Cloud computing services can distribute any and all of that material across numerous physical computers in a number of different data centers. Such a storage structure hides the size and extent of the data.
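The nesting described above can be illustrated with a minimal sketch that assumes zip archives are the only container type. The names `walk_container`, `build_zip` and `max_depth` are hypothetical and are not taken from any product or from this disclosure. Email and PST containers would need their own parsers, and Office documents in the newer XML-based formats are themselves zip archives, so this sketch would descend into them as well.

```python
import io
import zipfile


def walk_container(data: bytes, path: str = "", max_depth: int = 10):
    """Yield (path, size_in_bytes) for each leaf item, descending into nested zips.

    max_depth is an illustrative guard against pathologically deep nesting.
    """
    if max_depth < 0:
        raise ValueError(f"nesting too deep at {path!r}")
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            item_path = f"{path}/{info.filename}" if path else info.filename
            payload = archive.read(info)
            if zipfile.is_zipfile(io.BytesIO(payload)):
                # The member is itself a container: recurse into it.
                yield from walk_container(payload, item_path, max_depth - 1)
            else:
                yield item_path, len(payload)


def build_zip(members: dict) -> bytes:
    """Helper for the example: build an in-memory zip from {name: bytes}."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

For instance, wrapping `build_zip({"memo.txt": b"quarterly figures"})` inside a second archive as `attachments/inner.zip` and passing the outer bytes to `walk_container` yields the inner memo under the path `attachments/inner.zip/memo.txt`. The total item count is known only after every container has been opened.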
Existing approaches to indexing cloud-based data for eDiscovery often require copying all of it from the cloud, through a firewall, to a local “terrestrial” storage system for processing or indexing. This requires maintaining a physical non-cloud infrastructure (whether owned by the corporation or its litigation service provider) and causes excessive delays in performing electronic discovery activities. The data is downloaded from the cloud because efficient processing requires the processing computers to reside close to the data. When the data and the processing computers are remote from each other, processing speed is limited by the capacity of the network connections, which slows processing to a crawl, introduces processing errors and ultimately results in processing failure. Some eDiscovery service providers purport to be cloud-based service providers. However, those services are limited to legal hold, document review and rudimentary searching. To achieve the level of detail required to satisfy regulatory requests or litigation discovery, the data must still be downloaded and processed (e.g., indexed). For example, cloud-based simple search capabilities will typically ignore documents that are not text-searchable, such as PDF and TIFF files, as well as password-protected documents, corrupted items, attachments and embedded files, and simple or complex zip files, any of which may contain one or more such items, recursively. They also provide limited or no search capability over images, which companies need when looking for pornography, illegal images, and intellectual property that has been rendered in image form. The downloaded documents are processed: optical character recognition (OCR) is performed, passwords are cracked, and the documents are searched and analyzed. A specific subset of documents may then be loaded onto a review platform, which may require uploading to the same or a different cloud. However, the complex processing and structuring work is typically not completed in the cloud.
Such approaches result in excessive delays. One reason these analyses are slow is that all data must pass through the business's account with the cloud services provider, and the cloud services provider will typically limit the throughput available per access point, a practice also called “throttling”. Throughput is further limited by bandwidth restrictions at the end user's location.
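The combined effect of throttling and local bandwidth can be approximated with a back-of-the-envelope calculation. The sketch below assumes a single access point, decimal units (1 GB = 8,000 megabits) and no protocol overhead. The function name `transfer_hours` and the figures in the usage note are hypothetical and are not taken from any provider's published limits.

```python
def transfer_hours(total_gb: float, throttle_mbps: float, site_mbps: float) -> float:
    """Estimate the hours needed to download total_gb of data.

    The effective rate is the slower of the provider's per-access-point
    throttle and the bandwidth available at the end user's location.
    """
    if throttle_mbps <= 0 or site_mbps <= 0:
        raise ValueError("rates must be positive")
    effective_mbps = min(throttle_mbps, site_mbps)
    total_megabits = total_gb * 8_000  # decimal units: 1 GB = 8,000 megabits
    return total_megabits / effective_mbps / 3600
```

Under these assumptions, downloading a hypothetical 10,000 GB data set through an access point throttled to 100 Mbps takes roughly 222 hours, even if the end user's connection is ten times faster. If the end user's connection offers only 50 Mbps, the same download takes roughly 444 hours.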