The field of data forensics is concerned with the analysis of data obtained from third-party sources. For instance, a law enforcement agency may take possession of a computer hard drive and desire to know the contents. An intelligence agency may collect large amounts of data from one or more electronic sources and may need the information reduced to a searchable format. More prosaically, a company may store large archives of data which are not typically accessed, or a computer hard drive may be damaged in some way, resulting in a need to recover the stored data.
In computer forensics, investigators may need to consider every byte of a set of electronic data or “evidence file”, such as a copy of a hard drive (known in the art as a “disk image”) or other digital storage media, and subject this data to intensive processing, such as looking for known contraband, highlighting images of child pornography and counterfeit currency, recovering internet web surfing history, searching for keywords related to the investigation, and other processes. Common, therefore, among applications of data forensics is the desire to turn a relatively large amount of potentially disjointed and fragmented electronic data into a useful format for subsequent analysis.
Traditionally, data forensics has proceeded by using a processor to step sequentially through the data under analysis. Inherent in the need to perform a forensic analysis in the first place is the reality that the data is not understood prior to conducting the forensic analysis. In ordinary use, by contrast, a computer hard drive with fragmented data may still be entirely useful, since the locations of the various fragments of the files are known, as is other information about the files known in the art as “metadata”, including file and folder names, associated timestamps, and other fields related to their use. However, when a forensic analysis is needed, many investigative processes may either desire or require the examination of all data sectors of the storage media, regardless of whether they are associated with extant files.
Moreover, the analysis processes are typically not combined and therefore necessitate several passes over the data. However, even if they were combined so that only a single read of the data were necessary, the amount of computation required usually results in total throughput being considerably less than the sustained transfer rate of the disk. For example, although a disk could be read at one hundred (100) megabytes per second, it is quite possible that a keyword search could achieve only one (1) megabyte per second of throughput. Searching a one (1) terabyte disk at that rate would require roughly twelve (12) days of processing time.
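The arithmetic behind the foregoing example can be checked with a short calculation. The following Python sketch is illustrative only: the function name scan_days is hypothetical, and the result depends on whether binary or decimal units are assumed, which the example does not specify.

```python
SECONDS_PER_DAY = 86_400


def scan_days(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Return the days needed to process capacity_bytes at a sustained rate."""
    return capacity_bytes / throughput_bytes_per_s / SECONDS_PER_DAY


# Assumption: binary units (1 TiB searched at 1 MiB per second).
binary_days = scan_days(2**40, 2**20)    # about 12.1 days

# Assumption: decimal units (1 TB searched at 1 MB per second).
decimal_days = scan_days(10**12, 10**6)  # about 11.6 days

# For comparison, merely reading the same disk at 100 MB per second.
raw_read_hours = scan_days(10**12, 100 * 10**6) * 24  # about 2.8 hours
```

Under either unit convention, the keyword search takes on the order of twelve days, whereas simply reading the disk takes under three hours, so the computation rather than the disk is the limiting factor.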
It is known in various forms of data computation and analysis, such as searching algorithms, to utilize parallel processing to reduce the total analysis time. In such embodiments, the subject data to be analyzed is broken up into parts and distributed to multiple processors. Each processor, acting independently, then analyzes its part and provides a report. Certain applications, such as those related to Internet searching functions, utilize many thousands of processors in parallel to analyze data.
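By way of illustration only, the general split, distribute, and report pattern described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a description of any particular system: the names find_hits and parallel_search, the chunk_size parameter, and the choice of the standard concurrent.futures module are all hypothetical. The sketch also assumes the data fits in memory and overlaps adjacent parts by one byte less than the keyword length so that a keyword straddling a boundary is not missed.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from itertools import repeat
from typing import List, Type


def find_hits(chunk: bytes, keyword: bytes, base: int) -> List[int]:
    """Report the absolute offset of every keyword occurrence in one part."""
    hits = []
    pos = chunk.find(keyword)
    while pos != -1:
        hits.append(base + pos)
        pos = chunk.find(keyword, pos + 1)  # step by one to catch overlaps
    return hits


def parallel_search(
    data: bytes,
    keyword: bytes,
    chunk_size: int,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> List[int]:
    """Split data into parts, search each part independently, merge reports."""
    if not keyword or chunk_size < 1:
        raise ValueError("keyword must be non-empty and chunk_size positive")
    overlap = len(keyword) - 1
    bases = list(range(0, len(data), chunk_size))
    # Each part extends 'overlap' bytes into the next one; because the overlap
    # is shorter than the keyword, no occurrence is ever reported twice.
    chunks = [data[b : b + chunk_size + overlap] for b in bases]
    with executor_cls() as pool:
        reports = pool.map(find_hits, chunks, repeat(keyword), bases)
        return [offset for report in reports for offset in report]
```

When the default process pool is used from a script, the call to parallel_search would ordinarily be placed under an `if __name__ == "__main__":` guard; passing ThreadPoolExecutor as executor_cls avoids that requirement at the cost of true processor-level parallelism.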
However, the field of data forensics has long proven unable to incorporate parallel processing in the way that Internet searching has been able to do. While forensics requires a high degree of precision in order to fully analyze the relevant data, Internet searching, for instance, does not require a highly precise understanding of the entire Internet; users of an Internet search engine will not be massively inconvenienced if the occasional website is missed in an analysis of the Internet. Furthermore, the files which are the subject of a forensic analysis may be several gigabytes or more, while a typical website may be a few megabytes or less. Finally, the data to be analyzed forensically may vary widely in format and kind and may be fragmented or corrupted, unlike the data handled by traditional data processing applications, which tend to consider structured or semi-structured records in limited formats, such as HTML files on websites, financial transactions in a database, and the like.