§1.1 Field of the Invention
The present invention concerns file recovery. In particular, the present invention concerns facilitating the recovery of fragmented files.
§1.2 Background Information
File fragmentation is often an unintended consequence of the deletion, modification, and creation of files on a storage device. Fragmentation could also be the result of a deliberate act by someone to conceal critical electronic evidence. Therefore, a forensic analyst investigating storage devices may come across many scattered fragments without any easy means of reconstructing the original files. In addition, the analyst may not easily be able to determine whether a fragment belongs to a specific file, or whether the contents of the fragment come from a particular type of file, such as an image, video, or plain-text file.
Digital evidence by nature is easily scattered, and a forensic analyst may come across scattered evidence in a variety of situations. This is especially true with the FAT16 and FAT32 file systems, which, due to the popularity of the Windows operating system from Microsoft, are perhaps still the most widely used file systems on personal computers. Furthermore, due to the ubiquitous presence of Windows and easier implementation considerations, the FAT file systems have been adopted in many consumer storage media devices, such as compact flash cards used in digital cameras and USB mini-storage devices. The FAT file system, however, is not very efficient in maintaining continuity of data blocks on the disk. Due to fragmentation, the data blocks of a stored file could be scattered across the disk. Without adequate file table information, it is difficult to put the fragments back together in their original order.
Often, critical file table information is lost because it is overwritten with new entries. In fact, the most widely used disk forensics tools, like TCT (See, e.g., The Coroner's Toolkit (TCT) at http://www.porcupine.org/forensics/tct.html.), the dd utility, The Sleuth Kit (See, e.g., The Sleuth Kit at http://www.sleuthkit.org/.), and Encase (See, e.g., Guidance Software Inc., Encase at http://www.encase.com/.), can recover data blocks from deleted files automatically. However, when the data blocks are not contiguous, these tools cannot reassemble the blocks in the correct order to reproduce the original file without proper file table entries. Reassembling these fragments is usually a tedious manual job carried out by a forensic analyst.
Another situation in which a forensic analyst comes across scattered evidence is the swap file. The system swap file is one of the critical areas from which a lot of useful forensic information can be gathered. The swap file contains critical information about the latest events that occurred on a computer. Therefore, reconstructing the contents of the swap file is vital from a forensic standpoint. In order to achieve better performance, operating systems maintain swap file state and addressing information in page-tables stored only in volatile memory. When computers are secured for evidential purposes, they are simply unplugged and sent to a forensic lab. Unfortunately, the contents of volatile memory are usually lost beyond recovery during evidence collection. Without the addressing information from the page-table, it is difficult to rebuild the contents of a swap file. Again, a forensic analyst is left with a collection of randomly scattered pages of memory.
One of the most popular and naive approaches to hiding evidence is to store it in slack space in the file system. Files are assigned a certain number of disk blocks for storage. However, not all files fit exactly into the allocated blocks. In most cases, files end up using only a portion of their last block. The unused space in this last block is known as slack space. Modifying the contents of slack space does not affect the integrity of data stored in the file system because the read operation does not read data in slack space. A criminal can modify a file hiding program to choose the blocks on which files are hidden based on a sequence of numbers generated using a password. Knowing the password, he can reconstruct the original document, whereas a forensic analyst is left with randomly mixed fragments of a document, which will need to be reassembled.
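By way of illustration only, the definition of slack space given above can be expressed as a short calculation. The following Python sketch is not part of the referenced tools. The function name slack_bytes and the 512-byte and 4096-byte block sizes are assumptions chosen for the example.

```python
def slack_bytes(file_size: int, block_size: int) -> int:
    """Return the number of unused bytes in the last block allocated to a file.

    Assumes the file occupies whole blocks and that only the last block can
    be partially used, as described in the text.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    # Bytes needed to round file_size up to the next multiple of block_size.
    return (-file_size) % block_size


# A 1000-byte file stored in (assumed) 512-byte blocks occupies two blocks
# and leaves 24 bytes of slack in the second block.
example_slack = slack_bytes(1000, 512)
```

A file whose size is an exact multiple of the block size leaves no slack space under this definition.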
Finally, ubiquitous networking and growing adoption of peer-to-peer systems give anyone easy access to computers around the world. There are many peer-to-peer systems which enable users to store data on a network of computers for easy, reliable access anytime, anywhere. Freenet (See, e.g., Freenet at http://freenetproject.org/), Gnutella (See, e.g., Gnutella at http://gnutella.wego.com/.) and M-o-o-t (See, e.g., M o-o t at http://www.m-o-o-t.org/.) are some of the better known systems used by millions of users around the world and many others, such as OceanStore (See, e.g., J. Kubiatowicz and D. Bindel, “Oceanstore: An architecture for global-scale persistent storage,” Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (2000).), Chord (See, e.g., I. Stoica and R. Morris, “Chord: A scalable peer-to-peer lookup service for internet applications,” ACM SIGCOMM 2001, pp. 149-160 (2001).) and Pastry (See, e.g., A. Rowstron and P. Druschel, “Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems,” IFIP/ACM International Conference on Distributed Systems Platforms, pp. 329-350 (2001).), are in development at research laboratories. These systems are designed to provide reliable, distributed, and sometimes anonymous storage networks. A criminal can use these very systems to hide software tools and documents that might be useful for his prosecution, just as easily as any other user can save a file. Most peer-to-peer systems associate a unique key, either assigned by the user or generated automatically, with each document they store. Hence, a person can split a document into fragments and store each fragment in a peer-to-peer system using a sequence of secret phrases as keys, such that he can easily splice the fragments together knowing the proper sequence of secret phrases. For instance, in Freenet one can assign each fragment a unique URL. 
Because URLs are user-friendly keywords, it is easy to recall the proper sequence to retrieve and splice the fragments together. It is, however, difficult to reconstruct the original document without knowledge of the proper sequence, even if the keywords are known.
As can be appreciated from the foregoing, digital evidence can easily take a variety of forms and be scattered into hundreds of fragments making reassembly a daunting task for a human analyst.
The problem of reassembly of file fragments differs, in many ways, from the reassembly of fragments of physical objects, like shards of pottery or jigsaw puzzles. First, file fragments do not have a set shape, as they are simply consecutive bytes of a file stored on a disk. Therefore, known shape matching techniques are not very useful for reconstructing fragmented files. For example, FIGS. 2A-2C show three (3) potential reassembly sequences of an image consisting of five (5) eight-unit fragments. Each fragment is the same size in this example. As shown, the shape of the fragment depends on where it is used for reconstruction. Further, a fragment in the reassembly of physical object fragments and jigsaw puzzles may potentially link to multiple fragments, while file fragments will typically be linked to at most two other fragments (one above and one below, or one before and one after). Furthermore, since fragments of physical objects will often have large edges, edge information can be used for reassembly. File fragments, on the other hand, will often have relatively small edges. Consequently, the edges provide less information for reassembly.
The present inventors have presented techniques for reassembling fragmented documents, such as images. (See, e.g., K. Shanmugasundaram and N. Memon, “Automatic Reassembly of Document Fragments via Data Compression,” 2nd Digital Forensics Research Workshop, Syracuse (July 2002); A. Pal, K. Shanmugasundaram and N. Memon, “Automated Reassembly of Fragmented Images,” ICASSP, (2003); and K. Shanmugasundaram and N. Memon, “Automatic Reassembly of Document Fragments via Context Based Statistical Models,” Annual Computer Security Applications Conference, (2003). Each of these papers is incorporated herein by reference.) In particular, the present inventors have described a greedy heuristic that, starting with the header fragment, reconstructs each file one fragment at a time. More specifically, the header fragment is stored as the first fragment in a reconstruction path P of the file and is then set as the current fragment “s”. After a fragment s is selected, its best successor match “t” is chosen. The best match is the candidate with the best weight, as determined by a weight calculation technique or any other metric. The fragment t is then added to the reconstruction path P and becomes the new current fragment. This process is repeated until the file is reconstructed. The pseudo code for the greedy heuristic is:
Greedy(currentFragment, availableFragments[]) {
  for (x = 1; x < availableFragments.size; ++x) {
    bestMatch = get best <x> fragment for currentFragment
    if (bestMatch found in availableFragments[])
      return bestMatch;
  }
}
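By way of illustration only, the greedy heuristic above might be sketched in Python as follows. This is not the inventors' implementation. The names greedy_next, greedy_path and toy_weight, the weight table, and the use of a known fragment count as the stopping condition are all assumptions made for the example. Taking the highest-weight fragment among those still available is assumed to be equivalent to stepping through the ranked candidates until an available one is found.

```python
from typing import Callable, Hashable, List, Optional

Fragment = Hashable
WeightFn = Callable[[Fragment, Fragment], float]


def greedy_next(current: Fragment, available: List[Fragment],
                weight: WeightFn) -> Optional[Fragment]:
    """Return the available fragment that best follows `current`, or None."""
    best, best_w = None, float("-inf")
    for cand in available:
        w = weight(current, cand)
        if w > best_w:  # strict comparison: the first candidate wins ties
            best, best_w = cand, w
    return best


def greedy_path(header: Fragment, fragments: List[Fragment],
                weight: WeightFn, length: int) -> List[Fragment]:
    """Build a reconstruction path P starting from the header fragment.

    `length` (the number of fragments in the file) is an assumed stopping
    condition; the text only says the process repeats until the file is
    reconstructed.
    """
    available = [f for f in fragments if f != header]
    path = [header]
    while len(path) < length and available:
        t = greedy_next(path[-1], available, weight)
        if t is None:
            break
        path.append(t)       # P = P || t
        available.remove(t)  # t becomes the current fragment on the next pass
    return path


# Hypothetical candidate weights: higher means "t more likely follows s".
W = {("h", "a"): 0.9, ("a", "b"): 0.8, ("b", "c"): 0.7}


def toy_weight(s: Fragment, t: Fragment) -> float:
    return W.get((s, t), 0.1)


path = greedy_path("h", ["h", "a", "b", "c"], toy_weight, 4)
```

With these assumed weights, the path follows h, a, b, c, because each step picks the single high-weight successor over the 0.1 default.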
In particular, the inventors have described a technique called Greedy Unique Path (“UP”), which is a sequential algorithm using the greedy heuristic. During reassembly using Greedy UP, once a fragment is assigned to a file reconstruction, it is unavailable for use in the reconstruction of any other file. Let Pi be the reconstruction path of file i, and let the header fragment for file i be identified as hi. To start, the header hi is chosen as the first fragment in the reconstruction path (i.e., assign Pi=hi). The current fragment is set equal to the header (s=hi), and then the best available greedy match t for the current fragment s is found. The best available match is the best match t for s that has not been used in any other file reassembly. The fragment t is placed in the reconstruction path (Pi=Pi∥t) and is then set as the current fragment for processing (s=t). The best match for the new current fragment is found, and the process is repeated until the file is reconstructed. Subsequent files are then reconstructed in the same manner until all k files have been reassembled.
Although the Greedy UP fragmented file reconstruction technique creates vertex disjoint paths, the paths might depend on the order in which the files are processed. That is, changing the order in which the files are processed may produce different reassembly results. For example, referring to FIG. 6 of the '370 provisional, with Greedy UP, both the image of the dog and the image of the plane will reconstruct perfectly if the dog is reconstructed first and then the plane (FIG. 6(b)). If, however, the plane is reconstructed first, it will reconstruct incorrectly, thus causing the dog to reconstruct incorrectly as well (FIG. 6(c)). This is because some of the fragments of the dog will be assigned to the plane, and those fragments will then be unavailable when the dog is reconstructed.
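By way of illustration only, the order dependence of Greedy UP described above can be reproduced with a small Python sketch. This is not the inventors' implementation, and the data are not those of FIG. 6. The function name greedy_up, the fragment labels, the weight values, and the known per-file fragment counts are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def greedy_up(headers: List[str], lengths: Dict[str, int], pool: List[str],
              weight: Callable[[str, str], float]) -> Dict[str, List[str]]:
    """Reassemble files one after another; each fragment is used at most once.

    `lengths` maps each header to an assumed, known fragment count, which
    serves as the stopping condition for that file.
    """
    available = set(pool) - set(headers)
    paths: Dict[str, List[str]] = {}
    for h in headers:          # files are processed in the given order
        path = [h]             # P_i = h_i
        while len(path) < lengths[h] and available:
            s = path[-1]
            # Best match not yet used by any file; sorting makes ties
            # deterministic (the first label in sorted order wins).
            t = max(sorted(available), key=lambda f: weight(s, f))
            path.append(t)     # P_i = P_i || t
            available.remove(t)
        paths[h] = path
    return paths


# Hypothetical weights: the plane header P0 matches dog fragment D1 (0.8)
# better than its true successor P1 (0.6).
W = {("D0", "D1"): 0.9, ("D1", "D2"): 0.9,
     ("P0", "P1"): 0.6, ("P1", "P2"): 0.9,
     ("P0", "D1"): 0.8}


def toy_weight(s: str, t: str) -> float:
    return W.get((s, t), 0.0)


pool = ["D0", "D1", "D2", "P0", "P1", "P2"]
lengths = {"D0": 3, "P0": 3}

dog_first = greedy_up(["D0", "P0"], lengths, pool, toy_weight)
plane_first = greedy_up(["P0", "D0"], lengths, pool, toy_weight)
```

With these assumed weights, processing the dog first yields D0, D1, D2 and P0, P1, P2. Processing the plane first assigns D1 and D2 to the plane, so the dog is left with only P1 and P2, and both files reconstruct incorrectly.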
As can be appreciated from the foregoing, it would be useful to facilitate the reconstruction of fragmented files, either in a totally automated fashion, or in a semi-automated fashion using user feedback. It would be useful if such techniques would improve upon the greedy UP technique.