1. Field of the Invention
The present invention relates generally to computer software applications for data management. Specifically, it relates to systems and methods of digital data identification and the storage, management, and processing of digital evidence in computer systems.
2. Introduction
An increasing number of criminal and terrorist acts, and the preparations leading to such acts, leave behind evidence in digital formats, sometimes referred to as a “digital fingerprint”. The field of collecting and analyzing these types of data is called digital data identification. These digital formats vary widely and include typical computer files, digital videos, e-mail, instant messages, phone records, and so on. They are routinely gathered from seized hard drives, “crawled” Internet data, mobile digital devices, digital cameras, and numerous other digital sources that are growing steadily in sophistication and capacity. When identified accurately and in a timely manner by law enforcement agencies, digital evidence can provide the invaluable proof that clinches a case.
The United States Federal Bureau of Investigation (FBI) has indicated that digital evidence has spread from a few types of investigations, such as hacking and child pornography, to virtually every investigative classification, including fraud, extortion, homicide, identity theft, and so on.
The amount of evidence that exists in digital form is growing rapidly. This growth is demonstrated by the following information, which was presented by the FBI at the 14th INTERPOL Forensic Science Symposium. The Computer Analysis Response Team (CART) is the FBI's computer forensic unit and is primarily responsible for conducting forensic examinations of all types of digital hardware and media. According to FBI CART, the number of FBI cases tripled from 1999 to 2003. This is the result of the increased presence of digital devices at crime scenes combined with a heightened awareness of digital evidence among investigators.
While the number of cases increased threefold from 1999 to 2003, the volume of data increased by forty-six times during the same period. Given the declining prices of digital storage media and the corresponding increases in sales of storage devices, the volume of digital information that investigators must deal with is likely to continue its meteoric increase.
This tremendous increase in data presents a number of problems for law enforcement. Traditionally, law enforcement seizes all storage media, creates a drive image or duplicate of each, and then examines the data on the drive image or duplicate copy in order to preserve the original evidence. A “drive image” is an exact replica of the contents of a storage device, such as a hard disk, stored on a second storage device, such as a network server or another hard disk. One of the first steps in the examination process is to recover latent data such as deleted files, hidden data, and fragments from unallocated file space. This process is called data recovery and requires processing every byte of any given piece of media. If this methodology continues, the growing number and size of pieces of digital media will push budgets, processing capability, and physical storage space to their limits. Compounding these problems are legal requirements, for example, to provide a defendant in a criminal trial with a copy of the data and to retain the data for the length of the defendant's sentence.
The delay in identifying suspect data occasionally results in the dismissal of criminal cases in which the evidence is not produced in time for prosecution. Present solutions are efficient for data recovery, but still require manual review by examiners to identify the specific data needed to prove guilt or innocence. None of the solutions available today provides technologies or methodologies for identifying conclusive digital evidence automatically. Conclusive digital evidence is any digital evidence that can automatically either prove guilt, e.g., images of known child pornography, or indicate probable guilt, e.g., images of currency plates, driver's licenses, or terrorist training camps that require authentication and/or further review to determine criminal activity. To reduce the volume of digital files for review, seized digital evidence is processed by methods that forensic examiners call “data reduction”.
A method currently used for data reduction involves performing a hash analysis against digital evidence. A cryptographic one-way hash (or “hash” for short) is essentially a digital fingerprint: a very large number that, for practical purposes, uniquely identifies the content of a digital file. Because a hash is determined solely by the contents of a file, two files with different names but exactly the same contents will produce the same hash.
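The property described above, that a hash depends only on file contents and not on the file name, can be illustrated with a minimal sketch. The choice of SHA-256, the helper names `file_hash` and `demo_same_contents`, and the sample file names are assumptions made for illustration only; the text does not specify a particular hash algorithm.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file's contents.

    Only the bytes of the file are hashed; the file name plays no part.
    SHA-256 is an assumed algorithm chosen for this illustration.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large evidence files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo_same_contents() -> bool:
    """Write identical bytes under two different names and compare hashes."""
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "report.doc"        # hypothetical file name
        second = Path(tmp) / "renamed_copy.bin"  # hypothetical file name
        payload = b"example evidence contents"
        first.write_bytes(payload)
        second.write_bytes(payload)
        return file_hash(first) == file_hash(second)
```

Calling `demo_same_contents()` compares the two digests; because both files hold the same bytes, the comparison succeeds even though the names differ.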
The National Institute of Standards and Technology (NIST) publishes a collection of hash sets called the National Software Reference Library, which contained hashes for approximately 7 million files as of 2004 (www.nsrl.nist.gov).
Files in a hash set typically fall into one of two categories. Known files are known to be “OK” and can typically be ignored; examples include system files such as win.exe and explore.exe. Suspect files are files flagged for further scrutiny because they have been identified as illegal or inappropriate, such as hacking tools, encryption tools, and so on.
A hash analysis automates the process of separating files that can be ignored from files known to be of possible evidentiary value. Once the known files have been identified, they can be filtered out, which may reduce the number of files the investigator must evaluate.
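The filtering step described above can be sketched as a lookup of each file's hash against a known set and a suspect set. This is a minimal illustration under stated assumptions, not a description of any existing forensic tool: the function names, the use of SHA-256, the in-memory dictionary of file contents, and the three result labels are all hypothetical.

```python
import hashlib
from typing import Dict, List, Set


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string (assumed algorithm)."""
    return hashlib.sha256(data).hexdigest()


def hash_analysis(
    files: Dict[str, bytes],
    known_hashes: Set[str],
    suspect_hashes: Set[str],
) -> Dict[str, List[str]]:
    """Sort file names into ignored, flagged, and unreviewed groups.

    files          maps a file name to its contents (hypothetical input form)
    known_hashes   hashes of files known to be "OK"
    suspect_hashes hashes of files flagged for further scrutiny
    """
    result: Dict[str, List[str]] = {"ignored": [], "flagged": [], "unreviewed": []}
    for name, data in files.items():
        digest = sha256_hex(data)
        if digest in suspect_hashes:
            # A suspect match takes priority over a known match.
            result["flagged"].append(name)
        elif digest in known_hashes:
            result["ignored"].append(name)
        else:
            # Neither set recognizes the file, so an examiner must still review it.
            result["unreviewed"].append(name)
    return result
```

In this sketch, only the files in the "flagged" and "unreviewed" groups remain for the investigator, which is the data reduction effect the paragraph describes.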
Using hash systems to identify conclusive or known suspect files faces several challenges. Such systems cannot be used to identify multimedia files (image, video, and sound) that have been altered, whether minimally or substantially. As a consequence, an individual using these files to commit crimes may escape prosecution.
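The limitation described above follows from the nature of a cryptographic hash: changing even a single bit of a file yields a different hash, so an exact-match lookup no longer recognizes the file. The sketch below illustrates this under assumptions; the byte sequence stands in for real image data, and the names `exact_match` and `flip_one_bit` and the use of SHA-256 are hypothetical.

```python
import hashlib
from typing import Set


def exact_match(data: bytes, suspect_hashes: Set[str]) -> bool:
    """Return True if the SHA-256 digest of data appears in the suspect set."""
    return hashlib.sha256(data).hexdigest() in suspect_hashes


def flip_one_bit(data: bytes, position: int) -> bytes:
    """Return a copy of data with the lowest bit of one byte inverted."""
    altered = bytearray(data)
    altered[position] ^= 0x01
    return bytes(altered)


# Stand-in for the bytes of a known suspect image (not real image data).
original = bytes(range(256)) * 4
suspect_hashes = {hashlib.sha256(original).hexdigest()}

# A minimal alteration: a single bit changed in a single byte.
altered = flip_one_bit(original, 10)

original_is_found = exact_match(original, suspect_hashes)
altered_is_found = exact_match(altered, suspect_hashes)
```

Here `original_is_found` is true and `altered_is_found` is false, even though the two byte sequences differ in only one bit.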
In addition, some law enforcement and intelligence agencies maintain disparate digital fingerprint hash sets, but no such agency currently has a system to create, catalog, and maintain its suspect data files. Although agencies are aware of the known suspect data or files, they do not have a comprehensive management system to catalog and maintain these data.
Digital forensic analysis tools used today are standalone systems that are not coordinated with the systems used by agency analysts and information technology (IT) staff. Agencies do not share information at an optimal level. This has become increasingly important since the terrorist attacks of Sep. 11, 2001, which created a strong demand for greater information sharing between law enforcement agencies. A primary reason this has not been achieved is that there are security risks associated with sharing classified data.
It would be beneficial and desirable to integrate newer, advanced hash technologies to automate the detection and classification of suspect files and to identify altered files. This would allow law enforcement to focus on identifying conclusive data during the forensic process and would address many of the problems facing digital forensic examinations today. It would also be desirable to enable agencies to manage and share key suspect files and to use a common language to define an investigative strategy and data search.