This invention relates to a forensic tool for use in retrieval and analysis of evidence stored in computer readable media.
In recent years, personal computers have become a major part of everyday life. They are used for e-mail, to run word processing programs, to analyze numbers, and as tools that can aid in the completion of almost any task. They have become commonplace and are used in business as well as in the home. Unfortunately, the migration to personal computers has not been limited to honest individuals. Computers have also become tools that are used by criminals to perform any number of tasks. As a result, law enforcement agencies have found it necessary to become more and more familiar with computers and related evidence. Because computer data is stored magnetically and on a variety of storage media, computer evidence processing has evolved as a forensic science. Almost all major law enforcement agencies and all military agencies in the United States have developed computer crime units.
As a result of the increased use of personal computers, documentary evidence has been transformed during the past several years from paper documents to computer data stored on floppy diskettes, computer hard disk drives, Zip disks, Jaz disks and read/writable CD-ROMs. These high technology, high capacity storage devices have the potential to store the equivalent of thousands or even hundreds of thousands of printed pages. Additionally, the nature of computer technology has created multiple data storage layers in which potential computer evidence resides in a transitory state.
The existence of much of the data contained on a computer hard disk drive is unknown to the computer user whose work session created the data. As a result, such data has the potential of providing useful information for investigators, internal auditors and others who have an interest in computer evidence issues. Such incidental data, which exists on a storage medium as an artifact of the system rather than by an intent of the user, is referred to as “ambient data.” The term “ambient data” is used below to refer to any large data object of mixed binary and textual content. The information in the ambient data may provide a truer picture of the computer use than the information of which the user is aware and which the user can easily modify. The investigator can use leads gleaned from the ambient data to search the data in allocated file space.
Primarily these levels of data storage deal with data that is contained in files, previously erased files (or fragments of such files) and file slack (defined below). Regarding data created by the Microsoft Windows operating environment, relevant data or data fragments potentially exist in what is known as the Windows swap file. Each of these ambient data sources of evidence is discussed in more detail below.
File Slack
Computer storage media are typically divided up into storage units called sectors. Each sector typically contains 512 bytes of data. For efficiency in managing large storage media, most computer operating systems group one or more sectors into a larger unit, known as an allocation unit or cluster, and allocate an integral number of clusters to each file. The cluster size is determined by the version of DOS or Windows involved as well as the type of hard disk, floppy diskette or other storage media involved.
File Slack or slack space is the area between the end of the file and the end of the last cluster that the operating system has assigned to the file. This area is automatically filled with random data from the computer memory by the operating system. File slack may contain information that the computer user believes has been removed from the computer. There will always be some file slack in the last cluster of a file unless, coincidentally, the file size exactly matches the size of one or more clusters. In such rare cases, no file slack will exist at all. File slack is not part of the actual file. The computer user, therefore, does not usually know about the existence of this storage area and has no ability to evaluate the content without specialized forensic software tools. Such tools typically use the file allocation table and directory to compare the true file size with the space allocated to the file to determine the location and size of the file slack. Information found in file slack is useful in internal audits and computer security reviews.
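The relationship between file size, cluster size, and file slack described above can be illustrated with a short calculation. This is a minimal sketch only: the function name and parameters are hypothetical, the 512-byte sector comes from the text, and the 64-sector cluster in the usage note is simply one way to arrive at the 32k cluster mentioned elsewhere in this description.

```python
def file_slack_bytes(file_size: int, sectors_per_cluster: int,
                     bytes_per_sector: int = 512) -> int:
    """Return the number of bytes between the end of a file and the
    end of the last cluster allocated to it (the file slack)."""
    if file_size < 0 or sectors_per_cluster <= 0 or bytes_per_sector <= 0:
        raise ValueError("sizes must be non-negative and cluster geometry positive")
    cluster_size = sectors_per_cluster * bytes_per_sector
    # Bytes of real file data in the last cluster; zero means the file
    # ends exactly on a cluster boundary (or is empty), so no slack exists.
    used_in_last_cluster = file_size % cluster_size
    if used_in_last_cluster == 0:
        return 0
    return cluster_size - used_in_last_cluster
```

With these assumptions, a 1,000-byte file stored in 32,768-byte clusters (`file_slack_bytes(1000, 64)`) leaves 31,768 bytes of slack, while a file of exactly 32,768 bytes leaves none.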
When DOS (or Windows) closes a file, after either creating or updating it, the computer automatically writes one or more clusters to disk. The file slack is created at this time and random data is dumped from the memory of the computer into file slack (the space from the end of the file to the end of the last cluster assigned to the file). By way of example, the storage of data on a computer hard disk drive typically involves cluster sizes that are larger than cluster sizes associated with data stored on floppy diskettes or zip drives. As a result, file slack can potentially be as large as 32,000 bytes. The random data written to file slack can contain almost anything including e-mail messages, passwords, network logons, etc.
Typically the cluster size is one or two sectors for files stored on floppy diskettes, depending on the storage capacity of the diskette involved. In the case of file slack created on large computer hard disk drives, potentially 25% of the hard disk drive's storage capacity can be occupied by file slack on a ‘seasoned’ computer hard disk drive. This is because modern versions of DOS/Windows assign large cluster sizes when hard disk drives are involved, e.g. 32k clusters. Normally these huge cluster sizes occur when only one partition is involved on a high capacity computer hard disk drive.
Even when the parent file is deleted, the file slack remains as unallocated storage space until it is overwritten with the content of a new file. Essentially, memory dumps in file slack can remain for years on a floppy diskette or hard disk drive and the computer user is unaware of the existence of the data. It is interesting to note that approximately 8 printed pages of text can be stored in a 32k cluster and depending on the size of the file involved, file slack can occupy much of this space.
Computer data is relatively fragile and is susceptible to unintentional alteration or erasure. This is especially true regarding file slack because it has some unique and interesting characteristics. As long as the file it is associated with is intact, the file slack remains intact and is relatively safe from alteration. However, if the file is copied from one location to another, the original file slack remains with the original file and new file slack is created and attached to the copied file. Disk defragmentation has no effect on the file slack.
Unallocated Space
When files are deleted using conventional DOS or Windows commands or are automatically deleted by programs such as word processing applications, the data associated with the file is not actually deleted. Although the directory listing of a deleted file is removed and the file allocation table is changed to reflect that the space previously occupied by the file is free, the data itself remains on the computer hard disk drive or floppy diskette until it is eventually overwritten with data from new files. However, the normal process of overwriting previously deleted files can take a long time depending on the size of the storage device involved and the frequency of use. The large volume of stored data associated with previously erased files can contain much information of interest to an investigator. The unallocated space will also contain the file slack that was previously associated with the deleted files.
Windows Swap Files
Windows Swap files are a significant source of potential computer evidence when Windows, Windows for Workgroups, Windows 95 and/or Windows NT operating systems are involved. These files are huge and normally consist of several million bytes of ‘raw’ computer data. Essentially, the Windows Swap file acts as a buffer for use by the operating system as it runs programs, etc. Depending on the version of Windows and the user configuration involved, the files are created dynamically or they are static. Dynamic swap files are automatically created at the beginning of the work session by the operating system and are erased upon termination of the work session by the user. Although a dynamic file is deleted at the end of the Windows session, data from the swap file remains available in the unallocated disk space.
Static swap files are created at the option of the user during the initial work session and remain on the disk after the work session is terminated. The user can configure the system for either type of swap file at their option during system configuration. The size of a typical Windows Swap file can be about 100 megabytes. Because the Windows Swap file acts as a buffer for the operating system, much sensitive information passes through it. Some of the information remains behind in the file when the session is terminated. As a result, this file holds the potential for containing a great deal of useful information for the investigator and/or internal auditor. However, the large file size makes reviewing the swap file extremely time consuming. Evaluation of the content of a swap file typically took several hours or even days.
Temporary Files
Windows and other programs create temporary files that can remain after a computing session and contain data valuable to an investigator. Such files typically have a file extension of .tmp and many are found in the Windows or Windows/system directories.
“Bad” Clusters
The ambient data can be information in sectors that are indicated as unusable in the file allocation table. Most operating systems will indicate that an entire cluster is “bad” or unusable if any part of the cluster is unusable. Some of the sectors that comprise the cluster may still contain valid data that could contain information useful to an investigator.
.DAT Files
Windows creates .DAT files, primarily in the Windows directory and subdirectories thereof, that are also a source of ambient data. Other programs also create such files.
Data contained in file slack, unallocated space (erased files), temporary files, .dat files, and the Windows swap file usually contains a significant amount of non-ASCII data which cannot be viewed or printed using conventional, text-viewing software applications, e.g., a word processing application, the DOS Edit program, the Windows Write program, etc. Such data is commonly referred to as binary data and some of the bytes involved may mistakenly be interpreted by standard application programs to be control characters, e.g. line feed, carriage return, form feed, etc. The equivalent of hundreds or even thousands of printed pages of data can be stored in this form on a standard computer hard disk drive. The viewing or printing of such data can prove to be a challenge for the computer investigator without proper forensic software tools. The evaluation and processing of binary data has been a tedious and time-consuming task. Using conventional forensic processes, the evaluation of file slack, unallocated space and the Windows swap file can be measured in days or even weeks. By way of example, a typical Windows Swap file consists of hundreds of millions of bytes of data. It can take several days to properly analyze just one of these files using conventional means.
New Technologies, Inc., the assignee of the present invention, provides tools to law enforcement agencies, corporations, and government agencies that capture the ambient data from file slack and unallocated space and remove much of the binary data from it. There still remains, however, an enormous amount of information that can take an investigator many hours to review. Thus, it has been impossible for an investigator to investigate many computers in a short period of time, as may be necessary, for example, in an organization having many computers that must be checked for evidence with minimal disruption of the work environment.
In accordance with the invention, a tool is provided that permits an investigator, auditor, or security specialist to quickly review large quantities of information that is stored in ambient data on a computer.
Accordingly, it is an object of the present invention to provide a method to permit an investigator to quickly review large quantities of information that are stored in ambient data on computer-readable media.
It is a further object of the present invention to provide a method that allows an investigator to quickly find names, keyboard input, English language sentences, e-mail addresses, and Internet universal resource locators (URLs) in ambient data on computer readable media.
It is still another object of the invention to provide a method using character pattern recognition including inclusionary and exclusionary rules to distill potentially useful investigatory leads from large amounts of ambient data by eliminating information unlikely to be useful.
It is yet a further object of the invention to provide a method for chronicling use of one or more computers.
It is yet another object of the invention to provide a method of removing sensitive information in ambient data from a computer-readable storage medium.
The invention provides a method of quickly and automatically evaluating information in ambient data on computer readable media. The invention presents an investigator a greatly reduced amount of information in which useful investigative leads are concentrated. The invention performs, in effect, an intelligent compression of a large amount of mostly uninteresting data into a much smaller amount of useful information. Rather than merely being a text search engine, the invention excludes huge amounts of the ambient data from its output by eliminating the majority of information that is unlikely to be of interest to the investigator.
The ambient data is preferably copied to a second computer for analysis to preserve the ambient data on the original media. The non-textual, binary data is removed, and the remaining data is automatically, intelligently analyzed. The analysis seeks patterns in the characters in the ambient data files. The existence of particular patterns in the characters indicates that the characters contain information of a particular type.
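The removal of non-textual, binary data can be sketched as follows. This is an illustrative assumption about one way to perform the step, not the method of the invention itself: it keeps only runs of printable ASCII characters of a minimum length, and the function name and the `min_length` threshold are hypothetical.

```python
import re
from typing import List


def extract_text_runs(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of printable ASCII characters found in raw ambient data.

    Bytes outside the printable range (0x20 to 0x7E) are treated as binary
    data and discarded; runs shorter than min_length are also dropped
    because they are usually coincidental.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]
```

For example, applied to a copy of a swap file read in binary mode, such a function would return only the textual fragments for the later pattern analysis, leaving the original media untouched.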
The rules for defining patterns include testing for the sequence and proximity of character types, specific characters, or groups of characters, including specific words, names, and abbreviations. Rules can be inclusionary or exclusionary. The investigator can specify the type of information being sought. Because text that does not fit the patterns associated with that type of information is eliminated, the output presented to the investigator is greatly reduced in size and includes a high concentration of useful investigative information.
For example, certain patterns of vowels, consonants, numbers, and punctuation are likely to indicate the presence of keyboard input, which may correspond to, for example, passwords. English words typically correspond to a small number of patterns of vowels and consonants and are thereby recognizable. Certain groupings of numbers represent different types of information, such as social security numbers or telephone numbers. Certain other combinations of vowels and consonants represent keyboard input, but not English language words.
Other patterns represent the presence of English language sentence structure. For example, when the presence of certain punctuation marks is detected, characters in the immediate vicinity can be compared to a word list to determine whether the data includes English language sentences that may be of interest to an investigator.
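The punctuation-and-word-list test described above might look like the following sketch. The word list, the function name, and the `min_hits` threshold are all hypothetical placeholders; a practical word list would be far larger, and the actual rules of the invention are not specified at this level of detail.

```python
import re
from typing import List, Set

# Hypothetical, deliberately tiny word list used only for illustration.
COMMON_WORDS: Set[str] = {
    "the", "a", "an", "is", "are", "was", "to", "of", "and", "in", "on",
    "at", "for", "we", "you", "i", "will", "meet", "me", "it", "that",
}

# A run of words separated by spaces, commas, or semicolons that ends in
# sentence punctuation.
CANDIDATE_RE = re.compile(r"((?:[A-Za-z']+[ ,;]+)+[A-Za-z']+)[.?!]")


def likely_sentences(text: str, min_hits: int = 2) -> List[str]:
    """Return punctuated word runs in which at least min_hits words
    appear in the word list."""
    results = []
    for match in CANDIDATE_RE.finditer(text):
        words = re.findall(r"[A-Za-z']+", match.group(1))
        hits = sum(1 for word in words if word.lower() in COMMON_WORDS)
        if hits >= min_hits:
            results.append(match.group(0))
    return results
```

In this sketch, a fragment such as "Meet me at the dock at noon." is kept because several of its words are in the list, whereas a punctuated run of random letters is discarded.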
Another type of pattern represents Internet e-mail addresses and universal resource locators (URLs). Many Internet servers maintain a “firewall” between their data and the Internet to increase the security of the information on the server. The firewall assigns aliases to individuals behind the firewall, and such aliases are of less use to an investigator than an actual e-mail address that can be associated with an individual or a specific account. The pattern of characters in a firewall alias is typically different from that in a normal e-mail address. By analyzing the patterns in the ambient data, it is possible to identify e-mail addresses that are not firewall aliases and present an investigator only with e-mail addresses and URLs that are likely to have a high concentration of investigative leads.
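A rough sketch of this kind of inclusionary and exclusionary filtering is given below. The regular expressions are simplified approximations of e-mail address and URL syntax, and the alias test (a local part with no vowels, or one made up mostly of digits) is purely an illustrative assumption, since the text does not state what a firewall alias looks like. All names are hypothetical.

```python
import re
from typing import List, Tuple

# Simplified patterns; real address and URL grammars are more permissive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
URL_RE = re.compile(r"(?:https?|ftp)://[^\s\"'<>]+")


def looks_like_alias(address: str) -> bool:
    """Hypothetical exclusionary rule: treat machine-generated looking
    local parts (no vowels, or mostly digits) as firewall aliases."""
    local_part = address.split("@", 1)[0]
    digit_count = sum(ch.isdigit() for ch in local_part)
    has_vowel = any(ch in "aeiouAEIOU" for ch in local_part)
    return (not has_vowel) or digit_count * 2 > len(local_part)


def find_internet_leads(text: str) -> Tuple[List[str], List[str]]:
    """Return (e-mail addresses that pass the alias filter, URLs)."""
    emails = [addr for addr in EMAIL_RE.findall(text)
              if not looks_like_alias(addr)]
    urls = URL_RE.findall(text)
    return emails, urls
```

The design point is that the inclusionary pattern finds every candidate address, and the exclusionary rule then discards those unlikely to identify a person, so the investigator sees a shorter list.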
Another type of character pattern represents names. The invention recognizes first names and nicknames, and then captures data surrounding the first name or nickname to obtain possible complete names.
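As a hedged illustration of this name-capture idea, the sketch below looks for tokens that appear in a list of first names and nicknames and then collects the capitalized tokens that follow. The list contents, the function name, and the choice to look only at following (not preceding) tokens are assumptions made for brevity.

```python
import re
from typing import List, Set

# Hypothetical list; a practical list would hold thousands of entries.
FIRST_NAMES: Set[str] = {"bob", "john", "mary", "liz"}


def find_possible_names(text: str, max_following: int = 2) -> List[str]:
    """Return each recognized first name together with up to
    max_following capitalized tokens that immediately follow it."""
    tokens = re.findall(r"[A-Za-z]+", text)
    names = []
    for index, token in enumerate(tokens):
        if token.lower() not in FIRST_NAMES:
            continue
        parts = [token]
        for following in tokens[index + 1:index + 1 + max_following]:
            if not following[0].isupper():
                break  # a lowercase token ends the candidate name
            parts.append(following)
        names.append(" ".join(parts))
    return names
```

Because the tokenizer keeps letters only, a middle initial such as "Q." is captured without its period.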
Another type of character pattern represents certain types of files downloaded from the Internet. Such files include graphics files, such as .GIF, .JPG, and .BMP files, that may contain inappropriate or illegal content and compressed (zipped) files that can contain hidden data. Such files can be recognized by finding particular punctuation and file designations in a particular order and proximity, and then reviewing the characters for the presence of specified words that indicate content of interest.
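One possible reading of this file-reference test is sketched below: a file designation is found by its punctuation and extension, and the surrounding characters are then checked for specified words. The extension list follows the examples in the text; the function name, the 40-character context window, and the keyword list passed by the caller are hypothetical.

```python
import re
from typing import List, Tuple

# File name followed by one of the extensions named in the text.
FILE_RE = re.compile(r"[A-Za-z0-9_~-]+\.(?:gif|jpg|bmp|zip)\b", re.IGNORECASE)


def find_flagged_files(text: str, keywords: List[str],
                       context: int = 40) -> List[Tuple[str, List[str]]]:
    """Return (file name, matched keywords) for each file reference whose
    name or nearby text contains at least one keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    hits = []
    for match in FILE_RE.finditer(text):
        start = max(0, match.start() - context)
        window = text[start:match.end() + context].lower()
        matched = [keyword for keyword in lowered if keyword in window]
        if matched:
            hits.append((match.group(0), matched))
    return hits
```

File references with no keyword in their vicinity are omitted from the result, which keeps the output focused on content of interest.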
Other detectable patterns include telephone numbers and social security numbers. The analysis can also include testing not only for the presence, but also for the order and proximity, of characters or groups of characters.
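Number groupings of this kind lend themselves to simple pattern tests. The sketch below assumes common United States formats (three-two-four digits for social security numbers; an area code followed by three and four digits for telephone numbers); these formats and all names are assumptions for illustration rather than rules stated in the text.

```python
import re
from typing import Dict, List

# Assumed formats: 123-45-6789 for social security numbers;
# (503) 555-0142 or 503-555-0142 for telephone numbers.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.]\d{4}\b")


def find_number_patterns(text: str) -> Dict[str, List[str]]:
    """Return social security number and telephone number candidates."""
    return {
        "ssn": SSN_RE.findall(text),
        "phone": PHONE_RE.findall(text),
    }
```

The order and grouping of the digits, rather than the digits themselves, decide which category a match falls into.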
In one preferred embodiment, after binary data is removed from the file, different types of characters, such as vowels, consonants, letters, numbers, punctuation marks, and certain symbols, are replaced with symbols, such as “C” for consonants, “V” for vowels, etc. The order of the symbols representing the types of characters is analyzed to determine what the pattern is likely to represent. Content that may be of interest is written to an output file, optionally annotated to indicate why it may be of interest.
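The symbol substitution in this embodiment can be sketched as follows. Only "C" and "V" are taken from the text; the symbols "N" for digits and "P" for punctuation, the treatment of "y" as a consonant, and the rule used in `looks_like_word` are illustrative assumptions, as are all function names.

```python
import re

VOWELS = set("aeiouAEIOU")


def class_symbol(ch: str) -> str:
    """Map one character to a symbol for its character type."""
    if ch in VOWELS:
        return "V"
    if ch.isalpha():
        return "C"  # any other letter, including "y", counts as a consonant
    if ch.isdigit():
        return "N"  # assumed symbol for numbers
    if ch.isspace():
        return " "  # whitespace is kept so that word boundaries survive
    return "P"      # assumed symbol for punctuation and other symbols


def to_pattern(text: str) -> str:
    """Replace every character with the symbol for its type."""
    return "".join(class_symbol(ch) for ch in text)


def looks_like_word(pattern: str) -> bool:
    """Hypothetical rule: letters only, at least one vowel, and no run
    of four or more consonants."""
    return (re.fullmatch(r"[CV]+", pattern) is not None
            and "V" in pattern
            and "CCCC" not in pattern)
```

Under these assumptions, `to_pattern("Hello, world 42")` yields `"CVCCVP CVCCC NN"`, and the pattern for "secret" passes the word test while the patterns for "xkqzpt" and "pa55word" do not, marking the latter as possible keyboard input rather than English text.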
The invention can be used to identify Internet items. Much Internet activity is stored in the Windows swap file, and the analysis of that file can be reduced to minutes. The same is true of the analysis of file slack and erased file space.
The output of the analysis can be written into a file in a suitable form. For example, if English language text is sought, the output may be written into a text file. A database file format may be more useful as output of other analyses, such as for a list of e-mail addresses, URLs, or names of files and associated times and/or dates.
The subject matter of the present invention is particularly pointed out and distinctly claimed in the concluding portion of this specification. However, both the organization and method of operation, together with further advantages and objects thereof, may best be understood by reference to the following description taken in connection with accompanying drawings wherein like reference characters refer to like elements.