The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Existing methods to recover hidden data files from computational storage devices are tedious and time-consuming. Solid state drives (SSDs) are complex computational storage devices that use NAND flash memory chips. Such memory devices have a high data storage capacity; however, they are difficult to manage because existing data in a chip cannot simply be overwritten, but rather must first be erased before the chip can be written again. Furthermore, data must be erased in large blocks, on the order of a million bytes, although it may be written in smaller units, on the order of thousands of bytes. These constraints pose a problem for forensic analysts and others seeking to recover hidden data because they add significant time to the process and thus potentially prevent successful hidden data recovery from ever being accomplished.
In an effort to make the above memory chips easier to use and data more accessible, a flash translation layer (FTL) of software is included in the SSD to handle the details of deleting old data and writing new data, thus taking the burden of this task away from the host operating system. A memory array of the SSD has two spaces: a logical block address (LBA) space and a physical block address (PBA) space. These spaces are overlaid spaces. The LBA space is the data structure that the host computer sees and comprises the sectors in which data is stored. The PBA space is the memory provided by flash chips, and is generally up to 20% larger than the LBA space, depending on the particular configuration. The LBA space is mapped into the PBA space by the FTL software.
A legacy hard disk drive (HDD), a similar device, has a simpler configuration in that its LBA and PBA spaces essentially have a corresponding size ratio of one-to-one, with the PBA being only a fraction of a percent larger than the LBA.
The extra PBA space in the SSD is referred to as over provisioning space and has several purposes. These purposes include storing SSD firmware (which is the firmware that runs the SSD's internal microcontroller and which is typically 100K to 200K bytes, though this range is not provided to be limiting), NAND flash wear leveling, bad block management, housing FTL management tables, and garbage collection.
Wear leveling is a type of software algorithm that distributes the reading and writing activities evenly among the flash chips on the SSD. This is needed because NAND flash exhibits rapid wear out mechanisms, resulting in degradation of the data written to the SSD. FTL management tables comprise memory storage for the LBA/PBA mapping table, which can be gigabytes in size. They also include other general task, or housekeeping, information. Garbage collection is a software algorithm which collects and erases currently unused but previously written areas in flash memory in order to prepare clean sections for future writes and avoid delays in erasing. All of the above functions are well known.
A side effect of the operation of the SSD is that forensically valuable data gets moved out of the LBA space to locations where it cannot be accessed via the host computer interface. The management complexity and the non-one-to-one LBA and PBA memory spaces of the SSD (in contrast to the legacy HDD) further impede successful data recovery and force individuals who want to recover the data, such as forensic analysts, to attempt to reverse engineer the algorithms in the SSD to obtain the hidden data, which can be very time consuming.
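The way data can leave the LBA space while remaining in physical flash may be illustrated with a toy model. The following is a minimal sketch only, assuming a simplified FTL that always writes to the next free physical page; the class ToyFTL and its methods are hypothetical names introduced for illustration and do not describe the firmware of any actual SSD.

```python
class ToyFTL:
    """Toy flash translation layer: a small LBA space mapped onto a larger PBA space."""

    def __init__(self, logical_pages, physical_pages):
        if physical_pages <= logical_pages:
            raise ValueError("the PBA space must be larger than the LBA space")
        self.logical_pages = logical_pages
        self.flash = [None] * physical_pages  # PBA space: raw physical pages
        self.mapping = {}                     # LBA -> PBA table kept by the FTL
        self.next_free = 0                    # simplistic append-only allocator

    def write(self, lba, data):
        if not 0 <= lba < self.logical_pages:
            raise IndexError("LBA out of range")
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free pages; a real FTL would garbage collect here")
        # Flash cannot be overwritten in place, so new data goes to a fresh page
        # and the mapping table is updated to point at it.
        self.flash[self.next_free] = data
        self.mapping[lba] = self.next_free
        self.next_free += 1

    def read(self, lba):
        """Return what the host sees at this LBA, or None if never written."""
        pba = self.mapping.get(lba)
        return None if pba is None else self.flash[pba]

    def stale_pages(self):
        """Return written physical pages that no LBA maps to any longer."""
        live = set(self.mapping.values())
        return [d for i, d in enumerate(self.flash) if d is not None and i not in live]


ftl = ToyFTL(logical_pages=4, physical_pages=5)
ftl.write(0, b"old contents")
ftl.write(0, b"new contents")
print(ftl.read(0))        # the host sees only the latest version
print(ftl.stale_pages())  # the earlier version still sits in physical flash
```

In this sketch the earlier version of the data is unreachable through read(), yet it persists in the physical pages until it is erased, which corresponds to the hidden data discussed above.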
Referring now to FIG. 1, a block diagram is shown of an exemplary relationship between the PBA space and the LBA space in the memory array of an SSD. Importantly, the PBA space 100 is usually larger than the LBA space 101 (typically by about seven percent (7%), but in some instances by up to twenty percent (20%)) within the memory array 10 of the SSD. For example, an SSD with a 128 GB storage capacity would have about 119 GB of LBA space available for file storage. The remaining 9 GB would be used as over provisioning space and may be a resource for wear leveling, garbage collection, and other SSD firmware functions. The LBA space 101 is a logical representation, and does not reveal where in the memory devices data is stored. The PBA space 100 consists of memory in the form of physical flash memory chips. As shown, the LBA space 101 is a subset of the PBA space 100.
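The arithmetic of the 128 GB example may be checked with a short calculation. This is a sketch only; the function name over_provisioning is hypothetical, and the two ratios are shown because the percentage can be quoted relative to either the LBA space or the PBA space.

```python
def over_provisioning(pba_gb, lba_gb):
    """Return the spare capacity and its size relative to each address space."""
    spare_gb = pba_gb - lba_gb
    pct_of_lba = 100.0 * spare_gb / lba_gb  # how much larger the PBA space is than the LBA space
    pct_of_pba = 100.0 * spare_gb / pba_gb  # share of the physical flash that is spare
    return spare_gb, pct_of_lba, pct_of_pba


spare, pct_lba, pct_pba = over_provisioning(128, 119)
print(f"{spare} GB spare: {pct_lba:.1f}% of LBA space, {pct_pba:.1f}% of PBA space")
```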
The data that is stored on SSDs and HDDs is organized in aggregations known as sectors. A sector is typically 512 bytes in length, but may be larger. The NAND flash chips can accommodate sector-based data storage, and so, in most cases, an integral number of LBA sectors is stored in each physical flash memory page.
There are currently two main methods to read the over provisioning space as a first step to recover hidden data. The first method consists of using custom read commands over the host interface port of the SSD. However, these commands are proprietary rather than standardized, do not exist at all for most SSD models, and, where they do exist, may be password protected or encrypted. These characteristics make it hard for individuals to access the hidden data.
The second method of reading the over provisioning space consists of reading the flash chips directly. This can be done by removing the flash chips and inserting them into a reading device that reads and stores their contents. Removing the flash chips involves desoldering them from the memory array. Alternatively, the flash chips may be read by electronic means while they remain installed on the SSD circuit.
There are several remaining steps currently required to recover hidden data from an SSD. After the flash memory chips are read and the data is saved as a PBA image, the LBA space is read over the host computer interface and the data is saved as an LBA image. Next, the flash memory errors in the PBA image are corrected, if possible. The error correction information is deleted from the image, leaving only data. The PBA and LBA images are then compared, noting which sectors match in each image. Finally, the unmatched PBA sectors are separated and stored as hidden data.
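The comparison and separation steps above can be sketched as follows. This is a minimal illustration under stated assumptions: both images are already held in memory as byte strings, the PBA image has been error corrected and stripped of error correction information, and sectors are 512 bytes and sector aligned. The names split_sectors, find_hidden_sectors, and make_sector are hypothetical.

```python
SECTOR_SIZE = 512  # assumed sector length in bytes


def split_sectors(image, size=SECTOR_SIZE):
    """Split an image into whole sectors, ignoring any trailing partial sector."""
    return [image[i:i + size] for i in range(0, len(image) - size + 1, size)]


def find_hidden_sectors(pba_image, lba_image):
    """Return the PBA sectors whose contents appear nowhere in the LBA image."""
    visible = set(split_sectors(lba_image))
    return [s for s in split_sectors(pba_image) if s not in visible]


def make_sector(tag):
    """Build a recognizable test sector filled with one byte value."""
    return bytes([tag]) * SECTOR_SIZE


# The LBA image holds sectors 1 and 2; the PBA image holds them in a
# different order plus one sector (9) that the host can no longer see.
lba_image = make_sector(1) + make_sector(2)
pba_image = make_sector(2) + make_sector(9) + make_sector(1)
hidden = find_hidden_sectors(pba_image, lba_image)
print(len(hidden), hidden[0][:4])
```

Note that this sketch holds every raw LBA sector in memory at once, which would not be practical for the image sizes encountered on real devices.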
The existing process described above has several issues. First, the format of the data within a flash memory chip varies greatly depending on the make and model of the SSD, as well as the make and model of the memory chip. This format must be determined before any hidden data can be recovered, which takes time. Additionally, the error correction code (ECC) that is used to correct flash memory bit errors is typically unknown and is not published by the SSD manufacturer. It therefore may not be possible to correct errors in the raw data that is read from the flash chips. Further complicating matters is that the amount of data may be huge, reaching as high as the terabyte range. This means that the algorithms that are used must be of low complexity and have a low run time for large data sets; otherwise, processing could take days, weeks, or longer to complete. The standard approach to this problem is to represent each sector with a short hash value. For example, the use of an eight-byte hash value for each sector would reduce the data storage requirements by about 98.4% compared to handling the raw 512-byte sectors. Provisions would need to be made to handle hash collisions.
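The hash-based representation may be sketched as follows. This is an illustration only: the choice of BLAKE2b truncated to eight bytes is an assumption made for the example, since no particular hash function is named above, and the function names are hypothetical. Hash collisions are handled here by keeping every offset that shares a hash value and confirming candidates against the raw sector bytes.

```python
import hashlib

SECTOR_SIZE = 512  # assumed sector length in bytes
HASH_SIZE = 8      # eight-byte hash value per sector


def sector_hash(sector):
    """Eight-byte digest of one sector (BLAKE2b is an assumed choice)."""
    return hashlib.blake2b(sector, digest_size=HASH_SIZE).digest()


def build_index(image):
    """Map each sector hash to the list of byte offsets where it occurs."""
    index = {}
    for offset in range(0, len(image) - SECTOR_SIZE + 1, SECTOR_SIZE):
        key = sector_hash(image[offset:offset + SECTOR_SIZE])
        index.setdefault(key, []).append(offset)
    return index


def matching_offsets(index, image, sector):
    """Look up a sector by hash, then confirm each candidate byte for byte."""
    candidates = index.get(sector_hash(sector), [])
    return [o for o in candidates if image[o:o + SECTOR_SIZE] == sector]


# Storage needed for the hashes relative to the raw sectors.
reduction = 100.0 * (1 - HASH_SIZE / SECTOR_SIZE)
print(f"hashes are {reduction:.1f}% smaller than the raw sectors")

image = b"\x01" * SECTOR_SIZE + b"\x02" * SECTOR_SIZE + b"\x01" * SECTOR_SIZE
index = build_index(image)
print(matching_offsets(index, image, b"\x01" * SECTOR_SIZE))
```

The byte-for-byte confirmation requires access to the raw sectors, so in practice it would be applied only to the small number of candidates returned by the hash lookup.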
However, even with the above hash optimization, the LBA and PBA images still bear no relation to one another due to the wear leveling algorithm used in the SSD, which significantly fragments stored files. This means that an LBA image file that is stored in contiguous sectors will be distributed over a large area of the PBA image with no simple mapping relationship, that mapping relationship being different for every make and model of SSD, as well as changing as a given SSD is used. This means that the matching process could potentially be an order n² process, where n is the number of sectors, which would be quite slow.
The above process works well when the errors in the PBA image have been corrected. If they have not been corrected, then any bit errors in the PBA sector source data will change the resulting hash values and prevent them from matching those of the corresponding LBA sectors.
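The sensitivity of a conventional hash to bit errors can be demonstrated with a short example. This is a sketch only, assuming an eight-byte BLAKE2b digest as the sector hash; the helper names flip_bit and hamming_distance are hypothetical.

```python
import hashlib


def sector_hash(sector):
    """Eight-byte digest of one sector (BLAKE2b is an assumed choice)."""
    return hashlib.blake2b(sector, digest_size=8).digest()


def flip_bit(sector, bit_index):
    """Return a copy of the sector with exactly one bit inverted."""
    corrupted = bytearray(sector)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def hamming_distance(a, b):
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = bytes(range(256)) * 2      # one 512-byte sector of test data
corrupted = flip_bit(original, 1000)  # simulate a single uncorrected bit error

print(hamming_distance(original, corrupted))            # sectors differ in one bit
print(sector_hash(original) == sector_hash(corrupted))  # hashes no longer match
print(hamming_distance(sector_hash(original), sector_hash(corrupted)))
```

Although only one of the 4096 bits in the sector differs, the two hash values are unrelated, so an exact hash comparison treats the corrupted PBA sector as having no counterpart in the LBA image.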
Given the foregoing, what is needed are methods which facilitate identifying and recovering data that is normally hidden in NAND flash memory arrays in SSDs and is normally inaccessible using host computer interfaces, without having to reverse engineer the algorithms in the SSD, using a hash value that is tolerant of some small percentage of bit errors in the source data.