1. Field of the Invention
The present invention relates generally to ZIP™ archive files, and in particular, to a method, apparatus, and article of manufacture for the dynamic manipulation of ZIP™ archive files supporting efficient read-write access for in-place editing, growth, defragmentation and recoverable deletion.
2. Description of the Related Art
Files and folders can consume a considerable amount of memory and storage. The electronic transmission of large files and folders can be slow based on the limited bandwidth of the available transmission medium. Additionally, it is often desirable to archive files and folders. To overcome such size and transmission constraints as well as to support archiving, files and folders are often compressed and/or stored in an archival format. The ZIP™ file format is one such data compression and archival format. However, in the prior art, when any file stored in a ZIP™ file has been edited, the entire ZIP™ file must be rewritten. Accordingly, there is limited capability for, and an aversion to, performing any read-write manipulation of an interior file or subsection of data within an archive. These problems may be better understood with an explanation of the ZIP™ specification and prior art solutions.
Fundamentally, the ZIP™ specification describes a delimited linear arrangement of embedded files each preceded by an informational header and the whole file suffixed by a central directory or table of contents. Such a format is a convenient read-only, write-once archive suitable for collecting hierarchies of files and folders. However, this design does not lend itself well to read-write manipulation of any interior file or subsection of data within the archive. Traditionally, such limitations have been overlooked or left unchallenged as the format has primarily been used as an archival mechanism. Increasingly though, applications are venturing into using openly accessible multi-part formats such as ZIP™ for native data storage. It is not new that application files are of this nature, but the open ZIP™ format is becoming (if it has not already become) the de facto standard for such implementations. The appeal is that a multitude of free and for-fee applications and libraries exist for reliably manipulating file archives of this format. The specification is open and available and source code exists in the public domain for reference and usage.
In view of the above, some prior art products either do not use ZIP™ files for native data storage and instead write custom solutions, use structured storage (e.g., Microsoft™ structured storage) or simply rewrite the entire archive file with every change. However, such solutions are limited and inflexible.
FIG. 1 illustrates the overall structure of the ZIP™ format. The local file headers 102/102N provide information relating to the file data 104/104N that consists of the actual compressed file or stored data for the file. The series of local file header 102/102N, file data 104/104N, and data descriptor 106/106N repeats for each file 104/104N in the ZIP™ archive. The data within the local file headers 102/102N includes a local file header signature, the version needed to extract the data, a general purpose bit flag, the compression method, the last modified file date and time, the 32 bit checksum CRC32, the compressed size, the uncompressed size, the file name length, the extra field length, the file name, and the extra field.
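By way of example only, the fixed portion of a local file header 102/102N may be decoded as sketched below in Python. The function name `parse_local_file_header` and the dictionary keys are hypothetical names chosen for illustration. The field widths and the signature value 0x04034b50 are assumed from the APPNOTE.TXT layout, which also places an extra field length and an extra field in the header.

```python
import io
import struct
import zipfile

# Fixed 30-byte portion of a local file header (little-endian), assumed from APPNOTE.TXT.
LOCAL_HEADER_FORMAT = "<IHHHHHIIIHH"
LOCAL_HEADER_SIZE = struct.calcsize(LOCAL_HEADER_FORMAT)
LOCAL_HEADER_SIGNATURE = 0x04034B50


def parse_local_file_header(buf, offset=0):
    """Decode the local file header that starts at `offset` in `buf` (hypothetical helper)."""
    (signature, version_needed, flags, method, mod_time, mod_date,
     crc32, compressed_size, uncompressed_size,
     name_length, extra_length) = struct.unpack_from(LOCAL_HEADER_FORMAT, buf, offset)
    if signature != LOCAL_HEADER_SIGNATURE:
        raise ValueError("no local file header at offset %d" % offset)
    name_start = offset + LOCAL_HEADER_SIZE
    return {
        "version_needed": version_needed,
        "flags": flags,
        "compression_method": method,
        "mod_time": mod_time,
        "mod_date": mod_date,
        "crc32": crc32,
        "compressed_size": compressed_size,
        "uncompressed_size": uncompressed_size,
        "file_name": bytes(buf[name_start:name_start + name_length]),
        # The file data begins after the variable-length file name and extra field.
        "data_offset": name_start + name_length + extra_length,
    }


# Usage: build a one-entry archive in memory and decode its first local header.
_stream = io.BytesIO()
with zipfile.ZipFile(_stream, "w", zipfile.ZIP_STORED) as archive:
    archive.writestr("hello.txt", b"hello world")
sample = _stream.getvalue()
header = parse_local_file_header(sample)
```

Because the file name and extra field are variable in length, the position of the file data 104/104N can only be computed after the fixed fields have been read.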
The data descriptor 106/106N exists only if bit 3 of the general purpose bit flag is set. It is byte aligned and immediately follows the last byte of compressed data. This descriptor is used only when it was not possible to seek in the output .ZIP file, e.g., when the output .ZIP file was standard output or a non-seekable device. It includes a 32-bit checksum value to detect the accidental alteration of data during transmission or storage, as well as the compressed and uncompressed sizes of the file data 104/104N. For ZIP64™ format archives, the compressed and uncompressed sizes are 8 bytes each.
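As a non-limiting illustration, the bit 3 test and the decoding of a data descriptor 106/106N may be sketched in Python as follows. The helper names are hypothetical. The optional leading signature 0x08074b50 is an assumption taken from APPNOTE.TXT, which permits but does not require it.

```python
import struct

FLAG_DATA_DESCRIPTOR = 1 << 3           # bit 3 of the general purpose bit flag
DATA_DESCRIPTOR_SIGNATURE = 0x08074B50  # optional marker, assumed from APPNOTE.TXT


def has_data_descriptor(flags):
    """Return True when bit 3 of the general purpose bit flag is set."""
    return bool(flags & FLAG_DATA_DESCRIPTOR)


def parse_data_descriptor(buf, offset, zip64=False):
    """Decode a data descriptor at `offset` (hypothetical helper).

    Caveat: a checksum that happens to equal the signature value would be
    misread as the signature; this sketch does not resolve that ambiguity.
    """
    (first_word,) = struct.unpack_from("<I", buf, offset)
    if first_word == DATA_DESCRIPTOR_SIGNATURE:
        offset += 4
    # Sizes are 8 bytes each for ZIP64 archives and 4 bytes each otherwise.
    fmt = "<IQQ" if zip64 else "<III"
    crc32, compressed_size, uncompressed_size = struct.unpack_from(fmt, buf, offset)
    return {
        "crc32": crc32,
        "compressed_size": compressed_size,
        "uncompressed_size": uncompressed_size,
        "end_offset": offset + struct.calcsize(fmt),
    }


# Usage: one descriptor without the signature, one ZIP64 descriptor with it.
plain = struct.pack("<III", 0x1234ABCD, 11, 11)
signed64 = struct.pack("<IIQQ", DATA_DESCRIPTOR_SIGNATURE, 0x1234ABCD, 11, 11)
plain_fields = parse_data_descriptor(plain, 0)
signed_fields = parse_data_descriptor(signed64, 0, zip64=True)
```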
The archive decryption header 108 is part of the ZIP archive strong encryption scheme and precedes an encrypted data segment (i.e., the archive extra data record 110 and the encrypted central directory structure data 112). The archive decryption header 108 contains information relating to the encryption of the encrypted data segment 110/112 including an encryption algorithm identifier, a bit length of the encryption key, processing flags needed for decryption, etc.
The archive extra data record 110 is part of the ZIP archive strong encryption scheme, immediately precedes the central directory data structure 112 and provides a signature, an extra field length, and extra field data that is used as part of the strong encryption scheme.
The central directory structure 112 consists of a series of file headers that provide the relative offset of each local file header 102/102N as follows:
    [file header 1]
    ...
    [file header n]
Each file header in the central directory contains versioning information, the last modification date and time, compression information (for the file header), the compressed and uncompressed file sizes (i.e., of the file data 104/104N), various fields and their lengths (including a file name, extra field and file comment) as well as various file attributes. Lastly, each file header contains the relative offset of the corresponding local file header 102/102N.
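For illustration only, the file headers of the central directory 112 may be walked as sketched below in Python. The generator name `iter_central_directory` is hypothetical. The 46-byte fixed layout and the signature 0x02014b50 are assumed from APPNOTE.TXT, and the usage portion assumes an archive without an archive comment so that the End of Central Directory Record occupies the last 22 bytes.

```python
import io
import struct
import zipfile

# Fixed 46-byte portion of a central directory file header, assumed from APPNOTE.TXT.
CENTRAL_HEADER_FORMAT = "<IHHHHHHIIIHHHHHII"
CENTRAL_HEADER_SIZE = struct.calcsize(CENTRAL_HEADER_FORMAT)
CENTRAL_HEADER_SIGNATURE = 0x02014B50


def iter_central_directory(buf, offset, count):
    """Yield one dictionary per central directory file header (hypothetical helper)."""
    for _ in range(count):
        (signature, version_made_by, version_needed, flags, method,
         mod_time, mod_date, crc32, compressed_size, uncompressed_size,
         name_length, extra_length, comment_length, disk_start,
         internal_attrs, external_attrs,
         local_header_offset) = struct.unpack_from(CENTRAL_HEADER_FORMAT, buf, offset)
        if signature != CENTRAL_HEADER_SIGNATURE:
            raise ValueError("no central directory header at offset %d" % offset)
        name_start = offset + CENTRAL_HEADER_SIZE
        yield {
            "file_name": bytes(buf[name_start:name_start + name_length]),
            "compression_method": method,
            "compressed_size": compressed_size,
            "uncompressed_size": uncompressed_size,
            "external_attrs": external_attrs,
            "local_header_offset": local_header_offset,
        }
        # Skip the variable-length file name, extra field and file comment.
        offset = name_start + name_length + extra_length + comment_length


# Usage: build a two-entry archive in memory and list its central directory.
_stream = io.BytesIO()
with zipfile.ZipFile(_stream, "w") as archive:
    archive.writestr("a.txt", b"alpha")
    archive.writestr("dir/b.txt", b"bravo!")
sample = _stream.getvalue()
# Assumption: no archive comment, so the end record is the final 22 bytes.
(_, _, _, _, total_entries, _, directory_offset, _) = struct.unpack_from(
    "<IHHHHIIH", sample, len(sample) - 22)
entries = list(iter_central_directory(sample, directory_offset, total_entries))
```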
The ZIP™64 End of Central Directory Record 114 includes versioning information (for extraction of the file data 104/104N), the total number of entries in the central directory, the size of the central directory, and the offset of the start of the central directory.
The ZIP™64 End of Central Directory Locator 116 provides the location (i.e., the relative offset) of the ZIP™64 End of Central Directory Record 114.
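As a sketch under stated assumptions, the locator 116 and the record 114 may be decoded in Python as follows. The function names are hypothetical. The byte layouts and the signatures 0x07064b50 and 0x06064b50 are assumed from APPNOTE.TXT, and the sample bytes are synthetic values constructed solely for this illustration.

```python
import struct

# Layouts assumed from APPNOTE.TXT (little-endian, no padding).
ZIP64_LOCATOR_FORMAT = "<IIQI"      # signature, disk, record offset, total disks
ZIP64_LOCATOR_SIGNATURE = 0x07064B50
ZIP64_EOCD_FORMAT = "<IQHHIIQQQQ"   # fixed fields of the ZIP64 end record
ZIP64_EOCD_SIGNATURE = 0x06064B50


def parse_zip64_locator(buf, offset):
    """Return the relative offset of the ZIP64 end record (hypothetical helper)."""
    signature, disk, record_offset, total_disks = struct.unpack_from(
        ZIP64_LOCATOR_FORMAT, buf, offset)
    if signature != ZIP64_LOCATOR_SIGNATURE:
        raise ValueError("no ZIP64 locator at offset %d" % offset)
    return {"zip64_eocd_offset": record_offset, "total_disks": total_disks}


def parse_zip64_eocd(buf, offset):
    """Decode the fixed fields of the ZIP64 end record (hypothetical helper)."""
    (signature, record_size, version_made_by, version_needed, disk, directory_disk,
     entries_on_disk, total_entries, directory_size,
     directory_offset) = struct.unpack_from(ZIP64_EOCD_FORMAT, buf, offset)
    if signature != ZIP64_EOCD_SIGNATURE:
        raise ValueError("no ZIP64 end record at offset %d" % offset)
    return {
        "version_needed": version_needed,
        "total_entries": total_entries,
        "central_directory_size": directory_size,
        "central_directory_offset": directory_offset,
    }


# Usage with synthetic bytes: a record at offset 0 followed by its locator.
record = struct.pack(ZIP64_EOCD_FORMAT, ZIP64_EOCD_SIGNATURE, 44, 45, 45,
                     0, 0, 3, 3, 150, 5000000000)
locator = struct.pack(ZIP64_LOCATOR_FORMAT, ZIP64_LOCATOR_SIGNATURE, 0, 0, 1)
blob = record + locator
located = parse_zip64_locator(blob, len(record))
zip64_end = parse_zip64_eocd(blob, located["zip64_eocd_offset"])
```

The 8-byte offset field is what allows the central directory to begin beyond the range addressable by the 4-byte field of the End of Central Directory Record 118.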
The End of Central Directory Record 118 provides the total number of entries in the central directory, the size of the central directory, and the offset of the start of the central directory.
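By way of example and not limitation, the End of Central Directory Record 118 may be located by scanning backward from the end of the archive, as sketched below in Python. The function name is hypothetical. The 22-byte fixed layout, the signature bytes, and the trailing variable-length archive comment of at most 65,535 bytes are assumptions taken from APPNOTE.TXT.

```python
import io
import struct
import zipfile

# Fixed 22-byte portion of the end record, assumed from APPNOTE.TXT.
EOCD_FORMAT = "<IHHHHIIH"
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)
EOCD_SIGNATURE_BYTES = b"PK\x05\x06"
MAX_COMMENT_LENGTH = 0xFFFF


def find_end_of_central_directory(buf):
    """Locate and decode the end record (hypothetical helper).

    Caveat: an archive comment that itself contains the signature bytes
    could mislead this simple backward search.
    """
    search_start = max(0, len(buf) - EOCD_SIZE - MAX_COMMENT_LENGTH)
    position = buf.rfind(EOCD_SIGNATURE_BYTES, search_start)
    if position < 0 or position + EOCD_SIZE > len(buf):
        raise ValueError("End of Central Directory Record not found")
    (_, disk_number, directory_disk, entries_on_disk, total_entries,
     directory_size, directory_offset,
     comment_length) = struct.unpack_from(EOCD_FORMAT, buf, position)
    return {
        "record_offset": position,
        "total_entries": total_entries,
        "central_directory_size": directory_size,
        "central_directory_offset": directory_offset,
        "comment_length": comment_length,
    }


# Usage: a two-entry archive with a short archive comment.
_stream = io.BytesIO()
with zipfile.ZipFile(_stream, "w") as archive:
    archive.writestr("a.txt", b"alpha")
    archive.writestr("b.txt", b"bravo")
    archive.comment = b"demo"
sample = _stream.getvalue()
end_record = find_end_of_central_directory(sample)
```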
In view of the above, it can be seen that the location of a file 104/104N is indicated in the central directory 112, which is located at the end of the ZIP™ file. In this regard, each file data 104/104N is introduced by a local header 102/102N with information such as the compression method, file size, and file name. The central directory 112 consists of the headers holding the relative offset of the local headers 102/102N for each file. The end of central directory information 114-118 (which is at the very end of the ZIP™ file) provides the information (i.e., offset) to find the beginning of the central directory 112 so that local file header information 102/102N can be retrieved from the central directory 112.
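The lookup chain summarized above can be observed with the Python standard library, shown here purely as an illustration. The `zipfile` module reads the end of central directory information and the central directory 112 when an archive is opened, and exposes each local header offset as `ZipInfo.header_offset`. The function name `local_header_offsets` is hypothetical.

```python
import io
import zipfile


def local_header_offsets(archive_bytes):
    """Map each file name to the offset of its local file header (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        # infolist() reflects the central directory, in directory order.
        return {info.filename: info.header_offset for info in archive.infolist()}


# Usage: build a small archive, then follow each offset to its local header.
_stream = io.BytesIO()
with zipfile.ZipFile(_stream, "w") as archive:
    archive.writestr("a.txt", b"alpha")
    archive.writestr("b.txt", b"bravo")
sample = _stream.getvalue()
offsets = local_header_offsets(sample)
# Each offset should point at the local file header signature "PK\x03\x04".
signatures = {name: sample[pos:pos + 4] for name, pos in offsets.items()}
```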
As can be seen, the above described structure provides a convenient read-only, write-once archive suitable for collecting hierarchies of files and folders. However, there is no capability to modify the ZIP™ file without writing the file from start to finish. In this regard, every time a ZIP™ file is modified, the entire ZIP™ file is required to be rewritten. What is needed is the capability to easily and efficiently perform in-place editing of a ZIP™ file while complying with the ZIP™ file format specification (which is set forth in “APPNOTE.TXT—ZIP File Format Specification, Version 6.3.2, Sep. 28, 2007, by PKWare, Inc.” which is fully incorporated by reference herein).
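The rewrite-everything behavior described above may be illustrated with the following Python sketch, which replaces one member by copying every entry into a new archive. The function name `replace_member_by_rewriting` is hypothetical, and the sketch is offered only as an example of the prior art limitation, not as any particular product's implementation. For brevity it does not preserve per-entry metadata such as timestamps.

```python
import io
import zipfile


def replace_member_by_rewriting(archive_bytes, member_name, new_data):
    """Return a new archive in which `member_name` holds `new_data`.

    Every entry, changed or not, is read and written again, so the cost
    grows with the size of the whole archive rather than with the edit.
    """
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as source, \
            zipfile.ZipFile(output, "w") as target:
        for info in source.infolist():
            if info.filename == member_name:
                data = new_data
            else:
                data = source.read(info.filename)
            # Timestamps and attributes are not carried over in this sketch.
            target.writestr(info.filename, data, compress_type=info.compress_type)
    return output.getvalue()


# Usage: edit one of three members.
_stream = io.BytesIO()
with zipfile.ZipFile(_stream, "w") as archive:
    archive.writestr("a.txt", b"alpha")
    archive.writestr("b.txt", b"bravo")
    archive.writestr("c.txt", b"charlie")
original = _stream.getvalue()
updated = replace_member_by_rewriting(original, "b.txt", b"BRAVO, EDITED")
with zipfile.ZipFile(io.BytesIO(updated)) as check:
    updated_names = check.namelist()
    updated_contents = {name: check.read(name) for name in updated_names}
```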