A large variety of software programs currently exist and/or are being developed to direct computer systems in performing countless tasks. In particular, commercial software products are available that include one or more applications. For example, word processors, spreadsheets, and database management systems are application software. Some products also incorporate an operating system and may be used to deliver services without requiring any additional software once installed on the computer's hardware platform.
Generally, software products include a set of files with binaries and data, e.g. raw facts. The binaries alone, i.e. strings of 0's and 1's, exclusive of the data, of the software product are considered a software package. These binaries reside in an executable file (also referred to as a binary data file) containing a program that is capable of being executed by a computer. Unlike a source file, an executable file holds binaries and is not readily readable by humans. To transform a source file into an executable file, the file contents must be passed through a compiler or assembler, and typically a linker. It is these binaries that fill a computer's memory.
The executable file may store a list of instructions, data, information for dynamic linking and/or information for debugging. An instruction in an executable file is a basic operation, composed of an operation code (hereinafter referred to as “opcode”) and optionally one or more parameters.
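By way of illustration only, an instruction composed of an opcode and optional parameters may be modeled as in the following Python sketch. The byte layout shown (one opcode byte, one parameter-count byte, then one byte per parameter) and the names Instruction, encode and decode are hypothetical and do not correspond to any real processor or file format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Instruction:
    opcode: int                    # operation code identifying the basic operation
    params: Tuple[int, ...] = ()   # zero or more parameters


def encode(instr: Instruction) -> bytes:
    # Hypothetical toy layout: opcode byte, parameter-count byte, parameter bytes.
    return bytes([instr.opcode, len(instr.params), *instr.params])


def decode(blob: bytes) -> List[Instruction]:
    # Walk the byte string and rebuild the list of instructions.
    instructions = []
    i = 0
    while i < len(blob):
        opcode, count = blob[i], blob[i + 1]
        params = tuple(blob[i + 2:i + 2 + count])
        instructions.append(Instruction(opcode, params))
        i += 2 + count
    return instructions
```

For example, encoding Instruction(0x01, (5, 7)) followed by Instruction(0xFF) and decoding the result yields the same two instructions.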
The executable file may also include one or more symbols that reference resources inside of the file. Often, there are multiple binaries that need to work together at execution time. For example, in modular system environments both the operating system layer and the application layer are split into kernel, drivers, applications, shared libraries and/or plug-ins. Each binary publishes a list of symbol names (i.e. streams of bytes of variable length) as unique identifiers to be used for dynamic linking, where symbols exported by one binary are matched with symbols imported by another binary. Exported symbols indicate resources that are defined in the file and made visible to other files, whereas imported symbols are not defined in the file, but may be found in other files.
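The matching of imported symbols against exported symbols may be sketched as follows. This is a simplified Python model rather than the behavior of any particular linker or loader; the function name resolve_imports and the binary and symbol names in the usage example are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def resolve_imports(
    exports: Dict[str, Set[str]],
    imports: Dict[str, Set[str]],
) -> Tuple[Dict[Tuple[str, str], str], List[Tuple[str, str]]]:
    # Index every exported symbol name by the binary that defines it.
    provider: Dict[str, str] = {}
    for binary, symbols in exports.items():
        for name in symbols:
            provider.setdefault(name, binary)

    resolved: Dict[Tuple[str, str], str] = {}
    unresolved: List[Tuple[str, str]] = []
    for binary, symbols in imports.items():
        for name in sorted(symbols):
            if name in provider:
                # The imported symbol is defined in another binary.
                resolved[(binary, name)] = provider[name]
            else:
                # No binary in the set exports this symbol.
                unresolved.append((binary, name))
    return resolved, unresolved


# Hypothetical binaries and symbol names, for illustration only.
exports = {"libc": {"printf", "malloc"}, "kernel": {"sys_write"}}
imports = {"app": {"printf", "draw_window"}, "libc": {"sys_write"}}
resolved, unresolved = resolve_imports(exports, imports)
```

In this example, the symbol printf imported by app is matched to libc, sys_write imported by libc is matched to kernel, and draw_window remains unresolved because no binary exports it.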
There are several concerns inherent to executable files. One issue is that executable files often consume an enormous amount of storage space within the system. Thus, it is desirable to reduce the amount of information that is stored on computer systems. Reducing the storage footprint of information is among the oldest challenges in software engineering.
Although the rapid pace of technological progress provides us with relatively inexpensive and large storage devices, the storage spaces are not always large enough for the desired amount of information. The human imagination will always find ways to develop software with requirements that exceed hardware capabilities, consequently increasing the need to store information more efficiently.
As a result of the demand to shrink files, there is much interest in a variety of compression techniques, such as loss-less compression. Loss-less compression is the art of reducing the storage size of data without losing any information. Loss-less compression tools are designed to recognize common patterns in a general pool of data, but are not designed to make assumptions based on specific properties and regularities of the data source.
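The defining property of loss-less compression, namely that the original data is recovered exactly, may be illustrated with a general-purpose tool. The following sketch uses Python's standard zlib module; the helper name roundtrip and the sample byte pattern are assumptions chosen for illustration.

```python
import zlib
from typing import Tuple


def roundtrip(data: bytes) -> Tuple[int, int, bool]:
    # Compress at the highest level, then decompress and compare.
    packed = zlib.compress(data, 9)
    restored = zlib.decompress(packed)
    return len(data), len(packed), restored == data


# A short byte pattern repeated many times stands in for repetitive binary code.
sample = b"\x55\x89\xe5\x5d\xc3" * 400
original_size, packed_size, identical = roundtrip(sample)
```

Here the packed form is smaller than the original and decompression restores every byte. A general-purpose tool of this kind finds the repetition without any knowledge of what the bytes represent.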
Compression schemes are most efficient when the binaries have distinct, well-defined characteristics. If the binaries exhibit such characteristics, then the compression scheme best suited to those characteristics may be chosen. Typically, software packages use a specific binary file format, for a specific processor, as generated by a particular compiler or linker tool. However, most binaries have a combination of characteristics, making it difficult to select a suitable compression scheme.
Where information is decomposed into its constituents, the coding technique best suited to each constituent may be applied to improve compression performance. Unfortunately, prior techniques do not efficiently process and organize binary code because individual slices of a file are compressed without considering the contents of the file as a whole. Usually, the context of each slice is too small to determine repetitions or patterns that may be useful in optimizing compression.
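One way to picture the decomposition of information into its constituents is to separate an instruction stream into an opcode stream and a parameter stream, so that each stream may be coded on its own. The Python sketch below assumes a hypothetical fixed-width layout in which every instruction is exactly one opcode byte followed by one parameter byte; real instruction sets are more varied, and whether the separation improves compression depends on the data.

```python
from typing import Tuple


def split_streams(code: bytes) -> Tuple[bytes, bytes]:
    # Assumed layout: opcode bytes at even offsets, parameter bytes at odd offsets.
    if len(code) % 2:
        raise ValueError("expected whole two-byte instructions")
    return code[0::2], code[1::2]


def merge_streams(opcodes: bytes, params: bytes) -> bytes:
    # Inverse of split_streams: interleave the two streams again.
    if len(opcodes) != len(params):
        raise ValueError("streams must have equal length")
    merged = bytearray(2 * len(opcodes))
    merged[0::2] = opcodes
    merged[1::2] = params
    return bytes(merged)
```

Because merge_streams exactly reverses split_streams, the reorganization loses no information, and each constituent stream may then be passed to whichever coder suits it.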
Furthermore, patterns exist between multiple files, as well as inside of a single binary. However, previous compression schemes do not leverage these patterns between binaries to allow global optimization across multiple binaries. These prior systems also do not extract language-specific information. Nor do these other systems eliminate sections that are not required to load the executable file into memory, e.g. debugging information. Thus, the stored files contain information that needlessly consumes precious space within a device.
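The effect of patterns shared between binaries may be demonstrated with a small experiment. In the Python sketch below, two synthetic "binaries" contain the same block of pseudo-random bytes, standing in for code common to both, and are compressed first separately and then as one unit using the standard zlib module. The data, sizes and function names are assumptions made for illustration and do not model any real software package.

```python
import random
import zlib
from typing import List


def make_binaries(seed: int = 0) -> List[bytes]:
    rng = random.Random(seed)
    # Block shared by both synthetic binaries (e.g. common library code).
    shared = bytes(rng.randrange(256) for _ in range(2000))
    first = b"APP1" + shared + bytes(rng.randrange(256) for _ in range(200))
    second = b"APP2" + shared + bytes(rng.randrange(256) for _ in range(200))
    return [first, second]


def separate_size(binaries: List[bytes]) -> int:
    # Each binary is compressed on its own; the shared block is stored twice.
    return sum(len(zlib.compress(b, 9)) for b in binaries)


def joint_size(binaries: List[bytes]) -> int:
    # All binaries are compressed as one unit; the second copy of the
    # shared block can be coded as references back to the first copy.
    return len(zlib.compress(b"".join(binaries), 9))


binaries = make_binaries()
```

With this data, joint_size(binaries) is smaller than separate_size(binaries), because the repetition only becomes visible to the compressor when both binaries fall within the same context.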
It is often thought that the specific structure of each executable file needs to be respected because it is assumed that random access will be required to read and write the contents of an uncompressed binary file. Traditional approaches to file compression are based on this presumption that files may need to be read from or written to at any random position, for any length, and in any order.
Contrary to this assumption that random access is necessary, many executable files are only read from or written to in one block. Furthermore, these files are only written once, at the time of creation, and will never be modified after development. They will only be deleted or fully replaced by updated versions. These fixed files will always be passed to a binary loader as one raw block of data and placed into memory in one operation by the loader. Thus, the structure of a fixed executable file need not be preserved but may be processed and reorganized without compromising the utility of the binaries.
Another problem with executable files is that they may be susceptible to persons maliciously breaking into a computer system. There are many individuals who possess sufficient technical knowledge to understand the weak points in a security system. It is of crucial importance that executable files provide security measures that deter hackers from gaining unauthorized access to computer systems for the purpose of stealing or corrupting data.
In light of the shortcomings of the various currently available systems, there is still a need for optimization of compression across multiple binaries of a software package. In particular, there is an interest in a compression system that organizes the contents of binaries according to patterns and eliminates unnecessary sections. Moreover, the system should provide security measures to reduce vulnerability to hackers or crackers.