Patterns of data are described using models, which allow data processing methods to remove information redundancy for lossless and lossy data compression, and to reduce the number of calculations required for pattern recognition, data generation and data encryption. Four basic data modeling techniques exist in the known art: statistical modeling, dictionary coding, combinatorial coding and mathematical functions.
Statistical modeling determines a probability for a state or symbol based on the number of times the symbol occurs. The probabilities are recorded in an index, which can then be accessed by a decoder to decipher the message. An encoder can generate a more efficient code by implementing the model. This is why Morse code uses short codes for frequent letters such as “E, T, A and I” and long codes for rare letters such as “X, Y, Z” and for the digits “0-9”, for the frequently used letters of the English alphabet are modeled with a higher probability than rarely used letters and numbers.
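The relationship between symbol counts and code length described above can be sketched with a Huffman code. This is only an illustration under stated assumptions: the function names `huffman_code`, `encode` and `decode` and the sample message are hypothetical, and Huffman coding is one of several ways to turn a statistical index into a code.

```python
import heapq
from collections import Counter

def huffman_code(message):
    """Build a prefix code from symbol counts; frequent symbols get shorter codes."""
    counts = Counter(message)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries are (count, unique tie breaker, {symbol: partial code}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n0, _, left = heapq.heappop(heap)
        n1, _, right = heapq.heappop(heap)
        # Merge the two least frequent groups, prefixing one bit to each side.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n0 + n1, tie, merged))
        tie += 1
    return heap[0][2]

def encode(message, code):
    return "".join(code[s] for s in message)

def decode(bits, code):
    # The decoder needs the same index (here, the code table) as the encoder.
    inverse = {c: s for s, c in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

message = "eeeeeeeetttaaz"
code = huffman_code(message)
bits = encode(message, code)
```

For this sample message the most frequent symbol, "e", receives a 1-bit code and the rarest, "z", a 3-bit code, and `decode(bits, code)` returns the original message.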
Assigning probability values also enables a pattern recognition method to reduce the number of states to select from when matching or recognizing data. For voice or speech recognition, a hidden Markov model is commonly used, which can use an index to assign probabilities to the possible outcomes.
The second technique typically used for modeling data is the dictionary coder, which records patterns of strings by assigning a reference to the string's position. To eliminate redundancy, the number of references must be less than the number of possible patterns for a string, given its length. The references substitute for the strings to create an encoded message. To reconstruct the message, the decoder reads the reference, looks it up in the dictionary and writes the corresponding string.
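The substitution of references for strings can be sketched as follows. The sketch assumes an LZ78-style scheme, chosen here only for illustration; the text above does not name a specific dictionary coder, and the names `lz78_encode` and `lz78_decode` are hypothetical.

```python
def lz78_encode(text):
    """Return (dictionary_index, next_char) pairs; index 0 means the empty phrase."""
    dictionary = {}   # phrase -> 1-based position in the dictionary
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch              # keep extending a known phrase
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Flush a trailing phrase that is already in the dictionary.
        output.append((dictionary[phrase], ""))
    return output

def lz78_decode(pairs):
    """Rebuild the text by looking up each reference and appending its character."""
    phrases = [""]    # position 0 holds the empty phrase
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)

text = "abababababa"
pairs = lz78_encode(text)
```

Here the 11-character input is represented by 6 references, and `lz78_decode(pairs)` reproduces the input. In this variant the decoder rebuilds the dictionary as it reads, so the dictionary itself is never transmitted.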
The decoder must access the statistical index or dictionary to decode the message. To allow for this access, the index/dictionary can be appended to the encoded message. Appending the index or dictionary may not add much to the compressed code if a relatively small number of states are modeled. If the number of patterns in the index or dictionary is too large, then any advantage gained by compressing the data can be eliminated after the statistical index or dictionary is appended.
In the known art, an adaptive index or dictionary can be used to solve the problem of appending it to the encoded message. In adaptive modeling, both the encoder and the decoder use the same statistical model or dictionary at the start of the process. Each then reads every new string and updates its copy of the model as the data is being encoded or decoded. This helps to improve compression ratios, for the model is not appended to the message.
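A minimal sketch of how an encoder and decoder can stay synchronized without transmitting the model is given below. It assumes a hypothetical rank-based scheme (each symbol is sent as its position in the current frequency ranking) rather than adaptive Huffman or arithmetic coding; the names `AdaptiveModel`, `encode` and `decode` are illustrative only.

```python
class AdaptiveModel:
    """Symbol counts that both sides initialize identically and update in step."""

    def __init__(self, alphabet):
        self.counts = {s: 1 for s in alphabet}

    def ranking(self):
        # Most frequent symbol first; ties are broken by the symbol itself
        # so that encoder and decoder always agree on the order.
        return sorted(self.counts, key=lambda s: (-self.counts[s], s))

    def update(self, symbol):
        self.counts[symbol] += 1

def encode(message, alphabet):
    model = AdaptiveModel(alphabet)
    ranks = []
    for s in message:
        ranks.append(model.ranking().index(s))   # small ranks for frequent symbols
        model.update(s)                          # model changes after every symbol
    return ranks

def decode(ranks, alphabet):
    model = AdaptiveModel(alphabet)
    out = []
    for r in ranks:
        s = model.ranking()[r]
        out.append(s)
        model.update(s)                          # same update as the encoder
    return "".join(out)

alphabet = "abcdr"
ranks = encode("abracadabra", alphabet)
```

`decode(ranks, alphabet)` returns "abracadabra". The sketch re-sorts the whole model for every symbol, which is deliberately naive and mirrors the per-symbol update cost of adaptive schemes.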
Three main problems exist when using an adaptive index or dictionary. The first is that it is relatively inefficient near the beginning of the data stream. This is because the encoder starts with a small number of patterns in the model. The smaller the number of patterns modeled, the less accurate the model is relative to the size of the message. Its efficiency may improve as the adaptive model grows and the number of patterns increases. The second, and more significant, problem is that, like the static index or dictionary, the adaptive model must be constructed and stored in memory. Modeling more patterns can deplete memory resources when the index or dictionary becomes too large. More calculations are also required to update the index/dictionary. The third problem is that adaptive modeling is more computationally expensive than static indexes or dictionaries because the model must be constantly updated as the encoder encodes and the decoder decodes the message. For example, using adaptive compression with Huffman tree codes requires the data processor to continuously update the nodes and branches of the code tree as it encodes/decodes. For arithmetic coding, updating the probability for a symbol pattern requires updating the cumulative counts for all the subsequent symbol patterns as well. Calculating the probabilities for a large index can take the processor a considerable amount of time, for the number of possible patterns rises exponentially with each bit added. The adaptive technique can therefore slow a device that requires frequent encoding/decoding of data, such as medical data, audio, video or any other data that requires rapid access. This can be especially problematic for mobile devices, which typically have less memory and processing power than personal computers and database machines.
The third modeling technique involves combinatorial encoding. As exemplified in U.S. Pat. No. 7,990,289 to Monro titled “Combinatorial Coding/Decoding for Electrical Computers and Digital Data Processing Systems” filed Jul. 12, 2007, combinatorial encoding counts the number of times a symbol appears in a sequence and generates a code describing its pattern of occurrences. This method can be effective for text documents, where there is usually a statistical bias in the counts for each symbol, or when the numbers of occurrences are predetermined or known to the decoder. This statistical bias may be used in combinatorial coding to compress data.
A problem with this method is that its effectiveness may lessen if there is no statistical bias, or when the counts are relatively equal and unknown to the decoder. When the counts for each symbol reach parity, the number of combinations is at its highest, resulting in very little compression. Also, the number of occurrences of each symbol needs to be, like an index/dictionary, accessible to the decoder in order to decode the encoded message describing the pattern of occurrences. As with appending a dictionary, any compression gained by encoding the pattern of occurrences can be nullified if the index describing the number of occurrences for each symbol is too large.
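The effect of count parity can be made concrete with a generic enumerative code for binary sequences. This is an illustrative sketch, not the method of the cited patent: it assumes the decoder already knows the sequence length `n` and the number of ones `k`, and the names `rank`, `unrank` and `code_bits` are hypothetical.

```python
from math import comb

def rank(bits):
    """Index of `bits` among all sequences with the same length and number of ones."""
    n, k = len(bits), sum(bits)
    r = 0
    for i, b in enumerate(bits):
        if b == 1:
            # Skip every sequence that has a 0 here and k ones in the remainder.
            r += comb(n - i - 1, k)
            k -= 1
    return r

def unrank(r, n, k):
    """Inverse of rank(); requires the counts n and k to be known to the decoder."""
    bits = []
    for i in range(n):
        zeros_first = comb(n - i - 1, k)   # sequences with a 0 at position i
        if r >= zeros_first:
            bits.append(1)
            r -= zeros_first
            k -= 1
        else:
            bits.append(0)
    return bits

def code_bits(n, k):
    """Bits needed to index one of the comb(n, k) possible patterns."""
    return (comb(n, k) - 1).bit_length()

pattern = [0, 1, 1, 0]
index = rank(pattern)
```

For a 16-symbol binary sequence, `code_bits(16, 1)` is 4 while `code_bits(16, 8)` is 14: when the two symbols occur equally often, the index of the pattern costs almost as much as the 16 raw bits, before the counts themselves are even transmitted.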
The fourth modeling technique involves signal processing, which includes the modeling of waveforms and time series data using mathematical functions. Such methods usually extract three basic patterns from the data: trends, cycles and white noise. Trends describe a gradual tendency for the signal to increase or decrease over time. Cycles describe repeating patterns in the data, such as frequencies. White noise is considered random-like data, which offers no discernible patterns. These three pattern types can be calculated within a time, frequency or time-frequency domain. Such pattern extraction techniques include autocorrelation, Fourier analysis and wavelets. By using mathematical functions to decipher the patterns of a signal, the inverse of the functions can either approximate the signal or reconstruct it. These models can then be used for analyzing and forecasting stock prices and for lossy data compression.
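As a small illustration of extracting a cycle with Fourier analysis, the sketch below computes a naive discrete Fourier transform and reports the strongest frequency bin. It is a toy under stated assumptions: the names `dft` and `dominant_cycle` and the test signal are hypothetical, and practical systems would use a fast Fourier transform rather than this quadratic-time loop.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform (O(n^2)); returns complex coefficients."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def dominant_cycle(samples):
    """Frequency bin (cycles per window) with the largest magnitude."""
    mean = sum(samples) / len(samples)
    spectrum = dft([v - mean for v in samples])   # remove the constant offset first
    half = len(samples) // 2
    return max(range(1, half + 1), key=lambda k: abs(spectrum[k]))

# A signal that repeats 3 times over a 32-sample window, shifted by a constant.
signal = [math.sin(2 * math.pi * 3 * t / 32) + 0.5 for t in range(32)]
cycles = dominant_cycle(signal)
```

For this signal `cycles` is 3. The result describes a general property of the data (its dominant frequency) rather than any individual sample value.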
The problem associated with using mathematical functions as a model is that the functions tend to identify only general properties of the data. Reconstructing finer details of the signal is computationally expensive. Secondly, they offer no known way of deterministically generating a probability for a unique data sequence, for probabilities are not incorporated into the calculation. Probability values are required to measure the information entropy of a sequence. Therefore, these techniques are generally used for approximating signals using lossy data compression or forecasting, not for lossless data compression, particularly involving literal data such as text or machine code.
One may see that a fundamental problem for all four modeling techniques is that they are expensive in both memory and computation whenever their models describe the probabilities for large numbers of states, long sequences of data or data with high entropy. Modeling a large number of states increases the model's efficiency, but it also takes a toll on the data processor and memory. For example, with pattern recognition, the hidden Markov model, coupled with a dynamic programming technique, becomes more computationally intensive the more outcomes there are to solve for. The only way to reduce the computational complexity of pattern recognition and data compression using the modeling techniques in the known art is to reduce the number of patterns being modeled.
Lossy encoders, however, attempt to encode the general meaning of a message, similarly to signal analysis. For example, JPEG compression methods analyze the brightness and chrominance of each pixel and attempt to find statistical redundancy in those values. Because humans perceive more detail in the brightness of an image than in its hues, the chrominance can be downsampled, which eliminates some of the information by using a statistical model tailored to the way humans perceive light.
Lossy encoding generally achieves higher compression ratios than lossless methods because the finer details of the message are not as important as the message's broader values. Human visual perception is not based on a pixel by pixel scan of an image, but on a wide view of the image. This is also the case with sound and language. Humans can usually understand a sentence without requiring each letter or word to be accurate. Most people will be able to understand the following sentence in the English language, without all the letters or words being accurate: “I lov yu.”
The problem with lossy compression is that it sacrifices information in order to encode or process the data. If too much information is sacrificed, the image, video or sound quality is degraded, which is undesirable for many applications, such as high definition video or audio. Even with lossy compression, the amount of data to store, process and transmit for video/audio files can reach into the trillions of bytes. Mathematical functions used in many lossy encoders/decoders require the use of graphic accelerators, faster data processors, larger amounts of memory and higher bandwidth, which is not always feasible, especially for mobile and embedded devices. Another problem with lossy compression is that it cannot be used for all types of data. Every byte of an executable must be precise and without loss; otherwise, the intended meaning, which is the sequence of machine instructions, cannot be accurately processed by the data processor.
There is a wide variety of techniques in the known art that use visual aids to identify data patterns. For example, a time series chart plots data values within a two dimensional graph, which enables human beings to see the structure of the time series over a period of time. The goal of the time series chart is to identify patterns in a visual way. Lines, curves and moving averages may be used to plot the data points in the time series within the graph. Models are then fitted to the data points to help determine various patterns.
A problem with using charts to model patterns is that they tend to use only two or three dimensions, for they are intended as an aid to a human being. They are typically not used for determining the probabilities of sequences and other characteristics in more abstract spaces, such as topological spaces, non-Euclidean spaces, or spaces with four or more dimensions. Many of the data processing methods in the known art still process data as a sequence of variables, not as a shape.
Models are also used in computer simulation to generate data. It is not trivial for a data processing machine to simulate true randomness, though it can generate pseudo-randomness, which is a simulation of randomness using techniques such as a stochastic or Markov process. This is a problem for encryption, where random data is used to help encrypt a digital message for security purposes. To solve this problem of simulating random data by a data processor, known methods have used natural phenomena, such as weather storms or radioactive decay to assist with generating randomness. The problem with using natural phenomena to generate random data is that a data processing system is required to have additional machines that incorporate data from the natural phenomena, which may not always be available or practical.
One of the biggest problems in the known art regarding data processing is the theoretical limit of data compression. Shannon's entropy states that the more unpredictable an outcome is, the more data is required to describe it. The basic model in information theory treats the symbols of a message as independent and identically distributed random variables. This means that one symbol cannot affect the probability of another, for they are independent, that is, unconnected. For example, in information theory, the probability of a fair coin landing on heads or tails is considered to be 0.5 and remains constant for each flip, no matter how many flips are made. Using Shannon's entropy, it is considered to be impossible to encode, on average, the outcome of a fair coin in fewer than 1 bit. Compression methods are incapable of compressing random-like data because its average entropy is at maximum. Therefore, when redundant data is eliminated from a message, the probability distribution associated with the variable usually approaches a uniform distribution, for all possible outcomes are considered equally probable. A theoretical limit of compression and computation of data is generally accepted in the known art.
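The fair coin bound can be checked numerically with the standard Shannon entropy formula, H = -sum(p * log2(p)). The function name `entropy_bits` is a hypothetical choice for this sketch.

```python
from math import log2

def entropy_bits(probabilities):
    """Shannon entropy in bits per symbol for a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

fair_coin = entropy_bits([0.5, 0.5])      # maximum for two outcomes
biased_coin = entropy_bits([0.9, 0.1])    # more predictable, so fewer bits
four_way = entropy_bits([0.25] * 4)       # uniform over four outcomes
```

`fair_coin` evaluates to 1.0 bit and `four_way` to 2.0 bits, while `biased_coin` is about 0.47 bits: only a non-uniform distribution leaves room for a code that averages fewer bits than the raw symbol size.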
In fact, it is not possible to compress all possible files or states using a single statistical model, as stated by the pigeonhole principle, for there cannot exist an injective function that takes a larger finite set to a smaller set. In other words, four pigeons cannot each occupy a hole of their own when there are only three holes. When variables are considered to be mutually independent and all their possible states are treated as equally likely, then all possible sequences comprising mutually independent variables are also equally likely. This relates to the accepted idea that random data is a condition in which all possible sequences are equally probable and which cannot be compressed without loss. Data compression methods in the known art are left at an impasse. This is the case for all high entropy data, such as a binary executable, simulation data, compressed data, encrypted data or simply random data based on a source of natural phenomena, such as radioactive decay.
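The counting argument behind the pigeonhole principle can be verified directly for bit strings. The sketch below uses hypothetical function names and only counts possibilities; it does not model any particular compressor.

```python
from itertools import product

def strings_of_length(n):
    """Number of distinct bit strings of exactly n bits, by enumeration."""
    return sum(1 for _ in product("01", repeat=n))

def strings_shorter_than(n):
    """Number of distinct bit strings of length 0 to n - 1."""
    return sum(2 ** length for length in range(n))

n = 8
inputs = strings_of_length(n)          # every possible 8-bit file
outputs = strings_shorter_than(n)      # every possible shorter output
```

With `n = 8` there are 256 possible inputs but only 255 strictly shorter outputs, so any scheme that shortens every 8-bit input must map at least two inputs to the same output and cannot be decoded without loss.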
U.S. Pat. No. 6,411,228 to Malik titled “Apparatus and method for compressing pseudo-random data using distribution approximations” filed Sep. 21, 2000, however, describes a method and apparatus that claims it can compress pseudo-random data by implementing a stochastic distribution model. In its claims, the stochastic distribution model is compared with the pseudo-random data. The difference data between the two is claimed to be generally less random than both the stochastic distribution and the pseudo-random data. It is claimed in the patent that the difference data can therefore be compressed. The difference data is included with the values required to generate the stochastic distribution, such as a seed value, which together allow a decoder to generate the original pseudo-random file.
The problem with this method is that the process that compares the stochastic model to the pseudo-random data is computationally expensive, for the process must compare a large number of stochastic models in order to find a “best” fit, which is the selection process that leads to the method generating the difference data. Another hurdle in using stochastic models for encoding pseudo-random data is that the number of bits needed to describe the seed value that generates the pseudo-random data may be as high as the number of bits required to describe the pseudo-random data itself. In addition, the stochastic models may not always match random data well enough, for the models are generated by computer simulation and not from natural phenomena.
What is needed is a method, article comprising machine instructions and apparatus that can efficiently model the statistics of large sequences of data, analyze their patterns from a broad view, determine their probabilities, eliminate their redundancy and reduce the average entropy without loss. Because statistical models are the starting point for most data processing techniques, any way that allows a data processor to reduce the overall complexity of said models would result in an increase in speed and accuracy of data processing, such as for pattern recognition of human language, pictures and sounds. It would also allow for the transmission and storage of more data in less time, bandwidth, and space, as well as allow for the determination of a probability value for a sequence at a future time using forecasting, and for efficient data generation, such as random data without requiring devices to read naturally chaotic phenomena found in nature.
Unless the context dictates the contrary, all ranges set forth herein should be interpreted as being inclusive of their endpoints and open-ended ranges should be interpreted to include commercially practical values. Similarly, all lists of values should be considered as inclusive of intermediate values unless the context indicates the contrary.