The data produced by an information source may be viewed as a random realization produced from a certain probability distribution that is a unique characteristic of that particular source. Different sources will produce realizations of the data from distinct underlying probability distributions.
An information source is said to be producing sparse data if a typical realization of its data, when transformed by a fixed orthonormal transformation that is a characteristic property of that source, consists of only up to s non-zero values. The source is then said to be “s-sparse under that orthonormal transformation” or “s-sparse in the basis of that orthonormal transformation”. As a special case, a source can be sparse under the identity orthonormal transformation, which leaves the data unchanged, and in such a case the source is said to be “s-sparse in its own domain”.
For example, if the source produces vectors of dimensionality 10000, that is, vectors having 10000 elements, but a typical realization of the vector has only up to 10 elements with a non-zero value, then that source may be considered to be sparse, or more accurately 10-sparse, in its own domain. On the other hand, if a typical realization of the vector, when transformed by the Fourier transform, has only up to 10 non-zero entries, then the source is said to be 10-sparse in the Fourier or frequency domain. It is important to note that it is not generally known a priori which elements of a realization, in its own domain or after a fixed transformation, will be non-zero. It also may not always be known a priori what the associated orthonormal transformation is. Typically, only the sparsity of the source, s, or at least an upper bound on it, is known with some certainty.
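The two cases in this example can be illustrated with a short numerical sketch. It assumes NumPy and uses the unitary discrete Fourier transform (norm="ortho") as the orthonormal transformation; the helper name count_nonzero, the tolerance, and the randomly chosen supports and values are hypothetical choices made for illustration only.

```python
import numpy as np

def count_nonzero(v, tol=1e-9):
    """Count entries whose magnitude exceeds a small tolerance (assumed value)."""
    return int(np.sum(np.abs(v) > tol))

rng = np.random.default_rng(0)
n, s = 10_000, 10  # dimensionality and sparsity from the example above

# Case 1: a realization that is s-sparse in its own domain.
x = np.zeros(n)
support = rng.choice(n, size=s, replace=False)   # positions not known a priori
x[support] = rng.uniform(1.0, 2.0, size=s)       # illustrative non-zero values
own_domain_count = count_nonzero(x)

# Case 2: a realization that is s-sparse in the Fourier domain.
# Build it by placing s non-zero coefficients in the spectrum and
# applying the inverse unitary DFT, which yields a dense vector.
spectrum = np.zeros(n, dtype=complex)
freq_support = rng.choice(n, size=s, replace=False)
spectrum[freq_support] = rng.uniform(1.0, 2.0, size=s)
y = np.fft.ifft(spectrum, norm="ortho")

time_count = count_nonzero(y)                             # dense in its own domain
freq_count = count_nonzero(np.fft.fft(y, norm="ortho"))   # sparse after the transform

print(own_domain_count, time_count, freq_count)
```

In this sketch the vector y has far more than s non-zero elements in its own domain, yet only s non-zero entries after the transformation, which is the sense in which a source can be sparse in the frequency domain without being sparse in its own domain.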
Although sparsity is, strictly speaking, a property of a random information source, it is accepted terminology in the field to say that its data is sparse, where the data is implicitly presumed to be a random variable. It is not meaningful to talk of the sparsity of a single deterministic realization of data, since any deterministic realization is always sparse in its own basis.
A characteristic of sparse data is that it may be easily compressed. The compressed data may be used as a signature of the data for data analysis purposes, or may be subsequently decompressed prior to use, effectively recreating the original sparse vector.
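As a minimal illustration of why data that is sparse in its own domain compresses easily, the sketch below stores only the positions and values of the non-zero elements and later rebuilds the full vector. This index-value encoding is one simple, well-known scheme chosen for illustration; it is not presented as the compression technique this document is concerned with, and the function names compress and decompress are hypothetical.

```python
import numpy as np

def compress(x):
    """Keep only the indices and values of the non-zero elements."""
    idx = np.flatnonzero(x)
    return idx, x[idx], x.size

def decompress(idx, vals, n):
    """Rebuild the full-length vector from the stored indices and values."""
    out = np.zeros(n, dtype=vals.dtype)
    out[idx] = vals
    return out

# A 10000-element vector with three non-zero elements (illustrative values).
x = np.zeros(10_000)
x[[3, 500, 9_999]] = [1.5, -2.0, 7.0]

idx, vals, n = compress(x)
restored = decompress(idx, vals, n)

print(len(idx), np.array_equal(restored, x))
```

The compressed form here holds on the order of s numbers rather than the full dimensionality of the vector, and decompression recreates the original sparse vector exactly.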
A common example of compression is that of compressing an image. The image data may be compressed prior to transmission over a network and later decompressed for display without impacting, or with only an acceptable impact on, the information to be conveyed, that is, the image. The compressed image may be considered a signature of the image and may be used as a representation of the data. For example, the compressed data of an image could be used as a fingerprint of the uncompressed image.
It is desirable to have a technique for generating a compressed representation of high-dimensionality sparse data that does not require a large memory allocation in order to calculate the compressed representation. Moreover, if the data is sparse in its own domain, it is desirable to exploit this property to also reduce the number of computations needed to compute the signature to O(s).