The data produced by an information source may be viewed as a random realization produced from a certain probability distribution that is a unique characteristic of that particular source. Different sources will produce realizations of the data from distinct underlying probability distributions.
An information source is said to produce sparse data if a typical realization of its data, when transformed by a fixed orthonormal transformation that is a characteristic property of that source, has at most s non-zero values. The source is then said to be “s-sparse under that orthonormal transformation” or “s-sparse in the basis of that orthonormal transformation”. As a special case, a source can be sparse under the identity orthonormal transformation, which leaves the data unchanged, and in such a case the source is said to be “s-sparse in its own domain”.
For example, if the source produces vectors of dimensionality 10000, that is, vectors having 10000 elements, but a typical realization of the vector has at most 10 elements with a non-zero value, then that source may be considered to be sparse, or more accurately 10-sparse, in its own domain. On the other hand, if a typical realization of the vector, when transformed by the Fourier transform, has at most 10 non-zero entries, then the source is said to be 10-sparse in the Fourier or frequency domain. It is important to note that it is generally not known a priori which elements of a realization, in its own domain or after a fixed transformation, will be non-zero. It also may not always be known a priori what the associated orthonormal transformation is. Typically, only the sparsity of the source, s, or at least an upper bound on it, is known with some certainty.
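The two cases above can be illustrated with a minimal sketch, offered under stated assumptions rather than as part of any described method. It uses a small dimensionality (n = 256 instead of 10000) and s = 4 so that a naive transform runs quickly, and it uses the unitary discrete Fourier transform, which is the complex-valued analogue of an orthonormal transformation. The names `count_nonzero` and `unitary_dft`, the tolerance, and the chosen frequencies are all hypothetical choices made for illustration.

```python
import cmath
import math
import random


def count_nonzero(values, tol=1e-9):
    """Count entries whose magnitude exceeds a small tolerance."""
    return sum(1 for v in values if abs(v) > tol)


def unitary_dft(x):
    """Naive O(n^2) discrete Fourier transform, scaled by 1/sqrt(n) so it is unitary."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


n, s = 256, 4              # illustrative sizes (assumption, not from the text)
rng = random.Random(0)     # fixed seed so the sketch is repeatable

# Realization A: s-sparse in its own domain.
# The positions of the non-zero entries are random, i.e. not known in advance.
a = [0.0] * n
for i in rng.sample(range(n), s):
    a[i] = rng.uniform(1.0, 2.0)

# Realization B: dense in its own domain, but s-sparse in the Fourier domain.
# Each real cosine at an integer frequency f occupies two DFT bins (f and n - f).
freqs = (5, 20)
b = [sum(math.cos(2 * math.pi * f * t / n) for f in freqs) for t in range(n)]

nnz_a = count_nonzero(a)
nnz_b_time = count_nonzero(b)
nnz_b_freq = count_nonzero(unitary_dft(b))

print("A, own domain:     ", nnz_a, "non-zero of", n)
print("B, own domain:     ", nnz_b_time, "non-zero of", n)
print("B, Fourier domain: ", nnz_b_freq, "non-zero of", n)
```

In this sketch, realization B has almost all of its entries non-zero in its own domain, yet only four non-zero coefficients after the transform, which is the sense in which a source can be sparse in the Fourier domain without being sparse in its own domain.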
Although sparsity is, strictly speaking, a property of a random information source, it is accepted terminology in the field to say that its data is sparse, where the data is implicitly presumed to be a random variable. It is not meaningful to talk of the sparsity of a single deterministic realization of data, since any deterministic realization is always sparse in its own basis.
A characteristic of sparse data is that it may be easily compressed, and the compressed form may be used as a signature of the data for data analysis purposes. Data may also include repetitive or synonymous information, which increases the memory and computation required to generate the signature. It is desirable to have a technique for generating a compressed representation of high-dimensionality data that does not require a large memory allocation to pre-calculate and store the data needed to generate the signature, and in particular a technique that utilizes the properties of the synonymous information in the data.
Therefore there is a need for an improved method and computing device for signature representation of data with aliasing across synonyms.