The present invention relates to digital signal processing. It provides a representation of digital signals which can be used in a variety of applications including pattern detection, clustering, classification and registration.
For one dimensional or multidimensional signals, pattern detection, clustering, classification and registration problems require computing a reliable distance that measures the similarity of signals despite the existence of deformations.
For example, if one signal is a translated copy of another, their Euclidean distance may be very large even though the two signals are identical up to this translation. In images, such deformations include non-rigid translations, rotations and scaling (zooming in and out). For sounds, they include frequency transpositions and scaling.
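The effect of a translation on the Euclidean distance can be illustrated with a minimal sketch. The signal length, the pulse position, the shift amount and the helper names `euclidean_distance` and `circular_shift` are assumptions chosen for illustration only; the translation is modeled here as a circular shift.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def circular_shift(x, k):
    """Return a copy of x shifted to the right by k samples, with wrap-around."""
    k %= len(x)
    return x[-k:] + x[:-k] if k else list(x)

# A narrow unit pulse (illustrative values, not taken from any real data).
signal = [0.0] * 16
signal[4] = 1.0

shifted = circular_shift(signal, 3)   # same pulse, translated by 3 samples
silence = [0.0] * 16                  # a signal containing no pulse at all

d_shift = euclidean_distance(signal, shifted)
d_zero = euclidean_distance(signal, silence)

# The translated copy ends up farther from the original than the empty signal.
print(d_shift, d_zero)
```

In this toy case the translated copy, which carries exactly the same pattern, is measured as less similar to the original than a signal that contains no pattern at all.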
Pattern detection applications search a given signal for the potential locations of a family of predefined patterns, which may appear with deformations. Detecting a human face in an image, despite its variability due to different morphologies, pose and/or scaling, is an example of a pattern detection problem which requires using a distance that is not sensitive to such deformations.
A clustering problem in a signal involves segmentation to group signal patches that are similar relative to some distance. Texture segmentation is an example of a clustering problem, which consists in dividing images into homogeneous texture regions. A clustering problem may also apply to several signals which need to be grouped into homogeneous classes.
A signal classification problem consists in finding the class to which a signal belongs. Classifying signals such as sounds, images, video signals or medical signals requires some measurement of their similarity. Signals that belong to the same class may differ by significant deformations. For the signals to be classified appropriately, such deformations should not alter their similarity distance.
Pattern registration is about recovering a deformation that maps a signal onto another signal. In stereo images, this deformation, also called disparity, carries depth information on the scene. In videos, such deformation provides the optical flow. In medical imaging, recovering this deformation allows registration of medical data to analyze potential anomalies.
Standard distances such as the Euclidean distance do not reliably measure the similarity of signals that have undergone some deformation, such as a translation. Indeed, the Euclidean distance between a signal and its translated version is often very large. Instead of measuring a Euclidean distance between the signals themselves, one may apply such a distance to a signal representation that is constructed to preserve important signal information while being insensitive to other signal properties. A major difficulty of signal detection, clustering, classification and registration then lies in computing an appropriate signal representation over which state-of-the-art detection, clustering, classification and registration algorithms may be applied using standard distances such as Euclidean distances.
Spectrograms are widely used in the field of speech recognition. They are obtained by removing the complex phase of the Fourier transform of an input audio signal restricted by a time window function. Indeed, the complex phase of a Fourier transform can be interpreted as a translation parameter, which is removed by the complex modulus operation. However, spectrograms are local Fourier transforms and are thus sensitive to translations that are large relative to the window size and to other deformations such as a signal scaling.
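The role of the modulus can be checked on a small example. The sketch below is a simplification and not a spectrogram: it uses a global discrete Fourier transform with a circular shift rather than a windowed transform, and the signal values and the helper name `dft_modulus` are assumptions made for illustration. A windowed spectrogram has this property only approximately, for shifts that are small relative to the window size.

```python
import cmath

def dft_modulus(x):
    """Modulus of the discrete Fourier transform of x (direct O(n^2) evaluation)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]

# Illustrative signal and a copy of it circularly shifted by 2 samples.
signal = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 0.0, 0.0]
shifted = signal[-2:] + signal[:-2]

mod_a = dft_modulus(signal)
mod_b = dft_modulus(shifted)

# The shift only multiplies each Fourier coefficient by a unit-modulus phase
# factor, so the two modulus vectors agree up to rounding error.
max_diff = max(abs(a - b) for a, b in zip(mod_a, mod_b))
print(max_diff)
```

The two signals differ sample by sample, yet their Fourier moduli coincide, because the translation is carried entirely by the phase.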
Multiscale transformations, such as wavelet transforms, have also been used for pattern recognition. In particular, a so-called scalogram, obtained with a complex wavelet transform followed by a modulus computation, is a representation which is used for pattern recognition in U.S. Pat. No. 7,171,269. However, at fine scales, wavelet coefficients are sensitive to translations, and at coarse scales they do not carry enough information to discriminate signals.
In “A Steerable Complex Wavelet Construction and Its Applications to Image Denoising”, IEEE Transactions on Image Processing, Vol. 14, No. 7, July 2005, pp. 948-959, A. A. Bharath, et al. disclose a pyramid structure for a filter bank used for sub-band decomposition. The basic pyramidal unit has an isotropic lowpass filter followed by a downsampler, and complex bandpass filters. The structure extends over successive sub-band decomposition stages, the subsampled lowpass component being passed to the following stages. A similar kind of pyramid structure is presented in Portilla, et al., “A Parametric Texture Model Based on Joint Statistics of Complex Wavelet Coefficients”, International Journal of Computer Vision, Vol. 40, No. 1, October 2000, pp. 49-71.
Neural networks are also used in the context of pattern recognition. Multilayer neural networks define a cascade of linear and non-linear transformations that can construct efficient signal representations for pattern recognition. In particular, the convolution networks disclosed in “Large-Scale Learning with SVM and Convolutional Nets for Generic Object Categorization” (Fu-Jie Huang and Yann LeCun, Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 284-291) perform a cascade of convolutions with real filters followed by some non-linear operations that typically include an absolute value and a sigmoid transformation. However, these neural networks require a heavy training phase to adjust the filters to the type of patterns encountered in the recognition task.
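The general shape of such a cascade can be sketched in one dimension as follows. This is not the network of the cited paper: the filter coefficients `h1` and `h2`, the input signal and the helper names are hypothetical, and the filters are fixed by hand here, whereas an actual convolution network would learn them during training.

```python
import math

def convolve_valid(x, h):
    """Discrete convolution of x with filter h, keeping only fully overlapping outputs."""
    n, m = len(x), len(h)
    return [sum(x[i + j] * h[m - 1 - j] for j in range(m)) for i in range(n - m + 1)]

def sigmoid(v):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))

def layer(x, h):
    """One stage: convolution with a real filter, then absolute value, then sigmoid."""
    return [sigmoid(abs(v)) for v in convolve_valid(x, h)]

# Hypothetical input and hand-picked filters (a trained network would learn these).
signal = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
h1 = [1.0, -1.0]   # difference filter, responds to edges
h2 = [0.5, 0.5]    # two-tap averaging filter

# Two stages applied in cascade.
out = layer(layer(signal, h1), h2)
print(len(out), out)
```

Each stage shortens the signal by the filter length minus one, so the ten-sample input yields eight output values after the two stages.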
Accordingly, there is a need to find a generic signal representation that builds strong invariants to various deformations. It is desirable to implement such signal representation using standard band-pass filters which do not depend upon the specific pattern properties, and to avoid a learning stage.