Today's computers can generate and manipulate large volumes of raw data. Yet such data may be meaningful only when it is mathematically simplified (e.g., classified) or described by models. A model may describe the raw data with a reduced number of variables, and thereby allow the data to be advantageously compressed and processed. The modeling of data can be based on theory (e.g., physics) or on statistical analysis. Common statistical analysis methods used for data modeling include maximum likelihood estimation, principal component analysis (PCA), and discriminative methods.
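As an illustration of describing raw data with a reduced number of variables, the following is a minimal PCA sketch, assuming NumPy is available. The function names `pca_compress` and `pca_reconstruct` and the synthetic data are hypothetical and chosen only for this example; they do not come from any particular system.

```python
import numpy as np

def pca_compress(X, k):
    """Project the rows of X onto their top-k principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                      # center the data
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                # (k, d) orthonormal basis
    codes = Xc @ components.T          # (n, k) reduced description
    return codes, components, mean

def pca_reconstruct(codes, components, mean):
    """Map the reduced description back to the original space."""
    return codes @ components + mean

# Hypothetical data: 200 samples in 10 dimensions lying near a 2-D subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

codes, components, mean = pca_compress(X, k=2)
X_hat = pca_reconstruct(codes, components, mean)
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(codes.shape, rel_error)
```

Here each 10-dimensional sample is summarized by two numbers, and the small reconstruction error reflects that the synthetic data was built to lie close to a two-dimensional subspace. Real data may need more components.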
Recent progress in discriminative learning, support vector machines, and regularization theory has encouraged the view that a data model may be estimated or learned from the raw data by minimizing penalty functions, e.g., linear classification constraints, on the model. An important aspect of these formalisms is that their learning algorithms generate solvable convex programs having convenient computational properties. Traditional machine learning methods have focused on estimating models (generative or discriminative) from vectorial data. However, non-vectorial or non-Euclidean data such as strings, images, audio, and video require invariance and representation learning to recast the data in a form from which a useful model can be learned.
The proper representation of images and visual data is critical for computer vision applications. The initial manner in which visual or image information is parameterized, image features are extracted, and images are mathematically described or otherwise specified is an active area of current research. The success of subsequent computer vision application modules (e.g., for image recognition, segmentation, tracking, and modeling) often rests on the initial representation chosen. Image invariance has been successfully exploited to represent data for computer vision applications. For example, in image recognition or matching modules, the solution of the so-called correspondence problem utilizes the permutational invariance of pixels in images. See, e.g., S. Gold, C. P. Lu, A. Rangarajan, S. Pappu, and E. Mjolsness, “New algorithms for 2D and 3D point matching: Pose estimation and correspondence,” Neural Information Processing Systems 7, 1995. Further, flexible specification of permutational invariance and other image invariants, and their reliable estimation, remain an important goal of machine learning in computer vision applications.
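The notion of permutational invariance of pixels can be illustrated with a small sketch. It assumes, purely for illustration, that an image is stored as an unordered collection of (row, column, intensity) tuples; the helper names `image_to_bag` and `bag_to_image` are hypothetical. This is not the point-matching algorithm of the cited reference. It only shows why the ordering of the tuples carries no information about the image, while a naive vector comparison is sensitive to that ordering.

```python
import numpy as np

def image_to_bag(img):
    """Return an (N, 3) array of (row, col, intensity) tuples, one per pixel."""
    rows, cols = np.indices(img.shape)
    return np.stack([rows.ravel(), cols.ravel(), img.ravel()], axis=1)

def bag_to_image(bag, shape):
    """Render a collection of (row, col, intensity) tuples back into an image."""
    img = np.zeros(shape, dtype=bag.dtype)
    img[bag[:, 0], bag[:, 1]] = bag[:, 2]
    return img

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(4, 5))      # small hypothetical image

bag = image_to_bag(img)
shuffled = bag[rng.permutation(len(bag))]    # reorder the tuples arbitrarily

# The rendered image does not depend on the order of the tuples.
same_image = np.array_equal(bag_to_image(shuffled, img.shape), img)

# The raw arrays differ, so comparing them element by element is misleading.
same_vector = np.array_equal(shuffled, bag)

# Sorting by (row, col) is one way to pick a canonical ordering.
order = np.lexsort((shuffled[:, 1], shuffled[:, 0]))
canonical = shuffled[order]
```

In this toy case a canonical ordering is easy to recover because each tuple carries its exact coordinates. When two images differ by an unknown transformation, no such ordering is available in advance, which is what makes the correspondence problem difficult.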
Prior attempts at uncovering invariant representations often involved particular iterative methods, which unfortunately are incompatible or inconsistent with subsequent model estimation algorithms. Further, these iterative methods also suffer from local minima, which lead to false or unreliable results. For example, B. Frey and N. Jojic, “Estimating mixture models of images and inferring spatial transformation using the EM algorithm,” Computer Vision and Pattern Recognition 1999, describes learning transformations for generative models, an approach that requires an iterative Expectation Maximization (EM) or variational implementation and also a discrete enumeration of all possible transforms. Similarly, other known iterative techniques such as congealing may uncover image rotations, but also suffer from local minima and do not scale to model estimation frameworks (e.g., discriminative model frameworks). Further, known correspondence algorithms for image registration and alignment also suffer from local minima problems and require relaxation or annealing.
Consideration is now being given to improving methods for uncovering invariant representations of data and to improving statistical methods for estimating models for the data. The data types considered include image and video data and also other data types (e.g., alphanumeric data and audio data). Attention is particularly directed to data analysis for image processing applications.