As is known, methods and systems implemented in computer vision to process images, extracting hierarchical features from data, are not currently applicable with the same results in domains such as computer graphics, where 3D shapes (manifolds) need to be processed, or computational sociology, dealing with networks (graphs).
Deep learning methods have recently had a significant impact on many domains. They are already widely used in commercial applications, including Siri speech recognition on the Apple iPhone, Google text translation, and Mobileye vision-based technology for autonomously driving cars.
Deep learning refers to learning complicated concepts by a machine, by means of building them out of simpler ones in a hierarchical or “multi-layer” manner. Artificial neural networks are a popular realization of such deep multi-layer hierarchies inspired by the signal processing done in the human brain. Though these methods have been known since the late 1960s, the computational power of modern computers, availability of large datasets, and efficient stochastic optimization methods have led to creating and effectively training complex network models that have made a qualitative breakthrough in performance.
Computer vision has perhaps been affected most dramatically by deep learning. Traditional approaches in this domain relied on “hand-crafted” axiomatic or empirical models. It turned out that constructing axiomatic models for increasingly complex concepts is nearly impossible, while at the same time, the growth of publicly available image data allowed “modeling by example”. Simply put, while it is hard to determine what makes a dog look like a dog, one can get millions of examples of dog images and use a sufficiently complex generic model to learn the “dog model” from the data. The work of Krizhevsky et al., achieving unprecedented performance on the ImageNet benchmark in 2012, has provoked a sharp resurgence of interest in deep learning methods. Deep learning methods have since been applied to practically every problem in computer vision, almost invariably outperforming the previous approaches. Overall, this sequence of successes has brought on an overwhelming trend in the community to abandon “hand-crafted” models in favor of deep learning methods.
Among the key reasons for the success of deep neural networks are important assumptions on the statistical properties of the data, namely stationarity and compositionality through local statistics, which are present in natural images, video, and speech. From the geometric perspective, one can think of such signals as functions on the Euclidean space (plane), sampled on a grid. In this case, stationarity is owed to shift-invariance, locality is due to the local connectivity, and compositionality stems from the multi-resolution structure of the grid. These properties are exploited by convolutional neural networks (CNNs), which are built of alternating convolutional and downsampling (pooling) layers. The use of convolutions allows extracting local features that are shared across the image domain and greatly reduces the number of parameters in the network with respect to generic deep architectures, without sacrificing the expressive capacity of the network. The parameters of different layers are learned by minimizing some task-specific cost function.
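The shift-invariance and parameter-sharing properties described above can be illustrated with a minimal sketch. It assumes a 1D periodic grid and a hypothetical 3-tap filter, neither of which comes from the text; it only checks that convolution commutes with shifts and compares the number of shared filter weights with that of a generic dense layer on the same grid.

```python
import numpy as np

def circular_conv(signal, kernel):
    """Circular (periodic) convolution of a 1D signal with a short kernel."""
    n = len(signal)
    out = np.zeros(n)
    for i in range(n):
        for j, w in enumerate(kernel):
            out[i] += w * signal[(i - j) % n]
    return out

rng = np.random.default_rng(0)
signal = rng.standard_normal(16)       # signal sampled on a 16-point periodic grid
kernel = np.array([0.25, 0.5, 0.25])   # hypothetical filter: 3 weights reused at every position

shift = 5
conv_then_shift = np.roll(circular_conv(signal, kernel), shift)
shift_then_conv = circular_conv(np.roll(signal, shift), kernel)

# Convolution commutes with shifts of the grid
shift_equivariant = np.allclose(conv_then_shift, shift_then_conv)

# Shared local filter vs. a generic dense linear layer on the same grid
conv_params = kernel.size                  # 3
dense_params = signal.size * signal.size   # 16 * 16 = 256
```

The check relies on the grid having a well-defined shift operation (here `np.roll`), which is the structure that non-Euclidean domains lack.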
Dealing with signals such as speech, images, or video on 1D-, 2D- and 3D Euclidean domains, respectively, has been the main focus of research in deep learning for the past decades. However, in recent years, more and more fields have had to deal with data residing on non-Euclidean geometric domains (referred to here as geometric data for brevity).
For instance, in social networks, the characteristics of users can be modeled as signals on the vertices of the social graph. In genetics, gene expression data are modeled as signals defined on the regulatory network. In computer graphics and vision, 3D shapes are modeled as Riemannian manifolds (surfaces) endowed with properties such as color texture or motion field (e.g. dynamic meshes). Even more complex examples include networks of operators, such as functional correspondences or difference operators in a collection of 3D shapes, or orientations of overlapping cameras in multi-view vision (“structure from motion”) problems. Furthermore, modeling high-dimensional data with graphs is an increasingly popular trend in general data science, where graphs are used to describe the low-dimensional intrinsic structure of the data.
On the one hand, the complexity of geometric data and availability of large datasets (in the case of social networks, on the order of billions of examples) make it tempting and very desirable to resort to machine learning techniques. On the other hand, the non-Euclidean nature of such data implies that there are no such familiar properties as global parametrization, common system of coordinates, vector space structure, or shift-invariance. Consequently, basic operations such as linear combination or convolution, which are taken for granted in the Euclidean case, are not even well defined on non-Euclidean domains.
This happens to be a major obstacle that so far has precluded the use of successful deep learning methods such as convolutional or recurrent neural networks on non-Euclidean geometric data. As a result, the quantitative and qualitative breakthrough that deep learning methods have brought into speech recognition, natural language processing, and computer vision has not yet come to fields such as computer graphics or computational sociology. Given the great success of CNNs in computer vision, devising a non-Euclidean formulation of CNNs could lead to a breakthrough in many fields wherein data reside on non-Euclidean domains.
Many machine learning techniques successfully working on images were tried “as is” on 3D geometric data, represented for this purpose in some way “digestible” by standard frameworks. In particular, several prior art methods applied traditional Euclidean CNNs for shape classification, where the 3D geometric structure of the shapes was represented as a set of range images or a rasterized volume. The main drawback of such approaches is their treatment of geometric data as Euclidean structures. First, for complex 3D objects, Euclidean representations such as depth images or voxels may lose significant parts of the object or its fine details, or even break its topological structure. Second, Euclidean representations are not intrinsic, and vary due to pose or deformation of the object. Achieving invariance to shape deformations, a common requirement in many vision applications, is extremely hard with the aforementioned methods and requires huge training sets due to the large number of degrees of freedom involved in describing non-rigid deformations.
Referring to FIG. 1, an application of volumetric CNN to a deformable shape is illustrated. A cylinder shape 1, to which a non-rigid deformation is applied, is depicted. A 4×4×4 3D filter 3a (represented with cubes) constituting part of the volumetric CNN is applied at point 2 on the cylinder 1 before the deformation, and the 3D filter 3b is applied at the same point 2 after the deformation; the two filters differ. Darkened cubes represent the elements of the filter that correlate with the shape 1. It is evident from FIG. 1 that different filters have to be applied to the cylinder 1 and its deformed version.
For more abstract geometric data, such as graphs or networks, a Euclidean representation may not exist at all. One therefore has to generalize signal processing and learning methods to graphs, a research field generally referred to as signal processing on graphs.
Traditional signal processing has been developed primarily for linear shift-invariant (LSI) systems, which naturally arise when dealing with signals on Euclidean spaces. In this framework, which dates back to the first computers and is based on mathematics that is several centuries old, basic filtering operations can be represented as convolutions, i.e., linear shift-invariant operators. The fundamental property that convolution operators are diagonalized in the Fourier basis on Euclidean domains (colloquially known as the “Convolution Theorem”), together with fast numerical algorithms for Fourier transform computation (FFT), has been the main pillar of signal and image processing in the late part of the 20th century.
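As a minimal numerical sketch of the Convolution Theorem mentioned above, the following compares a directly computed circular convolution with the same operation carried out in the frequency domain using NumPy's FFT routines. The signal length and the random test signals are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
f = rng.standard_normal(n)
g = rng.standard_normal(n)

# Direct circular convolution: (f * g)[i] = sum_j f[j] * g[(i - j) mod n]
direct = np.array([sum(f[j] * g[(i - j) % n] for j in range(n)) for i in range(n)])

# Same operation in the frequency domain: pointwise product of the two spectra
spectral = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

theorem_holds = np.allclose(direct, spectral)
```

The direct sum costs on the order of n² operations, whereas the FFT route costs on the order of n log n, which is why the combination of the theorem and the FFT became so central in practice.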
Spectral analysis techniques were extended to graphs by considering the orthogonal eigenfunctions of the Laplacian operator as a generalization of the Fourier basis. Constructions such as wavelets and short-time Fourier transforms, as well as algorithms such as dictionary learning, all originally developed for the Euclidean domain, were also generalized to graphs.
Bruna et al. employed a spectral definition of “convolution”, where filters are defined by their Fourier coefficients in the graph Laplacian eigenbasis. In classical signal processing in Euclidean spaces, by virtue of the Convolution Theorem, the convolution of two functions can be computed in the frequency domain as a product of their respective Fourier transforms: $f * g = \mathcal{F}^{-1}(\mathcal{F}f \cdot \mathcal{F}g)$, where $\mathcal{F}$, $\mathcal{F}^{-1}$ denote the forward and inverse Fourier transforms, respectively, $f$, $g$ are some functions, and $*$ denotes the convolution operation. On a graph, the convolution may be defined by the above formula, where the Fourier transform is understood as projection on the graph Laplacian eigenbasis. This method is designed to work on a single graph; a spectral model learned on one graph is in general not transferable to another one, since the filters are expressed with respect to a basis that is graph-specific (even for isometric graphs, the Laplacian eigenbases are defined up to sign).
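The spectral definition above can be sketched roughly as follows. This is an illustration under stated assumptions, not a reproduction of the cited method: it assumes the unnormalized Laplacian L = D − A, two small hypothetical graphs (an 8-vertex path, and the same path with one extra edge), and arbitrary spectral coefficients. It applies the same coefficients on both graphs to show that the resulting filters act differently.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Unnormalized graph Laplacian L = D - A (an assumed choice of Laplacian)."""
    return np.diag(adjacency.sum(axis=1)) - adjacency

def spectral_filter(adjacency, signal, coeffs):
    """Apply a filter specified by its coefficients in the Laplacian eigenbasis."""
    _, phi = np.linalg.eigh(graph_laplacian(adjacency))  # columns: orthonormal eigenvectors
    signal_hat = phi.T @ signal         # forward transform: projection on the eigenbasis
    return phi @ (coeffs * signal_hat)  # multiply spectra, then transform back

def path_graph(n):
    """Adjacency matrix of a path on n vertices."""
    a = np.zeros((n, n))
    idx = np.arange(n - 1)
    a[idx, idx + 1] = 1.0
    a[idx + 1, idx] = 1.0
    return a

n = 8
graph_a = path_graph(n)
graph_b = path_graph(n)
graph_b[0, 3] = graph_b[3, 0] = 1.0  # hypothetical second graph: one extra edge

rng = np.random.default_rng(0)
signal = rng.standard_normal(n)
coeffs = np.linspace(0.0, 1.0, n)  # arbitrary coefficients, standing in for learned ones

out_a = spectral_filter(graph_a, signal, coeffs)
out_b = spectral_filter(graph_b, signal, coeffs)

# Sanity check: all-ones coefficients give the identity, since the eigenbasis is orthonormal
all_pass = spectral_filter(graph_a, signal, np.ones(n))
```

Here `out_a` and `out_b` differ although the signal values and the coefficients are identical, because the coefficients only have meaning relative to the eigenbasis of the graph on which they were defined.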
Referring to FIG. 2, an example illustrating the difficulty of generalization across non-Euclidean domains is shown. A function defined on a human shape 4 (function values are represented by color shades) undergoes edge-detection filtering in the frequency domain resulting in function 5. The same filter applied on the same function but on a different (nearly-isometric) shape 6 produces a completely different result.
Generally speaking, this seems to be a common plight of most existing methods for signal processing and learning on graphs, which should be more appropriately referred to as “signal processing and learning on a graph”. While at first glance this seems to be a subtle difference, for machine learning algorithms, the generalization ability is a key requirement.