The general problem of analyzing images is difficult, in part because images are typically high dimensional objects. Each elemental component of an image, such as a pixel, may itself comprise several dimensions corresponding to various color and intensity values, for example. Thus, a typical rectangular RGB image having a resolution of, say, 1,000×1,000 pixels, may have a dimensionality numbering in the millions (here, 1,000×1,000×3 = 3,000,000 values). This high dimensionality can lead to computationally expensive operations for many algorithms, especially machine learning algorithms. High dimensionality also complicates automated visualization or identification of object properties, such as similarity between two images of the same object. For example, the same object may have many different appearances, depending on a variety of factors. In addition, the same object may be viewed both with and without deformation, which changes its appearance in the high dimensional image space, but does not change the object's identity. Objects may be viewed from different acquisition geometries, including different translations, rotations, foci, angular ranges of fields of view, etc. Objects may be captured from different cameras whose properties alter the high dimensional representations of the objects within their fields of view. Differences between high dimensional images of an object may be due to object type, object property (such as color, size, etc.), object pose, object transformation, camera effect, or other differences. The complexity and interdependencies of relationships between an object and a high dimensional image of an object make the computation of similarity metrics between images difficult.
The definition of a general-purpose image similarity function is an unsolved problem in computer vision (Wang et al., 2014), and many metrics, ranging from simple metrics (for example, mean square error (“MSE”), or other pixel-based metrics, such as nonlocal means (Chaudhury, 2013)) to sophisticated metrics (for example, scale invariant feature transform descriptors, or “SIFT” descriptors (Lowe, 1999), or other descriptor-based metrics), have been proposed and used for many tasks in computing similarities between images. The performance of image similarity metrics is highly dependent on the type of image and the particular task, such as recognition of an object label. One of the ways to cope with high dimensionality in the computation of meaningful similarities between images is to embed a high dimensional data object into a lower dimensional space that still captures most of the object's salient properties (i.e., its features and/or similarity properties) relevant to the application at hand (Norouzi et al., 2013). Such an embedding can be thought of as a summary of the high dimensional image that is invariant to common differences between images of the same object type, for instance.
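As a minimal sketch of why a pixel-based metric such as MSE can be a poor similarity measure, the following Python example compares a synthetic image with a translated copy of itself and with a noisy copy. The image contents, the 20-pixel shift, and the noise level are illustrative assumptions, not values from the cited works.

```python
import numpy as np

def mse(a, b):
    """Mean square error between two equally sized images."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

# Dimensionality of a 1,000 x 1,000 RGB image: one value per pixel per channel.
rgb_dims = 1000 * 1000 * 3

# Hypothetical 64 x 64 grayscale image: a bright square on a dark background.
img = np.zeros((64, 64))
img[20:40, 20:40] = 1.0

# The same object translated 20 pixels to the right (no overlap with the original).
shifted = np.roll(img, 20, axis=1)

# The same object in the same place, with mild additive noise (fixed seed).
rng = np.random.default_rng(0)
noisy = img + 0.05 * rng.standard_normal(img.shape)

mse_shifted = mse(img, shifted)
mse_noisy = mse(img, noisy)
print(rgb_dims, mse_shifted, mse_noisy)
```

Under these assumptions the translated copy, which shows the identical object, scores far worse under MSE than the noisy copy, which illustrates the dependence of pixel-based metrics on acquisition geometry.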
Many dimensionality reduction and/or embedding algorithms exist today. For brevity, we use the term “embedding” to refer to the result of any dimensionality reduction or embedding algorithm applied to an object or a plurality of objects. We also acknowledge that much of our discussion of the prior art follows Van der Maaten & Hinton, 2008.
Popular embedding algorithms include Principal Component Analysis (“PCA”), t-distributed Stochastic Neighbor Embedding (“t-SNE”) (Van der Maaten & Hinton, 2008), Sammon mapping (De Ridder & Duin, 1997), locally linear embedding (“LLE”) (Roweis & Saul, 2000), isometric feature mapping (“ISOMAP”) (Bengio et al., 2004), and multidimensional scaling (“MDS”) (Shepard, 1980). One of skill in the art will appreciate that there are many others and combinations of those listed. According to Van der Maaten & Hinton (Van der Maaten & Hinton, 2008), a large number of nonlinear dimensionality reduction techniques that aim to preserve the local structure of data have been proposed, many of which are reviewed by Lee and Verleysen (Lee & Verleysen, 2007). In particular, we mention the following taxonomy of embeddings broken into two groups. The first seven techniques are representative of formal embeddings and the eighth and ninth techniques are byproduct embeddings. This taxonomy of embedding is as follows: (1) Sammon mapping (De Ridder & Duin, 1997), (2) curvilinear components analysis (“CCA”; (Demartines & Hérault, 1997)), (3) Stochastic Neighbor Embedding (“SNE”; (G. E. Hinton & Roweis, 2002)), (4) ISOMAP (Bengio et al., 2004), (5) Maximum Variance Unfolding (“MVU”; (Weinberger & Saul, 2006)), (6) Locally Linear Embedding (“LLE”; (Roweis & Saul, 2000)), (7) Laplacian Eigenmaps (Belkin & Niyogi, 2007), (8) Autoencoders (G. E. Hinton & Salakhutdinov, 2006), and (9) intermediate hidden representations from a deep analyzer designed for some other purpose (such as object recognition features used for search (Krizhevsky, Sutskever, & Hinton, 2012), e.g.). Despite the strong performance of these techniques on artificial data sets and in some cases on real data, they are often not very successful at compactly embedding real, high-dimensional data in a way that is interpretable by a human being. 
In particular, most of the techniques are not capable of retaining both the local and the global structure of the data in a single embedding. For instance, a recent study reveals that even a semi-supervised variant of MVU is not capable of separating handwritten digits into their natural clusters (Song, Gretton, Borgwardt, & Smola, 2007).
In general, dimensionality reduction algorithms like the ones listed above operate on a collection of high dimensional data objects to create a much smaller-dimensioned manifold or topological space that preserves certain desired structural features of the objects. The embedding typically reduces the dimensionality of the original objects by an order of magnitude or more, often to between 1/10th and 1/100th of the original number of dimensions.
FIG. 1 illustrates a conceptual embedding of various image representations of the letter ‘A’ using a nonlinear dimensionality reduction algorithm. In FIG. 1, high dimensional objects 110 correspond to various representations of the letter ‘A’, which are tiled into a collection of 17×13=221 individually rotated and scaled images. The original collection of high dimensional objects 110 contains gray-scale images of the letter ‘A’ (e.g., items 111/113/115/117/119) that have been scaled and rotated by varying amounts. Every individual image of a letter ‘A’ in the collection 110 may comprise, for example, 64×64 gray-valued pixels, corresponding to an object dimensionality of 64^2=4096 dimensions per image. The vertical axis of the tiling shows 17 samples in increasing clockwise order of rotations from image tiles each with a −90 degree rotation in the top row 111-113 to a +90 degree rotation in the bottom row 115-119. The horizontal axis of the tiling shows 13 samples in decreasing scale from column 111-115 to column 113-119. In this example of an embedding 120, the embedding algorithm has discarded the correlated information in the images (i.e., the letter ‘A’ itself) and has recovered only the information that varies across the images; that is, rotation and scale. The resulting two-dimensional plot 120 represents the embedded space illustrating the results of the embedding. The corresponding 2D embedded vectors 120 of the high dimensional objects 110 have a 1:1 correspondence, meaning that for every image in the collection of high dimensional objects 110, there is exactly one dot in the collection of 2D embedded vectors 120 (i.e., each 2D embedded vector is represented by a dot). For instance, the largest scale −90 degree rotated image of the letter ‘A’ 111 corresponds to a single dot 121 in the collection of 2D embedded vectors 120.
Similarly, the smallest scale −90 degree rotated image of the letter ‘A’ 113 corresponds to a different dot 123 in the collection of 2D embedded vectors 120. Similarly, the largest scale +90 degree rotated image of the letter ‘A’ 115 corresponds to a different dot 125 in the collection of 2D embedded vectors 120. Similarly, the smallest scale +90 degree rotated image of the letter ‘A’ 119 corresponds to a different dot 129 in the collection of 2D embedded vectors 120. As is often a desirable property of embeddings, the 2D embedded vectors 120 capture salient properties of the collection of high dimensional objects 110, grouping various scales, for instance in this embedding, into samples along a ray delimited by items 123 through 121 from a central location. In other words, each image (e.g., 111/113/115/117/119) and its corresponding embedding (121/123/125/127/129, respectively) together form pairs in their respective spaces. Every set of samples along an individual ray corresponds to a specific rotation. Note that an individual image, say the largest scale, −90 degree rotated image of the letter ‘A’ 111, comprises many grayscale pixels, say on an M×M rectangular lattice, and is a high dimensional object of dimensionality M×M; its corresponding 2D embedding 121 is represented with only two dimensions, so the embedding effects a dimensionality reduction (and can be considered a kind of data compression).
FIG. 2 illustrates a real result of an embedding of various image representations of handwritten Arabic numerals written in a variety of ways. The high dimensional objects 210 are tiled into a collection of 20×20 (400) individual images of handwritten digits in the range zero to five. In this example, an embedding algorithm (t-SNE, (Van der Maaten & Hinton, 2008)) has converted affinities of the Arabic numerals 210 into probabilistic t-distributions, resulting in a clustering of the numerals into local groups 220, where each local group (often termed a cluster), such as the group for zero 225, corresponds roughly to a different numeral. The corresponding 2D embedded vectors 220 of these high dimensional objects 210 have a 1:1 correspondence, meaning that for every image in a tile in the collection of 400 high dimensional objects 210, there is exactly one representative copy of that object in the collection of 2D embedded vectors 220 (i.e., each 2D embedded vector is represented by the (x, y) location in 220 of a scaled version of the image from a particular tile in the tiling 210). For clarity, some copies of individual images corresponding to specific locations in the embedding 220 are drawn with a white background to illustrate where individual samples fall in the embedding 220. Note that in this embedding 220, the 2D embedded representations of more visually similar high dimensional objects are closer together in the embedding than visually dissimilar objects. Note, for instance, the cluster 225 of images of the numeral zero. Distances between 2D embedded representations of zeros are generally smaller than distances between 2D embedded representations of zeros and other numerals (such as four), for instance.
All embeddings produce pairs, where each single input object (e.g., a high dimensional object) is paired with one output object (e.g., a low dimensional embedded object). Many embeddings operate by minimizing an energy function of the input objects. During the formal embedding process, the overall value of the energy function is minimized as the high dimensional objects come to equilibrium with each other. One common energy function connects high dimensional objects to each other with a mathematical spring that exerts attraction and/or repulsion forces between the objects, depending on the weighted distance between them (for example). A corresponding low dimensional force may also be defined. A corresponding energy metric of the high dimensional to low dimensional representations may also be defined. Objects are permitted to move as if acted on by these forces, typically for a predefined period or until a steady state is reached. Through the process of embedding, the energy forces change as high and/or low dimensional objects move toward or away from other objects, lowering the overall energy of the embedding. The embedding process typically coevolves the high and/or low dimensional representations to equilibrium.
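The spring analogy above can be sketched as follows, under stated assumptions: the energy is taken to be the sum over object pairs of the squared difference between low dimensional and high dimensional Euclidean distances (one common choice among many), and the names `spring_energy` and `spring_embed`, the step size, and the data are hypothetical. This is an illustration of energy minimization in general, not the method of any particular cited algorithm.

```python
import numpy as np

def pairwise_distances(Z):
    """Euclidean distance between every pair of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def spring_energy(D_high, Y):
    """Sum over pairs i<j of (low-dim distance - high-dim distance)^2."""
    D_low = pairwise_distances(Y)
    return 0.5 * float(((D_low - D_high) ** 2).sum())

def spring_embed(X, dim=2, steps=300, seed=0):
    """Move embedded points along the spring forces until the step budget ends."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D_high = pairwise_distances(X)
    # Stochastic initialization of the low dimensional objects.
    Y = rng.normal(scale=1e-2, size=(n, dim))
    energies = [spring_energy(D_high, Y)]
    for _ in range(steps):
        diff = Y[:, None, :] - Y[None, :, :]
        D_low = np.sqrt((diff ** 2).sum(axis=-1))
        # Positive stretch attracts a pair, negative stretch repels it.
        stretch = (D_low - D_high) / (D_low + 1e-12)
        grad = 2.0 * (stretch[:, :, None] * diff).sum(axis=1)
        # Step size 1/(2n) is an assumed, conservative choice for this energy.
        Y = Y - grad / (2.0 * n)
        energies.append(spring_energy(D_high, Y))
    return Y, energies

rng = np.random.default_rng(42)
X = rng.standard_normal((30, 10))        # 30 hypothetical 10-D objects
Y_a, energies_a = spring_embed(X, seed=0)
Y_b, energies_b = spring_embed(X, seed=1)
print(energies_a[0], energies_a[-1])
```

Every update of one embedded point depends on all other points, and the two seeds lead to different final layouts for the same input.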
One problem with computing embeddings, as well as with the computed embeddings themselves, is that their final equilibrium state depends on each and every one of the high dimensional input objects because they all interact with each other during the embedding process. For this reason, slightly different collections of high dimensional objects may cause their low dimensional counterparts to come to equilibrium in a vastly/qualitatively different configuration. In other words, the embedding can be highly sensitive to the distribution/selection (i.e., the configuration) of high dimensional objects chosen to compute the embedding, and for this reason, the computed embeddings may be of different quality and the same embedding process generally finds different embeddings with new data. We call this the distribution/selection effect.
In some embedding computations, a putative stochastic initialization of low dimensional embedded objects is chosen as a starting point for the embedding process. As a result, the output of each embedding process can be a completely different embedding, even when the set of high dimensional input objects is exactly the same. Such a stochastic initialization effect may compound the distribution/selection effect described above and may cause many formal embeddings to be practically irreproducible in a metric sense, even if for visualization purposes, some qualitative global aspects of computed embeddings may reliably recur in nearly every embedding (such as the separation of certain clusters, or the closeness of other clusters, for instance).
In addition, most of the above-listed embedding algorithms suffer from a number of other restrictions. For example, PCA is fast, but it frequently does not produce intuitively useful embeddings because it restricts the embedding to linear projections of the high dimensional objects. Both t-SNE and MDS can produce expressive embeddings that can capture intuitive properties of the data, including nonlinear relationships between objects, but they can be prohibitively expensive to compute even on small datasets. With various conventions on rotations, PCA can be made to produce the same embedding every time it is provided with the same dataset, but t-SNE and MDS both produce a different embedding every time they are run, even when the source data has not changed. This is because t-SNE and MDS generally start from stochastic initializations of embedded object populations, as discussed above. Though the random seed governing the embedded initialization can be set to create reproducible t-SNE and MDS embeddings from a given collection of high dimensional input objects, this is generally not done because these embeddings cannot be reused to repeatably and independently embed other new high dimensional objects of the same type without extensions to the algorithms themselves.
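The repeatability of PCA under a fixed convention can be sketched as follows. The particular sign convention used here (flip each principal axis so that its largest-magnitude loading is positive) is an assumption chosen for illustration; the function name `pca_embed` and the data are likewise hypothetical.

```python
import numpy as np

def pca_embed(X, k=2):
    """Project the rows of X onto the first k principal axes."""
    X = np.asarray(X, dtype=np.float64)
    Xc = X - X.mean(axis=0)                       # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    # Assumed sign convention: largest-magnitude loading of each axis is positive.
    largest = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[np.arange(k), largest])
    signs[signs == 0] = 1.0
    components = components * signs[:, None]
    return Xc @ components.T

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))                # 100 hypothetical 20-D objects
Y1 = pca_embed(X)
Y2 = pca_embed(X.copy())                          # same data, second run
perm = rng.permutation(100)
Y3 = pca_embed(X[perm])                           # same data, different order
print(np.allclose(Y1, Y2), np.allclose(Y1[perm], Y3, atol=1e-8))
```

With the convention fixed, repeated runs and reordered inputs yield the same coordinates for each object, but the projection remains linear.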
To the inventor's knowledge, only embeddings from autoencoders and intermediate hidden representations from a deep analyzer, collectively called byproduct embeddings in the present invention, provide a subset of the benefits of deep embeddings described in the present invention, in that byproduct embeddings can simultaneously be (1) a learned function of a deep architecture operating on an input high dimensional object that does not depend on other inputs to compute an embedding; (2) deterministically repeatable after training the deep architecture; and (3) deployable on GPUs and/or FPGAs if the deep architecture can be parallelized. To the inventor's knowledge, byproduct embeddings have never been used to approximate another formal embedding for use as its own intermediate representation, or for other applications that exploit a separate formal embedding representation as described in the present invention. The specific reasons these byproduct embedding methods are not used this way are that other formal embedding methods have been designed for specific downstream application purposes better suited to the applications described herein (such as translation, reduction of required training data, or active learning-based approaches to labeling datasets to enable downstream supervised learning), and that other formal embedding methods have computational efficiency advantages compared to, for instance, autoencoders.
Specifically as it relates to the design of the embedded space with formal embedding methods, while byproduct embeddings can technically be deep architectures, akin to those described for use in the deep embeddings described below, these byproduct embeddings are not used the same way other formal embedding methods are (for visualization, e.g.), partly because byproduct embeddings are a feature learning side effect of the training procedure for the deep architecture (which may require far more dimensions than those that can be visualized, for example), rather than an embedding specifically designed for purposes that many of the other formal embedding methods described above have advantages in, such as for visualization (such as t-SNE) or computationally efficient dimensionality reduction (such as PCA). These drawbacks of byproduct embeddings from deep architectures teach away from techniques such as autoencoders and toward other formal embedding methods depending on the purpose of the embedding. Specifically, the unavoidable tradeoffs between human interpretability, generalizability of the embedded representation, and speed of computation in existing available formal embedding methods in the art have taught away from their fusion (as described in the present invention), and toward a choice of a formal embedding more suited to a particular application rather than to design a modular system that allows a design of the embedded space with a formal embedding method designed for a specific purpose (such as t-SNE for visualization, for example) to be encapsulated in a separate deep architecture that is separately optimized and separately deployed. 
While the development of “parametric t-SNE” (Van der Maaten, 2009) is one attempt to fuse the properties of formal embeddings designed for a specific purpose, it is neither modular (it only computes t-SNE embeddings) nor as computationally efficient to train as the deep embedding method described in the present invention (because parametric t-SNE uses Gibbs sampling and teaches away from backpropagation to optimize the deep architecture that effects the embedding).
While autoencoders (a type of byproduct embedding) can embed high dimensional object inputs, autoencoders in image analysis are generally used to reconstruct data rather than as an intermediate representation for other purposes, at least partially due to computational disadvantages of autoencoders. Specifically, using autoencoders for embedded representations compared to a formal embedding with t-SNE is taught away from in the art: “t-SNE provides computational advantages over autoencoders. An autoencoder consists of an encoder part and a decoder part, whereas parametric t-SNE only employs an encoder network. As a result, errors have to be back-propagated through half the number of layers in parametric t-SNE (compared to autoencoders), which gives it a computational advantage over autoencoders (even though the computation of the errors is somewhat more expensive in parametric t-SNE)” (Van der Maaten, 2009).
To the inventors' knowledge, none of the other commonly known formal embedding algorithms constitute a true deterministic function (in the mathematical sense) for adding new objects (sometimes called “out of sample objects”) to an embedding after an initial formal embedding algorithm (such as from the above-listed) has been executed. That is, none of the commonly known formal embedding algorithms define a relation between a set of inputs and a set of outputs with the property that each input is independently related to exactly one output. This means that in general, new objects cannot be added or deleted from a formal embedding after it has been created without changing the embedding. Any time a new object is added or removed, either a new embedding must be created from scratch, or the embedding process must be restarted from a former state, as the introduction of new objects unpredictably perturbs pre-existing objects in an embedding.
To mitigate the perturbations of all low dimensional embedded objects from the addition and/or removal of one or more high dimensional input objects to be embedded, one could initialize embedding forces from an existing embedding. Forces could be added for all added objects and removed for all removed objects. The embedding process could then be forward-propagated a few time steps from where it was stopped with the new population. But the key concerns with modifying existing embedding algorithms (whether the algorithm is completely restarted or only perturbed from a former state near equilibrium) are that both options are computationally expensive and both employ forces that act on all objects simultaneously, so all embedded objects move around a little bit, even if only one new high dimensional object is added. In the case of a completely new embedding, in general, a stochastic re-initialization and sensitivity to the distribution (i.e., the collection of high dimensional objects to embed) make all new low dimensional embedded objects behave differently from previously embedded objects, thereby making measurements of similarity in the low dimensional embedded space problematic or impossible due to the changing dependence on other objects.
These limitations to existing embedding algorithms complicate trend analyses, limit usefulness, and tie each formal embedding to a specific population used to discover the embedding in the first place.
As an example of a popular formal embedding that illustrates many of the practical difficulties of formal embeddings described above, Van der Maaten & Hinton (Van der Maaten & Hinton, 2008) explain that the process of Stochastic Neighbor Embedding (“SNE”) starts by converting high-dimensional Euclidean distances (with optional weightings) between high dimensional objects into conditional probabilities that represent similarities between the high dimensional objects. SNE can also be applied to data sets that consist of pairwise similarities between high dimensional objects rather than a collection of high-dimensional vector representations themselves. This pairwise similarity approach is akin to interpreting these similarities as conditional probabilities. For example, human word association data consists of the probability of producing each possible word in response to a given word, as a result of which human word association data is already in the form required to apply the SNE process. The similarity of high dimensional object, xj, to high dimensional object, xi, is the conditional probability, p(j|i), that xi would pick xj as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian distribution centered at high dimensional object, xi. The self terms, p(i|i), are set to zero, leaving only the pairwise similarities nonzero. For nearby high dimensional object pairs, xi and xj, p(j|i) is relatively high, whereas for widely separated high dimensional objects, p(j|i) will be almost zero (for reasonable values of the standard deviation of the Gaussian, si). The standard deviation, si, for every object, xi, is computed by searching for the value of si that yields an approximately fixed perplexity, where perplexity is 2^H(Pi) and H(Pi) is the Shannon entropy (in bits) of the induced distribution over all high dimensional objects, or
$$H(P_i) = -\sum_j p(j \mid i)\,\log_2 p(j \mid i).$$
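The search for the standard deviation that yields a fixed perplexity can be sketched as a bisection, assuming (as is true for this Gaussian form) that perplexity grows monotonically with the standard deviation. The function names, the search bounds, the target perplexity of 5, and the data are hypothetical choices for illustration.

```python
import numpy as np

def conditional_probs(sq_dists, i, s_i):
    """p(j|i) for all j, given squared distances from x_i and standard deviation s_i."""
    logits = -np.asarray(sq_dists, dtype=np.float64) / (2.0 * s_i ** 2)
    logits[i] = -np.inf                    # self term p(i|i) is set to zero
    logits = logits - logits.max()         # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """2 raised to the Shannon entropy (in bits) of the distribution p."""
    nz = p[p > 0]
    entropy_bits = -(nz * np.log2(nz)).sum()
    return 2.0 ** entropy_bits

def find_s(sq_dists, i, target, lo=1e-3, hi=1e3, iters=60):
    """Bisection in log space for the s_i whose perplexity matches the target."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if perplexity(conditional_probs(sq_dists, i, mid)) > target:
            hi = mid                       # distribution too flat: shrink s_i
        else:
            lo = mid                       # distribution too peaked: grow s_i
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))          # 50 hypothetical 10-D objects
sq = ((X[0] - X) ** 2).sum(axis=1)         # squared distances from object 0
s0 = find_s(sq, 0, target=5.0)
p0 = conditional_probs(sq, 0, s0)
print(s0, perplexity(p0))
```

One such search is run per object, which is part of the per-object cost of setting up the high dimensional similarities.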
The low dimensional embedded vectors corresponding to high dimensional objects xi and xj are yi and yj, respectively. That is, there is a one-to-one mapping (correspondence) between high dimensional objects (each xi) and low dimensional embedded vectors (each yi). The similarity between each yi and yj is computed as if all y's were distributed Gaussian with a constant variance in the low dimensional space. The embedded vector counterparts of the p(i|j) and p(j|i) distributions are the q(i|j) and q(j|i) distributions, and self terms, q(i|i), are also set to zero as for p(i|i). In the case of a perfect embedding, p(j|i) will equal q(j|i), but in general, these distributions will diverge. SNE discovers the embedding by moving the y's to minimize that divergence between p(j|i) and q(j|i). Specifically, in SNE, the divergence, known as the Kullback-Leibler divergence, is defined as follows:
$$\text{Kullback-Leibler divergence} = \sum_i KL(P_i \,\|\, Q_i) = \sum_i \sum_{j \neq i} p(j \mid i)\,\log\!\left(\frac{p(j \mid i)}{q(j \mid i)}\right).$$
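For reference, the two conditional distributions that enter this divergence can be written out explicitly. The forms below follow the verbal definitions given above and the standard SNE formulation; the choice of a low dimensional variance of 1/2, which makes the exponent simply the squared distance, is an assumed constant, since the text only requires that it be constant.

```latex
p(j \mid i) = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2 s_i^2\right)}
                   {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2 s_i^2\right)},
\qquad p(i \mid i) = 0,

q(j \mid i) = \frac{\exp\!\left(-\lVert y_i - y_j \rVert^2\right)}
                   {\sum_{k \neq i} \exp\!\left(-\lVert y_i - y_k \rVert^2\right)},
\qquad q(i \mid i) = 0.
```

Both denominators sum over every other object, which is why each p(j|i) and q(j|i) depends on the entire population rather than on the pair (i, j) alone.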
All embedded vectors, yi are initialized as a random sample from an isotropic Gaussian in the low dimensional embedding space. That is, this is a stochastic initialization of all yi. The SNE process iteratively reduces the Kullback-Leibler divergence between p(j|i) and q(j|i) using gradient descent. Specifically, the derivative of the cost function of the Kullback-Leibler divergence of the p(j|i) and q(j|i) distributions with respect to yi is computed in closed form. Every embedded vector, yi, is moved in the direction of the negative gradient scaled by the gradient descent step size. In some cases, a momentum term is included in the computation of the gradient that adds the current gradient to a sum of exponentially decaying past gradients.
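The iteration described above can be sketched as follows. The closed-form gradient used is the standard SNE expression, 2 Σ_j (p(j|i) − q(j|i) + p(i|j) − q(i|j)) (yi − yj). To keep the sketch short, a single fixed standard deviation `s` replaces the per-object perplexity search, and the step size, momentum, iteration count, and clustered test data are illustrative assumptions; the function names are hypothetical.

```python
import numpy as np

def neighbor_probs(Z, scale):
    """Row i holds the conditional distribution over j for exp(-|z_i - z_j|^2 / scale)."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq / scale
    np.fill_diagonal(logits, -np.inf)              # self terms are set to zero
    logits -= logits.max(axis=1, keepdims=True)    # shift for numerical stability
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)

def kl_cost(P, Q, eps=1e-12):
    """Sum over i of KL(P_i || Q_i), skipping zero entries of P."""
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))).sum())

def sne(X, s=2.0, dim=2, steps=300, lr=0.05, momentum=0.5, seed=0):
    rng = np.random.default_rng(seed)
    P = neighbor_probs(X, 2.0 * s ** 2)            # P[i, j] = p(j|i), assumed fixed s
    # Stochastic initialization from a small isotropic Gaussian.
    Y = rng.normal(scale=1e-2, size=(X.shape[0], dim))
    velocity = np.zeros_like(Y)
    history = []
    for _ in range(steps):
        Q = neighbor_probs(Y, 1.0)                 # Q[i, j] = q(j|i)
        history.append(kl_cost(P, Q))
        M = (P - Q) + (P - Q).T
        # grad_i = 2 * sum_j M[i, j] * (y_i - y_j)
        grad = 2.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)
        # Momentum: the update keeps an exponentially decaying sum of past gradients.
        velocity = momentum * velocity - lr * grad
        Y = Y + velocity
    history.append(kl_cost(P, neighbor_probs(Y, 1.0)))
    return Y, history

rng = np.random.default_rng(1)
centers = 10.0 * np.eye(3, 10)                     # three hypothetical cluster centers
X = np.repeat(centers, 10, axis=0) + rng.standard_normal((30, 10))
Y, history = sne(X)
print(history[0], history[-1])
```

Each step recomputes all pairwise q(j|i) values and moves every embedded vector, so the per-iteration cost grows quadratically with the number of objects in this naive form.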
The iterative minimization of the Kullback-Leibler divergence via gradient descent is difficult for a number of reasons, and these all teach away from using similar approaches at scale or in cases where a repeatable metric is required.
First, the computation of all pairwise probability densities can be expensive for large numbers of high dimensional objects. For instance, for a modern computer running SNE on a CPU, it is not uncommon for SNE to require multiple hours to converge to an embedding, depending on the dimensionality and intrinsic dimensionality of the x's and on the parameters chosen for the embedding (like step size, momentum, magnitude of random perturbations of yi's during the embedding process, a momentum reduction schedule, etc.). Further, because of the difficulties described below, it is not uncommon to run the embedding process multiple times to discover parameters and/or embeddings that produce the best result. The stochastic, computationally intensive, and time intensive runs, iteratively searching jointly for optimization parameters and embedding results, frustrate the practical use of these embeddings for applications that require a repeatable and/or computationally efficient method of embedding a new high dimensional object independently of those used to discover an embedding. To the inventor's knowledge, no application uses such a formal embedding in this way.
Second, the random initialization can have the unintended effect of causing some computed yi embedded distributions to be intrinsically higher energy than others, and lower energy embeddings are generally preferable. To cope with this issue associated with some random initializations, Van der Maaten and Hinton (Van der Maaten & Hinton, 2008) suggest at least two solutions. The first solution is to run SNE multiple times and select the solution with the lowest Kullback-Leibler divergence as the embedding. A second solution (that can be implemented optionally with or without the first), is to add a small amount of noise to every embedded low dimensional vector at the end of every early iteration of the gradient descent at the same time that a momentum term is gradually reduced (akin to annealing methods). These last two workarounds both have the effect of making it less likely that the iterative SNE procedure will become trapped in local minima. While these workarounds may help SNE overcome local minima, these workarounds both exacerbate the unpredictability of the final embeddings due to both the random initializations and the random noise added to embedding vector locations during the SNE computation process. Thus, with the injection of noise to overcome local minima, even close embedding initializations may converge to very different final embeddings—i.e., the embedding process can be highly sensitive to initial conditions and to additional stochastic noise injected in the embedding process to make it more likely to find a lower energy equilibrium embedding.
Third, there is a phenomenon called the “crowding problem” in embeddings, where the intrinsic dimensionality of the distribution of high dimensional objects is larger than the dimensionality of the embedding space. Specifically, as Van der Maaten and Hinton (Van der Maaten & Hinton, 2008) explain, the low dimensional volume of the embedding that is available to accommodate moderately distant high dimensional objects will not be large enough compared with the low dimensional volume available to accommodate very close high dimensional points. Therefore, in order to properly embed small distances, most of the high dimensional objects that are at a moderate distance from a specific high dimensional object will have to be placed much too far away in the low-dimensional embedding. This crowding problem is not specific to SNE, but also occurs in other local techniques, such as t-SNE, Sammon mapping and locally linear embedding (“LLE”) and other formal embeddings.
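The volume argument behind the crowding problem can be made concrete with a short calculation. Because the volume of a ball in m dimensions scales as the m-th power of its radius, the room available at moderate distance (between radius r and 2r) relative to the room available nearby (within radius r) is 2^m − 1. The function name and the choice of radii are illustrative assumptions.

```python
def shell_to_core_ratio(m):
    """Volume between radius r and 2r divided by volume within radius r, in m dimensions."""
    return 2 ** m - 1

for m in (2, 10):
    print(m, shell_to_core_ratio(m))
# prints:
# 2 3
# 10 1023
```

A ten-dimensional neighborhood therefore has hundreds of times more room for moderately distant objects, per unit of nearby room, than a two-dimensional embedding can offer.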
Fourth, since all embedded objects depend on all other embedded objects, the addition of even one more high dimensional object into the SNE process will perturb all other points, sometimes by large amounts that produce qualitatively different embedding results. In general, embeddings are implicitly defined, and are not functions, per se, that take each high dimensional object as input and quickly compute its low dimensional embedding. While it is theoretically possible to add one or more individual high dimensional objects to the SNE process, in practice, it is not done due to the compounding of difficulties described above, and the sensitivity of all embedded objects to the addition or removal of high dimensional objects. Specifically, embedding new high dimensional objects rearranges originally or previously embedded high dimensional objects, so if a downstream application were to depend on computations based on originally and/or previously embedded objects in a downstream application, those would need to be updated whenever new embedded objects are computed, or whenever embeddings change more than a tolerance, for instance.
SNE is only one example of an iterative embedding that suffers from population distribution sensitivity, optimization, crowding, computational, and addition/removal sensitivity issues. Similar difficulties arise in more recent and related embedding techniques, such as t-SNE (Van der Maaten & Hinton, 2008), Barnes-Hut-SNE (Van der Maaten, 2013), UNI-SNE (Cook, Sutskever, Mnih, & Hinton, 2007), tree-based t-SNE (Van der Maaten, 2014), and parametric t-SNE (Van der Maaten, 2009).
Although t-SNE ameliorates the crowding problem by allowing distant high dimensional objects to effectively decouple, it does not eliminate it. Parametric t-SNE provides a number of approaches to addressing the crowding problem, of which Van der Maaten (Van der Maaten, 2009) recommends learning the parameter defining the degrees of freedom. Learning the degrees of freedom exacerbates the practical computational difficulties in both finding and using embeddings.
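The role of the degrees of freedom can be illustrated with a short Python sketch of Student-t similarities in the low dimensional space. It assumes the usual kernel form, in which the unnormalized similarity of two embedded points at squared distance d² is (1 + d²/α)^(−(α+1)/2), with α the degrees of freedom. The function name `student_t_similarities` and the example coordinates are hypothetical.

```python
import numpy as np

def student_t_similarities(Y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Pairwise low dimensional similarities q_ij under a Student-t kernel
    with alpha degrees of freedom, normalized over all pairs."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    weights = (1.0 + sq_dists / alpha) ** (-(alpha + 1.0) / 2.0)
    np.fill_diagonal(weights, 0.0)  # self-similarity is excluded
    return weights / weights.sum()

if __name__ == "__main__":
    # Three embedded points on a line: one near pair and one far pair.
    Y = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
    for alpha in (1.0, 100.0):
        q = student_t_similarities(Y, alpha)
        # Similarity of the far pair relative to the near pair.
        print(alpha, q[0, 2] / q[0, 1])
```

With α = 1 (the heavy-tailed kernel of t-SNE), the far pair retains a non-negligible similarity relative to the near pair, which is what lets distant objects sit far apart at little cost. As α grows the kernel approaches a Gaussian and that slack disappears. Treating α as a learned quantity adds a further parameter to the optimization.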
The emergence of small form factor, power efficient graphics processing units (“GPUs”) has led to the deployment of these devices for new applications (on drones and mobile devices, etc.). The example embedding process described above is iterative (i.e., serial) and most commonly has different memory and computational parallelizability requirements than those that GPU hardware is typically designed to serve. Therefore, even if it were possible to incrementally update the embedding with one or more additional high dimensional objects, to do so efficiently would require a different runtime profile and hardware configuration than the use case where the system only has access to a preloaded deep architecture running on a power efficient GPU. Deployed applications may require the computation of an embedding for new high dimensional objects faster than the embedding process can compute them (requiring, for instance, the computation of hundreds of high dimensional object embeddings per second), such that existing formal embedding methods cannot keep up with the use case (for example, if it requires speed and repeatability). This mismatch between computational infrastructure and performance requirements for formal embeddings and the deployed system embodiment of deep learning algorithms has impeded the incorporation of formal embedding methods into deep learning applications, especially applications that may not have the time, memory, or hardware to compute an embedding within the duty cycle of the application (where duty cycle is, for instance, computing a similarity of two high dimensional image objects at a relatively high frame rate, say 30 frames per second).
Before the expressivity of deep learning methods was recognized by the community, multiple groups attempted to learn a modular embedding function with eigenfunctions (Bengio et al., 2004) and/or eigenmaps (Belkin & Niyogi, 2007). However, both (1) the poor fidelity and (2) the poor computational performance of eigenfunctions and eigenmaps in approximating embeddings have taught away from this approach of modularizing the approximation of an embedding with a function.
Due to the difficulties of both approximating and computing an embedding as a separate module (as a deep architecture, for instance), Van der Maaten (Van der Maaten, 2009) outlined an approach that couples the two processes to learn a specific t-SNE embedding function directly as a deep architecture. In this case, the deep architecture itself computes the embedding by directly minimizing a t-SNE embedding loss. In this way, the parametric t-SNE approach (Van der Maaten, 2009) is coupled directly to the process of embedding and not modularized from it; therefore there is no approximation error when applying the embedding to new points that were not used to compute the embedding itself, and the deep architecture enjoys the computational advantages of being deployable on a GPU. One key drawback of the parametric t-SNE approach (Van der Maaten, 2009) is that, to learn an embedding other than t-SNE, the deep architecture's loss function itself must be explicitly reformulated. The parametric t-SNE approach approximated t-SNE (Van der Maaten, 2009), but it is not clear how, or even whether it is possible, to extend such an approach generally to approximate other embeddings, such as LLE, ISOMAP, MDS, or MVU, or those listed in the taxonomy of formal embeddings. These coupling and approximation considerations have taught away from the concept of modularizing the embedding process from the deep architecture. Decoupling the design of the embedding from its deployed embodiment, both computationally and in the hardware required to execute the deep embedding, is a focus of the present invention.
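The coupling described here can be made concrete with a minimal Python sketch of the kind of objective a parametric approach minimizes: the Kullback-Leibler divergence between target similarities P and the Student-t similarities Q of the points produced by a parametric map. Everything specific in the sketch is an assumption for illustration: `linear_embed` is a hypothetical linear stand-in for a deep architecture, and P is derived from a reference configuration rather than from perplexity-calibrated high dimensional similarities.

```python
import numpy as np

def low_dim_similarities(Y: np.ndarray) -> np.ndarray:
    """Student-t (one degree of freedom) similarities, normalized over all pairs."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    weights = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum()

def tsne_kl_loss(P: np.ndarray, Y: np.ndarray, eps: float = 1e-12) -> float:
    """KL(P || Q) between target similarities P and those induced by Y."""
    Q = low_dim_similarities(Y)
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

def linear_embed(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Hypothetical parametric map; a deep architecture would replace this."""
    return X @ W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 4))
    W_ref = rng.standard_normal((4, 2))
    W_other = rng.standard_normal((4, 2))
    # Stand-in target similarities taken from a reference configuration.
    P = low_dim_similarities(linear_embed(X, W_ref))
    print(tsne_kl_loss(P, linear_embed(X, W_ref)))    # approximately zero
    print(tsne_kl_loss(P, linear_embed(X, W_other)))  # positive
```

Because `tsne_kl_loss` is specific to t-SNE, approximating a different formal embedding in this coupled style would require writing and differentiating a different loss. That is the reformulation burden noted in the paragraph above.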
The parametric t-SNE approach (Van der Maaten, 2009) is separated into three distinct stages: (1) pretraining, (2) construction, and (3) finetuning. The three stage process begins with a computationally expensive Gibbs sampling-based optimization process, which diverges radically from modern approaches that train deep architectures with backpropagation. Specifically, the pretraining stage of the parametric t-SNE approach teaches away from the backpropagation-based techniques proposed for the present invention of deep embedding, writing: “the three-stage training procedure aims to circumvent the problems of backpropagation procedures that are typically used to train neural networks.” Other arguments also teach away from dropping the pretraining stage in parametric t-SNE, including “preliminary experiments revealed that training parametric t-SNE networks without the pretraining stage leads to an inferior performance” (Van der Maaten, 2009). The Restricted Boltzmann Machines (“RBMs”) in the pretraining stage are composed of Bernoulli and Gaussian distributed hidden units, and also teach away from the newer, more effective unit types (such as rectified linear units) and their corresponding initializations (Glorot & Bengio, 2010) and normalization techniques (Ioffe & Szegedy, 2015). The argument for a sampling-based pretraining step, in general, teaches away from the improved optimization techniques that address many of the problems of backpropagation (Van der Maaten, 2009) and that have been incorporated into some embodiments of the present invention.
Since 2012, it has been discovered that deep architectures, after supervised training, can themselves effect an implicit embedding as a byproduct of training (called byproduct embeddings in the taxonomy above), and that this representation can be used directly for other applications. In machine translation applications in 2013, for example, a formal PCA embedding of the high dimensional objects (words) using a word2vec embedded space, discovered in some cases by a deep architecture, has been shown to be conserved across different languages, and can improve machine translation results of words and phrases (Mikolov, Le, & Sutskever, 2013). None of these language translation methods used a deep embedding method as described in the present invention. The same use of a formal embedding method to discover a space that would allow translation of other high dimensional objects (such as images) has not been shown, but is a focal application that the present invention enables.