Dimensionality reduction, in general terms, relates to presenting high-dimensional data in a lower-dimensional representation such that the representation depicts the structure of the high-dimensional data as accurately as possible. Dimensionality reduction has many applications, such as identifying otherwise unknown relationships between objects. Generally, dimensionality reduction includes showing relationships between objects in a lower-dimensional representation, such as a two-dimensional (2-D) visualization (e.g., a scatter plot) or a three-dimensional (3-D) visualization.
Several dimensionality reduction techniques, such as t-distributed Stochastic Neighbor Embedding (t-SNE), have been successfully applied to real-world datasets to unveil the underlying structure of the datasets. However, this and other dimensionality reduction techniques suffer from a number of problems. For instance, these conventional dimensionality reduction systems fail to produce accurate results when information for a dataset is incomplete. For example, while conventional systems provide lower-dimensional representations of higher-dimensional data, relationships shown in the lower-dimensional representations often do not accurately reflect corresponding relationships found in the original space (e.g., the representation does not reflect a truthful embedding of the original high-dimensional data).
As another problem, conventional systems have limited scalability due to the high computational complexity of existing dimensionality reduction techniques. In particular, as the size of a dataset increases, the required processing time and resources increase exponentially. This problem is further amplified as technological advances have lowered the cost of collecting and gathering data, leading to ever-increasing datasets. Accordingly, conventional systems can become computationally slow or even limited when handling large datasets.
Another technique that conventional systems employ for dimensionality reduction is t-distributed Stochastic Triplet Embedding (t-STE). t-STE uses triplets to learn an embedding for a set of objects based on relative distance comparisons (called “triplet embedding”). As used herein, the term “triplet” refers to a set of three items or objects (e.g., multi-dimensional data points) that compares the relative distance (i.e., similarity) of the first item in the triplet to the remaining two items. For example, a triplet that includes the items A, B, and C can indicate that Item A is more similar to Item B than to Item C. Accordingly, the pairwise distance between Item A and Item B is shorter than the pairwise distance between Item A and Item C. As described herein, the term “triplet constraint” refers to the above-mentioned similarity relationship. Further, relative similarity comparisons for a triplet are often provided in the form of (i,j|k), meaning that “item i is more similar to j than to k.”
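As a minimal illustration of the triplet constraint described above, the following Python sketch checks whether a triplet (i,j|k) holds for a set of points under Euclidean distance. The function name satisfies_triplet, the choice of Euclidean distance, and the sample coordinates are assumptions made for illustration only; they are not part of t-STE or of any particular system.

```python
import math

def satisfies_triplet(points, triplet):
    """Return True if triplet (i, j | k) holds: point i is closer to j than to k.

    Hypothetical helper; Euclidean distance is an illustrative assumption.
    """
    i, j, k = triplet
    d_ij = math.dist(points[i], points[j])  # pairwise distance between i and j
    d_ik = math.dist(points[i], points[k])  # pairwise distance between i and k
    return d_ij < d_ik

# Items A, B, and C placed on a line (made-up coordinates).
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0)}

holds = satisfies_triplet(points, ("A", "B", "C"))     # (A, B | C)
violated = satisfies_triplet(points, ("A", "C", "B"))  # (A, C | B)
```

Here (A, B|C) is satisfied because A lies closer to B than to C, while the reversed triplet (A, C|B) is not.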
As mentioned above, the similarity comparison in a triplet is often a relative measurement. In other words, while a triplet indicates that a first item is more similar to a second item than to a third item, the triplet provides no indication as to why the first item is more similar to the second item or which similarity metric was used to arrive at this conclusion. As such, triplets enable a human evaluator to compare items that do not have quantitative values. Stated differently, human evaluators can easily create triplets based on similarity comparisons rather than absolute comparisons. For example, a human evaluator determines that Movie A and Movie B are more alike than Movie A and Movie C, as opposed to determining that Movie A and Movie B are 70% alike while Movie A and Movie C are 45% alike.
Using relative similarity comparisons, however, has created additional problems for conventional systems. Specifically, a large portion of triplet data is gathered through crowdsourcing platforms where human evaluators manually judge similarities between items. Because human evaluators may consider different notions of similarity, triplets often include inconsistent and conflicting similarity constraints. As an example of a conflicting similarity constraint, one person determines that Movie A is more like Movie B than Movie C (i.e., (A, B|C)), while another person determines that Movie A is more like Movie C than Movie B (i.e., (A, C|B)).
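The kind of direct conflict described above can be detected with a short Python sketch. The function name find_conflicts and the sample triplets are hypothetical, and the sketch only finds exact reversals of the form (i,j|k) versus (i,k|j); it does not detect indirect inconsistencies that arise across chains of triplets.

```python
def find_conflicts(triplets):
    """Return one representative (i, j, k) per directly conflicting pair.

    Two triplets conflict when both (i, j | k) and (i, k | j) are present.
    Hypothetical helper for illustration only.
    """
    seen = set(triplets)
    return sorted(
        (i, j, k)
        for (i, j, k) in seen
        if (i, k, j) in seen and j < k  # j < k reports each pair only once
    )

# Made-up crowdsourced judgments: two evaluators disagree about Movie A.
triplets = [
    ("A", "B", "C"),  # evaluator 1: A is more like B than C
    ("A", "C", "B"),  # evaluator 2: A is more like C than B
    ("B", "A", "C"),  # consistent with the others; no reversal present
]

conflicts = find_conflicts(triplets)
```

In this example, conflicts contains the single entry ("A", "B", "C"), flagging the disagreement between the two evaluators.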
The presence of inconsistent or conflicting similarity constraints in a dataset is called noise. Noise in a dataset that includes crowdsourced triplets is almost unavoidable due to the different skill levels or opinions of human evaluators. A major drawback of conventional systems, and t-STE in particular, is their inability to handle datasets with even low levels of noise. In particular, conventional systems overcompensate and overcorrect when a noisy or outlier triplet is present. This results in an inaccurate dimensionality reduction and, in turn, an untruthful embedding of the underlying data. Visual examples illustrating how the performance of conventional systems drops drastically with the introduction of noise are provided below in connection with the figures.
These along with additional problems and issues exist with regard to current and traditional dimensionality reduction methods and techniques. Accordingly, there remains a need for an improvement in the area of dimensionality reduction.