As the costs of data storage have declined over the years, and as the ability to interconnect various elements of the computing infrastructure has improved, more and more data pertaining to a wide variety of applications can potentially be collected and analyzed using increasingly sophisticated machine learning algorithms. The analysis of data collected from sensors embedded within airplane engines, automobiles, health monitoring devices or complex machinery may be used for various purposes such as preventive maintenance, proactive health-related alerts, improving efficiency and lowering costs. Streaming data collected from an online retailer's websites can be used to make more intelligent decisions regarding the quantities of different products which should be stored at different warehouse locations, and so on. Data collected about machine servers may be analyzed to prevent server failures. Photographs and videos may be analyzed, for example, to detect anomalies which may represent potential security breaches, or to establish links with other photographs or videos with a common subject matter.
Some machine learning algorithms, including, for example, various types of neural network models used for “deep learning” applications such as image analysis, may typically require long training times due to the need for a large number of training examples. Often, the number of parameters whose values are to be learned in such models may be quite large—e.g., the model may comprise a number of internal layers, also referred to as “hidden” layers, each with its own set of parameters, and the total number of parameters may be in the millions. As a result, training data sets with tens or hundreds of millions of examples may have to be found and labeled for some applications.
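As a rough sketch of why parameter counts grow so quickly (the layer sizes below are illustrative assumptions, not figures from this disclosure), a fully connected network's parameter count is the sum over adjacent layers of (inputs + 1 bias) × outputs:

```python
def dense_param_count(layer_sizes):
    """Total learnable parameters in a fully connected network.

    Each layer contributes (fan_in + 1) * fan_out parameters,
    the +1 accounting for the bias term of each output unit.
    """
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical example: 1000 inputs, three hidden layers of 1024 units,
# and 10 outputs already exceed three million parameters.
sizes = [1000, 1024, 1024, 1024, 10]
print(dense_param_count(sizes))  # → 3134474
```

Even this modest architecture illustrates how a model with a handful of hidden layers can reach millions of parameters, motivating correspondingly large labeled training sets.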
The problem of large training data sets may become even more burdensome when combinations of input records have to be constructed. For example, for some types of image similarity analysis algorithms, pairs of images may have to be analyzed together, and the total number of pairs which can be generated from even a moderately-sized collection of images may quickly become unmanageable. Of course, if the training data set selected for a model (e.g., using random sampling of the available examples) is too small, the quality of the model results may suffer. Even though the available processing power and storage space of computing systems used for model training has increased substantially over time, training high-quality models for which extremely large training data sets may be generated remains a challenging technical problem.
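The quadratic growth of candidate pairs described above can be made concrete with a short sketch (the collection sizes are illustrative assumptions, not figures from this disclosure); the number of unordered pairs drawn from n images is n(n−1)/2:

```python
from math import comb

def pair_count(n_images: int) -> int:
    """Number of distinct unordered image pairs: n choose 2."""
    return comb(n_images, 2)

# Illustrative collection sizes: pair counts quickly become unmanageable.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} images -> {pair_count(n):,} candidate pairs")
# 1,000 images already yield 499,500 pairs;
# 1,000,000 images yield 499,999,500,000 pairs.
```

This is why pairwise similarity training cannot simply enumerate all combinations, and why naive random sampling must be balanced against model quality.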
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.