Although a human viewing an image may be able to readily discern what the image depicts, the same cannot be said of machines. An image that is stored digitally is represented as a series of 0's and 1's (or bits). At a higher level of abstraction, an image is made up of a collection of pixels, which have no interdependency. Each pixel is associated with color information that is represented by a predetermined number of bits (e.g., 8 bits). Thus, to a machine, a digital image is a collection of color information that specifies what colors should be displayed by a corresponding collection of pixels. Without more, the machine would be unable to discern, from only the bits representing color information, what objects are depicted in the image and where those objects are located within the image.
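As a minimal illustration of the representation described above, the following sketch serializes a small image into the bits a machine would store. The 2x2 grayscale image, its pixel values, and the helper name `to_bits` are hypothetical and chosen only for illustration; real image formats add headers, color channels, and compression.

```python
# Hypothetical 2x2 grayscale image: one 8-bit intensity (0-255) per pixel.
image = [
    [0, 255],
    [128, 64],
]


def to_bits(image, bits_per_pixel=8):
    """Serialize pixel values row by row into a string of 0s and 1s."""
    return "".join(
        format(value, f"0{bits_per_pixel}b")  # zero-padded binary digits
        for row in image
        for value in row
    )


bitstream = to_bits(image)
print(bitstream)
```

Nothing in the resulting bit string indicates what the image depicts. It only records, pixel by pixel, which intensity should be displayed.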
Machine learning may be used to enable machines to automatically detect and process objects appearing in images. In general, machine learning involves processing a training data set in accordance with a machine-learning model and updating the model based on a training algorithm so that it progressively “learns” the features in the data set that are predictive of the desired outputs. The architecture of the machine-learning model, along with how it is trained and what training data is used, determines what the trained model is capable of doing.
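The process-and-update cycle described above can be sketched with a deliberately simple model. The example below assumes a one-parameter model `y = w * x`, a mean-squared-error objective, and gradient descent as the training algorithm. None of these choices comes from the text, and the data set, function name, and hyperparameters are hypothetical.

```python
def train(xs, ys, learning_rate=0.01, epochs=200):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # the model's single parameter, before any learning
    n = len(xs)
    for _ in range(epochs):
        # Process the training data with the current model and measure
        # how the error changes with respect to w.
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        # Update the model so that its predictions move toward the targets.
        w -= learning_rate * grad
    return w


# Hypothetical training data in which each output is twice the input.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = train(xs, ys)
print(w)
```

With this data the learned parameter approaches 2, the relationship that predicts the desired outputs. Changing the model, the training procedure, or the training data would change what the trained model is able to do.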
One example of a machine-learning model is a neural network, which is a network of interconnected nodes. Groups of nodes may be arranged in layers. The first layer of the network that takes in input data may be referred to as the input layer, and the last layer that outputs data from the network may be referred to as the output layer. There may be any number of internal hidden layers that map the nodes in the input layer to the nodes in the output layer. In a feed-forward neural network, the outputs of the nodes in each layer—with the exception of the output layer—are configured to feed forward into the nodes in the subsequent layer.
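The layered feed-forward structure described above can be sketched as follows. This is a minimal illustration under stated assumptions: every node is connected to every node in the preceding layer, each node applies a sigmoid activation, and the layer sizes and weight values are hypothetical. The text itself does not specify any of these details.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute one layer's outputs from the previous layer's outputs."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def feed_forward(inputs, layers):
    """Feed each layer's outputs forward into the subsequent layer."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 input nodes -> 3 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.5], [0.1, 0.2], [-0.3, 0.8]], [0.0, 0.0, 0.0]),  # hidden layer
    ([[1.0, -1.0, 0.5]], [0.0]),                                # output layer
]
output = feed_forward([1.0, 2.0], layers)
print(output)
```

Here the input layer is simply the list of input values, each `(weights, biases)` pair stands for one subsequent layer, and the list returned by the last pair is the output layer's result.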