Technical Field
The present disclosure generally relates to deep convolutional neural networks (DCNN). More particularly, but not exclusively, the present disclosure relates to a hardware accelerator engine arranged to implement a portion of the DCNN.
Description of the Related Art
Known computer vision, speech recognition, and signal processing applications benefit from the use of deep convolutional neural networks (DCNN). A seminal work in the DCNN arts is “Gradient-Based Learning Applied To Document Recognition,” by Y. LeCun et al., Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998, which led to winning the 2012 ImageNet Large Scale Visual Recognition Challenge with “AlexNet.” AlexNet, as described in “ImageNet Classification With Deep Convolutional Neural Networks,” by Krizhevsky, A., Sutskever, I., and Hinton, G., NIPS, pp. 1-9, Lake Tahoe, Nev. (2012), is a DCNN that significantly outperformed classical approaches for the first time.
A DCNN is a computer-based tool that processes large quantities of data and adaptively “learns” by conflating proximally related features within the data, making broad predictions about the data, and refining the predictions based on reliable conclusions and new conflations. The DCNN is arranged in a plurality of “layers,” and different types of predictions are made at each layer.
For example, if a plurality of two-dimensional pictures of faces is provided as input to a DCNN, the DCNN will learn a variety of characteristics of faces such as edges, curves, angles, dots, color contrasts, bright spots, dark spots, etc. These one or more features are learned at one or more first layers of the DCNN. Then, in one or more second layers, the DCNN will learn a variety of recognizable features of faces such as eyes, eyebrows, foreheads, hair, noses, mouths, cheeks, etc.; each of which is distinguishable from all of the other features. That is, the DCNN learns to recognize and distinguish an eye from an eyebrow or any other facial feature. In one or more third and then subsequent layers, the DCNN learns entire faces and higher order characteristics such as race, gender, age, emotional state, etc. The DCNN is even taught in some cases to recognize the specific identity of a person. For example, a random image can be identified as a face, and the face can be recognized as Orlando Bloom, Andrea Bocelli, or some other identity.
In other examples, a DCNN can be provided with a plurality of pictures of animals, and the DCNN can be taught to identify lions, tigers, and bears; a DCNN can be provided with a plurality of pictures of automobiles, and the DCNN can be taught to identify and distinguish different types of vehicles; and many other DCNNs can also be formed. DCNNs can be used to learn word patterns in sentences, to identify music, to analyze individual shopping patterns, to play video games, to create traffic routes, and DCNNs can be used for many other learning-based tasks too.
FIG. 1 includes FIGS. 1A-1J.
FIG. 1A is a simplified illustration of a convolutional neural network (CNN) system 10. In the CNN system, a two-dimensional array of pixels is processed by the CNN. The CNN analyzes a 10×10 input object plane to determine if a “1” is represented in the plane, if a “0” is represented in the plane, or if neither a “1” nor a “0” is represented in the plane.
In the 10×10 input object plane, each pixel is either illuminated or not illuminated. For the sake of simplicity in illustration, illuminated pixels are filled in (e.g., dark color) and unilluminated pixels are not filled in (e.g., light color).
FIG. 1B illustrates the CNN system 10 of FIG. 1A determining that a first pixel pattern illustrates a “1” and that a second pixel pattern illustrates a “0.” In the real world, however, images do not always align cleanly as illustrated in FIG. 1B.
In FIG. 1C, several variations of different forms of ones and zeroes are shown. In these images, the average human viewer would easily recognize that the particular numeral is translated or scaled, but the viewer would also correctly determine if the image represented a “1” or a “0.” Along these lines, without conscious thought, the human viewer looks beyond image rotation, various weighting of numerals, sizing of numerals, shifting, inversion, overlapping, fragmentation, multiple numerals in the same image, and other such characteristics. Programmatically, however, in traditional computing systems, such analysis is very difficult. A variety of image matching techniques are known, but this type of analysis quickly overwhelms the available computational resources even with very small image sizes. In contrast, however, a CNN system 10 can correctly identify ones, zeroes, both ones and zeroes, or neither a one nor a zero in each processed image with an acceptable degree of accuracy even if the CNN system 10 has never previously “seen” the exact image.
FIG. 1D represents a CNN operation that analyzes (e.g., mathematically combines) portions of an unknown image with corresponding portions of a known image. For example, a 3-pixel portion of the left-side, unknown image B5-C6-D7 may be recognized as matching a corresponding 3-pixel portion of the right-side, known image C7-D8-E9. In these and other cases, a variety of other corresponding pixel arrangements may also be recognized. Some other correspondences are illustrated in Table 1.
TABLE 1
Corresponding known to unknown image segments

FIG. 1D left-side, unknown image    FIG. 1D right-side, known image
C3-B4-B5                            D3-C4-C5
C6-D7-E7-F7-G6                      D8-E9-F9-G9-H8
E1-F2                               G2-H3
G2-H3-H4-H5                         H3-I4-I5-I6
Recognizing that segments or portions of a known image may be matched to corresponding segments or portions of an unknown image, it is further recognized that by unifying the portion matching operation, entire images may be processed in the exact same way while achieving previously uncalculated results. Stated differently, a particular portion size may be selected, and a known image may then be analyzed portion-by-portion. When a pattern within any given portion of a known image is mathematically combined with a similarly sized portion of an unknown image, information is generated that represents the similarity between the portions.
FIG. 1E illustrates six portions of the right-side, known image of FIG. 1D. Each portion, also called a “kernel,” is arranged as a 3-pixel-by-3-pixel array. Computationally, pixels that are illuminated are represented mathematically as a positive “1” (i.e., +1); and pixels that are not illuminated are represented mathematically as a negative “1” (i.e., −1). For the sake of simplifying the illustration in FIG. 1E, each illustrated kernel is also shown with the column and row reference of FIG. 1D.
The six kernels shown in FIG. 1E are representative and selected for ease of understanding the operations of CNN system 10. It is clear that a known image can be represented with a finite set of overlapping or non-overlapping kernels. For example, considering a 3-pixel-by-3-pixel kernel size and a system of overlapping kernels having a stride of one (1), each 10×10 pixel image may have 64 corresponding kernels. That is, a first kernel spans the 9 pixels in columns A, B, C, and rows 1, 2, 3.
A second kernel spans the 9 pixels in columns B, C, D, and rows 1, 2, 3.
A third kernel spans the 9 pixels in columns C, D, E, and rows 1, 2, 3 and so on until an eighth kernel spans the 9 pixels in columns H, I, J, and rows 1, 2, 3.
Kernel alignment continues in this way until a 57th kernel spans columns A, B, C, and rows 8, 9, 10, and a 64th kernel spans columns H, I, J, and rows 8, 9, 10.
In other CNN systems, kernels may be overlapping or not overlapping, and kernels may have strides of 2, 3, or some other number. The different strategies for selecting kernel sizes, strides, positions, and the like are chosen by a CNN system designer based on past results, analytical study, or in some other way.
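The kernel numbering described above can be sketched in a few lines of Python. This is a minimal illustration only; the function names `kernel_positions` and `describe` are hypothetical, and the column letters A-J and row numbers 1-10 are assumed to follow the labeling of FIG. 1D.

```python
def kernel_positions(image_size=10, kernel_size=3, stride=1):
    """Return the (row, col) index of the top-left pixel of every kernel position."""
    last = image_size - kernel_size          # last valid top-left index
    starts = range(0, last + 1, stride)
    # Row-major order: sweep left to right, then drop down one stride.
    return [(r, c) for r in starts for c in starts]


COLUMNS = "ABCDEFGHIJ"                       # assumed column labels of FIG. 1D


def describe(kernel_number, kernel_size=3):
    """Return the columns and rows spanned by a 1-based kernel number (stride of one)."""
    r, c = kernel_positions()[kernel_number - 1]
    rows = tuple(range(r + 1, r + 1 + kernel_size))
    return COLUMNS[c:c + kernel_size], rows


print(len(kernel_positions()))               # number of kernels with a stride of one
print(describe(1), describe(8), describe(57), describe(64))
```

With a stride of one, the sketch yields 64 positions, and kernels 1, 8, 57, and 64 span the columns and rows recited above. Passing a stride of 2 or 3 to `kernel_positions` yields 16 or 9 positions, respectively.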
Returning to the example of FIGS. 1D and 1E, a total of 64 kernels are formed using information in the known image. The first kernel starts with the upper-most, left-most 9 pixels in a 3×3 array. The next seven kernels are sequentially shifted right by one column each. The ninth kernel returns to the first three columns and drops down one row, similar to the carriage return operation of a text-based document, a concept derived from the twentieth-century manual typewriter. Following this pattern, FIG. 1E shows the 7th, 18th, 24th, 32nd, 60th, and 62nd kernels.
Sequentially, or in some other known pattern, each kernel is aligned with a correspondingly sized set of pixels of the image under analysis. In a fully analyzed system, for example, the first kernel is conceptually overlayed on the unknown image in each of the kernel positions. Considering FIGS. 1D and 1E, the first kernel is conceptually overlayed on the unknown image in the position of Kernel No. 1 (left-most, top-most portion of the image), then the first kernel is conceptually overlayed on the unknown image in the position of Kernel No. 2, and so on, until the first kernel is conceptually overlayed on the unknown image in the position of Kernel No. 64 (bottom-most, right-most portion of the image). The procedure is repeated for each of the 64 kernels, and a total of 4096 operations are performed (i.e., 64 kernels in each of 64 positions). In this way, it is also shown that when other CNN systems select different kernel sizes, different strides, and different patterns of conceptual overlay, then the number of operations will change.
In the CNN system 10, the conceptual overlay of each kernel on each portion of an unknown image under analysis is carried out as a mathematical process called convolution. Each of the nine pixels in a kernel is given a value of positive “1” (+1) or negative “1” (−1) based on whether the pixel is illuminated or unilluminated, and when the kernel is overlayed on the portion of the image under analysis, the value of each pixel in the kernel is multiplied by the value of the corresponding pixel in the image. Since each pixel has a value of +1 (i.e., illuminated) or −1 (i.e., unilluminated), the multiplication will always result in either a +1 or a −1. Additionally, since each of the 4096 kernel operations is processed using a 9-pixel kernel, a total of 36,864 mathematical operations (i.e., 9×4096) are performed at this first stage of a single unknown image analysis in a very simple CNN. It is clear that CNN systems require tremendous computational resources.
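The operation counts recited above follow directly from the image size, kernel size, and stride. The following sketch reproduces the arithmetic; the function name `count_operations` is hypothetical, and a square image and square kernel are assumed.

```python
def count_operations(image_size=10, kernel_size=3, stride=1):
    """Return (kernel positions, kernel overlays, pixel multiplications)."""
    per_axis = (image_size - kernel_size) // stride + 1   # positions along one axis
    positions = per_axis ** 2                             # kernels per image
    overlays = positions * positions                      # each kernel in each position
    multiplications = overlays * kernel_size ** 2         # one multiply per kernel pixel
    return positions, overlays, multiplications


print(count_operations())            # (64, 4096, 36864)
print(count_operations(stride=2))    # (16, 256, 2304)
```

The second call illustrates how the count changes when a different stride is selected, as noted with respect to other CNN systems.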
As just described, each of the 9 pixels in a kernel is multiplied by a corresponding pixel in the image under analysis. An unilluminated pixel (−1) in the kernel, when multiplied by an unilluminated pixel (−1) in the subject unknown image, will result in a +1 indicating a “match” at that pixel position (i.e., both the kernel and the image have an unilluminated pixel). Similarly, an illuminated pixel (+1) in the kernel multiplied by an illuminated pixel (+1) in the unknown image also results in a match (+1). On the other hand, when an unilluminated pixel (−1) in the kernel is multiplied by an illuminated pixel (+1) in the image, the result indicates no match (−1) at that pixel position. And when an illuminated pixel (+1) in the kernel is multiplied by an unilluminated pixel (−1) in the image, the result also indicates no match (−1) at that pixel position.
After the nine multiplication operations of a single kernel are performed, the product results will include nine values; each of the nine values being either a positive one (+1) or a negative one (−1). If each pixel in the kernel matches each pixel in the corresponding portion of the unknown image, then the product result will include nine positive one (+1) values. Alternatively, if one or more pixels in the kernel do not match a corresponding pixel in the portion of the unknown image under analysis, then the product result will have at least some negative one (−1) values. If every pixel in the kernel fails to match the corresponding pixel in the corresponding portion of the unknown image under analysis, then the product result will include nine negative one (−1) values.
Considering the mathematical combination (i.e., the multiplication operations) of pixels, it is recognized that the number of positive one (+1) values and the number of negative one (−1) values in a product result represents the degree to which the feature in the kernel matches the portion of the image where the kernel is conceptually overlayed. Thus, by summing all of the products (e.g., summing the nine values) and dividing by the number of pixels (e.g., nine), a single “quality value” is determined. The quality value represents the degree of match between the kernel and the portion of the unknown image under analysis. The quality value can range from negative one (−1), when no kernel pixels match, to positive one (+1), when every pixel in the kernel has the same illuminated/unilluminated status as its corresponding pixel in the unknown image.
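The multiply, sum, and divide steps that produce a quality value can be sketched as follows. The function name `quality_value` and the plus-shaped example kernel are hypothetical and are chosen only so that the kernel has five illuminated and four unilluminated pixels; they are not taken from the figures.

```python
def quality_value(kernel, patch):
    """Multiply corresponding pixels, sum the products, and divide by the pixel count."""
    products = [k * p
                for kernel_row, patch_row in zip(kernel, patch)
                for k, p in zip(kernel_row, patch_row)]
    return sum(products) / len(products)


# Hypothetical kernel: +1 is an illuminated pixel, -1 is an unilluminated pixel.
kernel = [[-1,  1, -1],
          [ 1,  1,  1],
          [-1,  1, -1]]
blank = [[-1] * 3 for _ in range(3)]                   # every pixel unilluminated
inverse = [[-v for v in row] for row in kernel]        # every pixel mismatched

print(quality_value(kernel, kernel))     # 1.0, every pixel matches
print(quality_value(kernel, inverse))    # -1.0, no pixel matches
print(round(quality_value(kernel, blank), 2))
```

For the blank patch, five products are −1 and four are +1, so the sum is −1 and the quality value is −1/9, or about −0.11.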
The acts described herein with respect to FIG. 1E may also collectively be referred to as a first convolutional process in an operation called “filtering.” In a filter operation, a particular portion of interest in a known image is searched for in an unknown image. The purpose of the filter is to identify if and where the feature of interest is found in the unknown image with a corresponding prediction of likelihood.
FIG. 1F illustrates twelve acts of convolution in a filtering process. FIG. 1G shows the results of the twelve convolutional acts of FIG. 1F. In each act, a different portion of the unknown image is processed with a selected kernel. The selected kernel may be recognized as the twelfth kernel in the representative numeral one (“1”) of FIG. 1B. The representative “1” is formed in FIG. 1B as a set of illuminated pixels in a 10-pixel-by-10-pixel image. Starting in the top-most, left-most corner, the first kernel covers a 3-pixel-by-3-pixel portion. The second through eighth kernels sequentially move one column rightward. In the manner of a carriage return, the ninth kernel begins in the second row, left-most column. Kernels 10-16 sequentially move one column rightward for each kernel. Kernels 17-64 may be similarly formed such that each feature of the numeral “1” in FIG. 1B is represented in at least one kernel.
In FIG. 1F(a), a selected kernel of 3-pixels by 3-pixels is conceptually overlayed on a left-most, top-most section of an unknown image. The selected kernel in this case is the twelfth kernel of the numeral “1” of FIG. 1B. The unknown image in FIG. 1F(a) may appear to a human observer as a shifted, poorly formed numeral one (i.e., “1”). In the convolutional process, the value of each pixel in the selected kernel, which is “+1” for illuminated pixels and “−1” for unilluminated pixels, is multiplied by each corresponding pixel in the unknown image. In FIG. 1F(a), five kernel pixels are illuminated, and four kernel pixels are unilluminated. Every pixel in the unknown image is unilluminated. Accordingly, when all nine multiplications are performed, five products are calculated to be “−1,” and four products are calculated to be “+1.” The nine products are summed, and the resulting value of “−1” is divided by nine. For this reason, the corresponding image of FIG. 1G(a) shows a resulting kernel value of “−0.11” for the kernel in the left-most, top-most section of the unknown image.
In FIGS. 1F(b), 1F(c), and 1F(d), the kernel is sequentially moved rightward across the columns of the image. Since each pixel in the area spanning the first six columns and first three rows is also unilluminated, FIGS. 1G(b), 1G(c), and 1G(d) each show a calculated kernel value of “−0.11.”
FIGS. 1F(e) and 1G(e) show a different calculated kernel value from the earlier calculated kernel values of “−0.11.” In FIG. 1F(e), one of the illuminated kernel pixels matches one of the illuminated pixels in the unknown image. This match is shown by a darkened pixel in FIG. 1F(e). Since FIG. 1F(e) now has a different set of matched/unmatched characteristics, and further, since another one of the kernel pixels matches a corresponding pixel in the unknown image, it is expected that the resulting kernel value will increase. Indeed, as shown in FIG. 1G(e), when the nine multiplication operations are carried out, four unilluminated pixels in the kernel match four unilluminated pixels in the unknown image, one illuminated pixel in the kernel matches one illuminated pixel in the unknown image, and four other illuminated pixels in the kernel do not match the unilluminated four pixels in the unknown image. When the nine products are summed, the result of “+1” is divided by nine for a calculated kernel value of “+0.11” in the fifth kernel position.
As the kernel is moved further rightward in FIG. 1F(f), a different one of the illuminated kernel pixels matches a corresponding illuminated pixel in the unknown image. FIG. 1G(f) represents the set of matched and unmatched pixels as a kernel value of “+0.11.”
In FIG. 1F(g), the kernel is moved one more column to the right, and in this position, every pixel in the kernel matches every pixel in the unknown image. Since the nine multiplications performed when each pixel of the kernel is multiplied by its corresponding pixel in the unknown image results in a “+1.0,” the sum of the nine products is calculated to be “+9.0,” and the final kernel value for the particular position is calculated (i.e., 9.0/9) to be “+1.0,” which represents a perfect match.
In FIG. 1F(h), the kernel is moved rightward again, which results in a single illuminated pixel match, four unilluminated pixel matches, and a kernel value of “+0.11,” as illustrated in FIG. 1G(h).
The kernel continues to be moved as shown in FIGS. 1F(i), 1F(j), 1F(k), and 1F(l), and in each position, a kernel value is mathematically calculated. Since no illuminated pixels of the kernel are overlayed on illuminated pixels of the unknown image in FIGS. 1F(i) to 1F(l), the calculated kernel value for each of these positions is “−0.11.” The kernel values are shown in FIGS. 1G(i), 1G(j), 1G(k), and 1G(l) as “−0.11” in the respective four kernel positions.
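Because every product is either +1 (match) or −1 (no match), each kernel value of FIGS. 1G(a) to 1G(l) depends only on the number of matching pixels. The sketch below expresses this relationship; the function name `value_from_matches` is hypothetical, and the per-position match counts are inferred from the kernel values recited above rather than read from the figures.

```python
def value_from_matches(matches, pixels=9):
    """Kernel value when `matches` products are +1 and the rest are -1."""
    mismatches = pixels - matches
    return (matches - mismatches) / pixels


# Match counts inferred for positions (a) through (l) of FIG. 1F.
match_counts = [4, 4, 4, 4, 5, 5, 9, 5, 4, 4, 4, 4]
kernel_values = [round(value_from_matches(m), 2) for m in match_counts]
print(kernel_values)
```

Four matches give −1/9 (about −0.11), five matches give +1/9 (about +0.11), and nine matches give +1.0, consistent with the values described for the twelve positions.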
FIG. 1H illustrates a stack of maps of kernel values. The topmost kernel map in FIG. 1H is formed when the twelfth kernel of the numeral “1” in FIG. 1B is moved into each position of the unknown image. The twelfth kernel will be recognized as the kernel used in each of FIGS. 1F(a) to 1F(l) and FIGS. 1G(a) to 1G(l). For each position where the selected kernel is conceptually overlayed on the unknown image, a kernel value is calculated, and the kernel value is stored in its respective position on the kernel map.
Also in FIG. 1H, other filters (i.e., kernels) are also applied to the unknown image. For simplicity in the discussion, the 29th kernel of the numeral “1” in FIG. 1B is selected, and the 61st kernel of the numeral “1” in FIG. 1B is selected. For each kernel, a distinct kernel map is created. The plurality of created kernel maps may be envisioned as a stack of kernel maps having a depth equal to the number of filters (i.e., kernels) that are applied. The stack of kernel maps may also be called a stack of filtered images.
In the convolutional process of the CNN system 10, a single unknown image is convolved to create a stack of filtered images. The depth of the stack is the same as, or is otherwise based on, the number of filters (i.e., kernels) that are applied to the unknown image. The convolutional process in which a filter is applied to an image is also referred to as a “layer” because such processes can be stacked together.
As evident in FIG. 1H, a large quantity of data is generated during the convolutional layering process. In addition, each kernel map (i.e., each filtered image) has nearly as many values in it as the original image. In the examples presented in FIG. 1H, the original unknown input image is formed by 100 pixels (10×10), and the generated filter map has 64 values (8×8). The simple reduction in size of the kernel map is only realized because the applied 9-pixel kernel values (3×3) cannot fully process the outermost pixels at the edge of the image.
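The reduction from a 10×10 image to an 8×8 kernel map can be observed in a small sketch of the convolutional act. The function name `convolve`, the example image, and the example kernel are hypothetical; a square image and a square kernel are assumed.

```python
def convolve(image, kernel, stride=1):
    """Slide a square kernel over a square image and return the kernel map."""
    k, n = len(kernel), len(image)
    kernel_map = []
    for r in range(0, n - k + 1, stride):
        row = []
        for c in range(0, n - k + 1, stride):
            total = sum(kernel[i][j] * image[r + i][c + j]
                        for i in range(k) for j in range(k))
            row.append(total / (k * k))      # quality value for this position
        kernel_map.append(row)
    return kernel_map


# Hypothetical 10x10 image: a vertical bar of illuminated pixels at column index 6.
image = [[1 if c == 6 else -1 for c in range(10)] for _ in range(10)]
# Hypothetical 3x3 kernel: a vertical bar in the middle column.
kernel = [[-1, 1, -1] for _ in range(3)]

kernel_map = convolve(image, kernel)
print(len(kernel_map), len(kernel_map[0]))   # 8 8
print(kernel_map[0][5])                      # 1.0 where the kernel aligns with the bar
```

Because the 3×3 kernel cannot be centered on the outermost pixels, only eight positions fit along each axis, which gives the 64 values of the kernel map.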
FIG. 1I shows a pooling feature that significantly reduces the quantity of data produced by the convolutional processes. A pooling process may be performed on one, some, or all of the filtered images. The kernel map in FIG. 1I is recognized as the top-most filter map of FIG. 1H, which is formed with the 12th kernel of the numeral “1” in FIG. 1B.
The pooling process introduces the concepts of “window size” and “stride.” The window size is the dimensions of a window such that a single, maximum value within the window will be selected in the pooling process. A window may be formed having dimensions of m-pixels by n-pixels wherein “m” and “n” are integers, but in most cases, “m” and “n” are equal. In the pooling operation shown in FIG. 1I, each window is formed as a 2-pixel-by-2-pixel window. In the pooling operation, a 4-pixel window is conceptually overlayed onto a selected portion of the kernel map, and within the window, the highest value is selected.
In the pooling operation, in a manner similar to conceptually overlaying a kernel on an unknown image, the pooling window is conceptually overlayed onto each portion of the kernel map. The “stride” represents how much the pooling window is moved after each pooling act. If the stride is set to “two,” then the pooling window is moved by two pixels after each pooling act. If the stride is set to “three,” then the pooling window is moved by three pixels after each pooling act.
In the pooling operation of FIG. 1I, the pooling window size is set to 2×2, and the stride is also set to two. A first pooling operation is performed by selecting the four pixels in the top-most, left-most corner of the kernel map. Since each kernel value in the window has been calculated to be “−0.11,” the value from the pooling calculation is also “−0.11.” The value of “−0.11” is placed in the top-most, left-most corner of the pooled output map in FIG. 1I.
The pooling window is then moved rightward by the selected stride of two pixels, and the second pooling act is performed. Once again, since each kernel value in the second pooling window is calculated to be “−0.11,” the value from the pooling calculation is also “−0.11.” The value of “−0.11” is placed in the second entry of the top row of the pooled output map in FIG. 1I.
The pooling window is moved rightward by a stride of two pixels, and the four values in the window are evaluated. The four values in the third pooling act are “+0.11,” “+0.11,” “+0.11,” and “+0.33.” Here, in this group of four kernel values, “+0.33” is the highest value. Therefore, the value of “+0.33” is placed in the third entry of the top row of the pooled output map in FIG. 1I. The pooling operation is indifferent to where in the window the highest value is found; it simply selects the highest (i.e., the greatest) value that falls within the boundaries of the window.
The remaining 13 pooling operations are also performed in a like manner so as to fill the remainder of the pooled output map of FIG. 1I. Similar pooling operations may also be performed for some or all of the other generated kernel maps (i.e., filtered images). Further considering the pooled output of FIG. 1I, and further considering the selected kernel (i.e., the twelfth kernel of the numeral “1” in FIG. 1B) and the unknown image, it is recognized that the highest values are found in the upper right-hand corner of the pooled output. This is so because when the kernel feature is applied to the unknown image, the highest correlations between the pixels of the selected feature of interest (i.e., the kernel) and the similarly arranged pixels in the unknown image are also found in the upper right-hand corner. It is also recognized that the pooled output has values captured in it that loosely represent the values in the un-pooled, larger-sized kernel map. If a particular pattern in an unknown image is being searched for, then the approximate position of the pattern can be learned from the pooled output map. Even if the actual position of the feature isn't known with certainty, an observer can recognize that the feature was detected in the pooled output. The actual feature may be moved a little bit left or a little bit right in the unknown image, or the actual feature may be rotated or otherwise not identical to the kernel feature, but nevertheless, the occurrence of the feature and its general position may be recognized.
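The max pooling operation described above can be sketched as follows. The function name `max_pool` is hypothetical, and the 8×8 example map is illustrative only: it reproduces the first three pooling windows recited above and fills every other entry with −0.11, which is an assumption and not the actual content of FIG. 1I.

```python
def max_pool(kernel_map, window=2, stride=2):
    """Select the greatest value within each window position of the kernel map."""
    rows, cols = len(kernel_map), len(kernel_map[0])
    return [[max(kernel_map[r + i][c + j]
                 for i in range(window) for j in range(window))
             for c in range(0, cols - window + 1, stride)]
            for r in range(0, rows - window + 1, stride)]


# Illustrative 8x8 kernel map (assumed values outside the third window).
kernel_map = [[-0.11] * 8 for _ in range(8)]
kernel_map[0][4], kernel_map[0][5] = 0.11, 0.11
kernel_map[1][4], kernel_map[1][5] = 0.11, 0.33

pooled = max_pool(kernel_map)
print(len(pooled), len(pooled[0]))   # 4 4
print(pooled[0])                     # [-0.11, -0.11, 0.33, -0.11]
```

An 8×8 kernel map pooled with a 2×2 window and a stride of two produces a 4×4 pooled output map, which accounts for the three pooling acts described individually plus the remaining 13.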
An optional normalization operation is also illustrated in FIG. 1I. The normalization operation is typically performed by a Rectified Linear Unit (ReLU). The ReLU identifies every negative number in the pooled output map and replaces the negative number with the value of zero (i.e., “0”) in a normalized output map. The optional normalization process by one or more ReLU circuits helps to reduce the computational resource workload that may otherwise be required by calculations performed with negative numbers.
After processing in the ReLU layer, data in the normalized output map may be averaged in order to predict whether or not the feature of interest characterized by the kernel is found or is not found in the unknown image. In this way, each value in a normalized output map is used as a weighted “vote” that indicates whether or not the feature is present in the image. In some cases, several features (i.e., kernels) are convolved, and the predictions are further combined to characterize the image more broadly. For example, as illustrated in FIG. 1H, three kernels of interest derived from a known image of a numeral “1” are convolved with an unknown image. After processing each kernel through the various layers, a prediction is made as to whether or not the unknown image includes one or more pixel patterns that show a numeral “1.”
Summarizing FIGS. 1A-1I, kernels are selected from a known image. Not every kernel of the known image needs to be used by the CNN. Instead, kernels that are determined to be “important” features may be selected. After the convolution process produces a kernel map (i.e., a feature image), the kernel map is passed through a pooling layer, and a normalization (i.e., ReLU) layer. All of the values in the output maps are averaged (i.e., sum and divide), and the output value from the averaging is used as a prediction of whether or not the unknown image contains the particular feature found in the known image. In the exemplary case, the output value is used to predict whether the unknown image contains a numeral “1.” In some cases, the “list of votes” may also be used as input to subsequent stacked layers. This manner of processing reinforces strongly identified features and reduces the influence of weakly identified (or unidentified) features. Considering the entire CNN, a two-dimensional image is input to the CNN and produces a set of votes at its output. The set of votes at the output are used to predict whether the input image either does or does not contain the object of interest that is characterized by the features.
The CNN system 10 of FIG. 1A may be implemented as a series of operational layers. One or more convolutional layers may be followed by one or more pooling layers, and the one or more pooling layers may be optionally followed by one or more normalization layers. The convolutional layers create a plurality of kernel maps, which are otherwise called filtered images, from a single unknown image. The large quantity of data in the plurality of filtered images is reduced with one or more pooling layers, and the quantity of data is reduced further by one or more ReLU layers that normalize the data by removing all negative numbers.
FIG. 1J shows the CNN system 10 of FIG. 1A in more detail. In FIG. 1J(a), the CNN system 10 accepts a 10-pixel-by-10-pixel input image into a CNN. The CNN includes a convolutional layer, a pooling layer, a rectified linear unit (ReLU) layer, and a voting layer. One or more kernel values are convolved in cooperation with the unknown 10×10 image, and the output from the convolutional layer is passed to the pooling layer. One or more max pooling operations are performed on each kernel map provided by the convolutional layer. Pooled output maps from the pooling layer are used as input to a ReLU layer that produces normalized output maps, and the data contained in the normalized output maps is summed and divided to determine a prediction as to whether or not the input image includes a numeral “1” or a numeral “0.”
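The sequence of layers in FIG. 1J(a) can be sketched end to end as follows. This is a simplified illustration under stated assumptions and not the implementation of the CNN system 10: all function names, the example images, and the example kernel are hypothetical, and the voting layer is assumed to be an unweighted average.

```python
def convolve(image, kernel):
    """Convolutional layer: one quality value per kernel position (stride of one)."""
    k, n = len(kernel), len(image)
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k)) / (k * k)
             for c in range(n - k + 1)]
            for r in range(n - k + 1)]


def max_pool(m, window=2, stride=2):
    """Pooling layer: keep the greatest value in each window."""
    return [[max(m[r + i][c + j] for i in range(window) for j in range(window))
             for c in range(0, len(m[0]) - window + 1, stride)]
            for r in range(0, len(m) - window + 1, stride)]


def relu(m):
    """ReLU layer: replace every negative value with zero."""
    return [[max(0.0, v) for v in row] for row in m]


def vote(m):
    """Voting layer (assumed unweighted): sum the values and divide by their count."""
    values = [v for row in m for v in row]
    return sum(values) / len(values)


def predict(image, kernels):
    """Pass the image through each layer for each kernel and average the votes."""
    votes = [vote(relu(max_pool(convolve(image, k)))) for k in kernels]
    return sum(votes) / len(votes)


# Hypothetical inputs: a vertical bar at column index 4, and a blank image.
bar = [[1 if c == 4 else -1 for c in range(10)] for _ in range(10)]
blank = [[-1] * 10 for _ in range(10)]
kernel = [[-1, 1, -1] for _ in range(3)]

print(predict(bar, [kernel]), predict(blank, [kernel]))
```

In this sketch the image that contains the feature characterized by the kernel receives the higher output value, which is the behavior the voting layer relies on.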
In FIG. 1J(b), another CNN system 10a is illustrated. The CNN in the CNN system 10a includes a plurality of layers, which may include convolutional layers, pooling layers, normalization layers, and voting layers. The output from one layer is used as the input to a next layer. In each pass through a convolutional layer, the data is filtered. Accordingly, both image data and other types of data may be convolved to search for (i.e., filter) any particular feature. When passing through pooling layers, the input data generally retains its predictive information, but the quantity of data is reduced. Since the CNN system 10a of FIG. 1J(b) includes many layers, the CNN is arranged to predict that the input image contains any one of many different features.
One other characteristic of a CNN is the use of back propagation to reduce errors and improve the quality of the neural network to recognize particular features in the midst of vast quantities of input data. For example, if the CNN arrives at a prediction that is less than 1.0, and the prediction is later determined to be accurate, then the difference between the predicted value and 1.0 is considered an error rate. Since the goal of the neural network is to accurately predict whether or not a particular feature is included in an input data set, the CNN can be further directed to automatically adjust weighting values that are applied in a voting layer.
Back propagation mechanisms are arranged to implement a feature of gradient descent. Gradient descent may be applied on a two-dimensional map wherein one axis of the map represents “error rate,” and the other axis of the map represents “weight.” In this way, such a gradient-descent map will preferably take on a parabolic shape such that if an error rate is high, then the weight of that derived value will be low. As error rate drops, then the weight of the derived value will increase. Accordingly, when a CNN that implements back propagation continues to operate, the accuracy of the CNN has the potential to continue improving itself automatically.
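Gradient descent on a parabolic error curve can be illustrated with a single weight. The sketch below is one common textbook formulation and is offered only as an assumption about how such a mechanism may be arranged; the function name `gradient_descent`, the squared-error function, the target weight of 0.75, and the learning rate are all hypothetical.

```python
def gradient_descent(gradient, weight, learning_rate=0.1, steps=100):
    """Repeatedly move the weight a small step against the error gradient."""
    for _ in range(steps):
        weight -= learning_rate * gradient(weight)
    return weight


TARGET = 0.75                         # hypothetical weight at which the error is lowest


def error(w):
    """Hypothetical parabolic error curve with its minimum at TARGET."""
    return (w - TARGET) ** 2


def gradient(w):
    """Derivative of the error curve with respect to the weight."""
    return 2 * (w - TARGET)


learned = gradient_descent(gradient, weight=0.0)
print(round(learned, 6), error(learned) < error(0.0))
```

Each step reduces the error, so the weight settles near the bottom of the parabola without any manual adjustment.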
The performance of known object recognition techniques that use machine learning methods is improved by applying more powerful models to larger datasets, and implementing better techniques to prevent overfitting. Two known large datasets include LabelMe and ImageNet. LabelMe includes hundreds of thousands of fully segmented images, and more than 15 million high-resolution, labeled images in over 22,000 categories are included in ImageNet.
To learn about thousands of objects from millions of images, the model that is applied to the images requires a large learning capacity. One type of model that has sufficient learning capacity is a convolutional neural network (CNN) model. In order to compensate for an absence of specific information about the huge pool of data, the CNN model is arranged with at least some prior knowledge of the data set (e.g., statistical stationarity/non-stationarity, spatiality, temporality, locality of pixel dependencies, and the like). The CNN model is further arranged with a designer selectable set of features such as capacity, depth, breadth, number of layers, and the like.
Early CNNs were implemented with large, specialized super-computers. Conventional CNNs are implemented with customized, powerful graphic processing units (GPUs). As described by Krizhevsky, “current GPUs, paired with a highly optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.”
FIG. 2 includes FIGS. 2A-2B.
FIG. 2A is an illustration of the known AlexNet DCNN architecture. As described by Krizhevsky, FIG. 2A shows the “delineation of responsibilities between [the] two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440-186,624-64,896-64,896-43,264-4096-4096-1000.”
Krizhevsky's two GPUs implement a highly optimized two-dimensional (2D) convolution framework. The final network contains eight learned layers with weights. The eight layers consist of five convolutional layers CL1-CL5, some of which are followed by max-pooling layers, and three fully connected layers FC with a final 1000-way softmax, which produces a distribution over 1000 class labels.
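As a non-limiting illustration, a softmax that turns raw class scores into a distribution over class labels may be sketched as follows. The function name and the example scores are assumptions made for illustration; the sketch is not drawn from Krizhevsky's implementation.

```python
import math


def softmax(scores):
    """Map raw class scores to probabilities that sum to one."""
    # Subtract the maximum score first for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three hypothetical class scores; the largest score gets the most mass.
probs = softmax([2.0, 1.0, 0.1])

# A 1000-way softmax over equal scores yields a uniform distribution.
uniform = softmax([0.0] * 1000)
```

The same function applies unchanged to a list of 1000 scores, one per class label.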
In FIG. 2A, kernels of convolutional layers CL2, CL4, CL5 are connected only to kernel maps of the previous layer that are processed on the same GPU. In contrast, kernels of convolutional layer CL3 are connected to all kernel maps in convolutional layer CL2. Neurons in the fully connected layers FC are connected to all neurons in the previous layer.
Response-normalization layers follow the convolutional layers CL1, CL2. Max-pooling layers follow both the response-normalization layers as well as convolutional layer CL5. The max-pooling layers summarize the outputs of neighboring groups of neurons in the same kernel map. Rectified Linear Unit (ReLU) non-linearity is applied to the output of every convolutional and fully connected layer.
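The ReLU non-linearity and the summarizing behavior of max-pooling may be sketched, in a non-limiting way, as follows. The 4×4 kernel map, the 2×2 pooling window, and the stride of 2 are assumptions chosen to keep the example small; they are not the window parameters of the AlexNet architecture, and response normalization is omitted.

```python
def relu(feature_map):
    """Replace every negative value in a 2D kernel map with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Summarize each size-by-size neighborhood by its maximum value."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(max(window))
        pooled.append(out_row)
    return pooled


# Hypothetical 4x4 kernel map (assumption, for illustration only).
fmap = [
    [-1.0, 2.0, 0.5, -3.0],
    [4.0, -2.0, 1.0, 0.0],
    [0.0, 1.5, -1.0, -0.5],
    [-2.0, 3.0, 2.5, 6.0],
]
activated = relu(fmap)
pooled = max_pool(activated)  # [[4.0, 1.0], [3.0, 6.0]]
```

Each output value stands in for a neighboring group of four neurons in the same kernel map.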
The first convolutional layer CL1 in the AlexNet architecture of FIG. 2A filters a 224×224×3 input image with 96 kernels of size 11×11×3 with a stride of 4 pixels. This stride is the distance between the receptive field centers of neighboring neurons in a kernel map. The second convolutional layer CL2 takes as input the response-normalized and pooled output of the first convolutional layer CL1 and filters the output of the first convolutional layer with 256 kernels of size 5×5×48. The third, fourth, and fifth convolutional layers CL3, CL4, CL5 are connected to one another without any intervening pooling or normalization layers. The third convolutional layer CL3 has 384 kernels of size 3×3×256 connected to the normalized, pooled outputs of the second convolutional layer CL2. The fourth convolutional layer CL4 has 384 kernels of size 3×3×192, and the fifth convolutional layer CL5 has 256 kernels of size 3×3×192. The fully connected layers have 4096 neurons each.
The eight-layer depth of the AlexNet architecture seems to be important because particular testing revealed that removing any convolutional layer resulted in unacceptably diminished performance. The network's size is limited by the amount of memory available on the implemented GPUs and by the amount of training time that is deemed tolerable. The AlexNet DCNN architecture of FIG. 2A takes between five and six days to train on two NVIDIA GEFORCE GTX 580 3 GB GPUs.
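The kernel counts and sizes recited above imply a number of kernel weights per convolutional layer, which may be tallied in a non-limiting way as follows. The sketch counts kernel weights only and omits bias terms; the variable and function names are illustrative assumptions. A kernel depth of 48 or 192 is half the kernel count of the preceding layer (96 or 384), which is consistent with those kernels being connected only to the kernel maps processed on the same GPU.

```python
# (layer, number of kernels, kernel height, kernel width, kernel depth)
CONV_LAYERS = [
    ("CL1", 96, 11, 11, 3),
    ("CL2", 256, 5, 5, 48),
    ("CL3", 384, 3, 3, 256),
    ("CL4", 384, 3, 3, 192),
    ("CL5", 256, 3, 3, 192),
]


def kernel_weights(num_kernels, height, width, depth):
    """Count the kernel weights of one layer (bias terms not included)."""
    return num_kernels * height * width * depth


weights = {name: kernel_weights(n, h, w, d)
           for name, n, h, w, d in CONV_LAYERS}
total_conv_weights = sum(weights.values())

# Dimensionality of a 224x224 input image with 3 color channels.
input_values = 224 * 224 * 3

for name, count in weights.items():
    print(f"{name}: {count:,} kernel weights")
print(f"total: {total_conv_weights:,}; input values: {input_values:,}")
```

The computed input dimensionality, 150,528, matches the figure quoted from Krizhevsky.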
FIG. 2B is a block diagram of a known GPU such as the NVIDIA GEFORCE GTX 580 GPU. The GPU is a streaming multiprocessor containing 32 unified device architecture processors that employ a flexible scalar architecture. The GPU is arranged for texture processing, shadow map processing, and other graphics-centric processing. Each of the 32 processors in the GPU includes a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). The FPU complies with the IEEE 754-2008 industry standard for floating-point arithmetic. The GPU in this case is particularly configured for desktop applications.
Processing in the GPU is scheduled in groups of 32 threads called warps. Each of the 32 threads executes the same instructions simultaneously. The GPU includes two warp schedulers and two instruction dispatch units. In this arrangement, two independent warps can be issued and executed at the same time.
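The grouping of threads into warps may be sketched, in a non-limiting way, as follows. The function names are hypothetical, and the issue-round estimate is a deliberately idealized assumption (two warps issued per round, no stalls or dependencies); it is not a model of the actual scheduling behavior of any particular GPU.

```python
WARP_SIZE = 32        # threads per warp, as described above
WARP_SCHEDULERS = 2   # warps that can be issued at the same time


def warps_needed(num_threads, warp_size=WARP_SIZE):
    """Number of warps required to hold num_threads (ceiling division)."""
    return (num_threads + warp_size - 1) // warp_size


def idealized_issue_rounds(num_warps, schedulers=WARP_SCHEDULERS):
    """Rounds needed if every round issues one warp per scheduler."""
    return (num_warps + schedulers - 1) // schedulers


# 1000 threads do not divide evenly by 32, so the last warp is partial.
warps = warps_needed(1000)              # 32 warps
rounds = idealized_issue_rounds(warps)  # 16 rounds
```

A thread count that is not a multiple of 32 still occupies a whole final warp, which is why the division rounds up.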
All of the subject matter discussed in the Background section is not necessarily prior art and should not be assumed to be prior art merely as a result of its discussion in the Background section. Along these lines, any recognition of problems in the prior art discussed in the Background section or associated with such subject matter should not be treated as prior art unless expressly stated to be prior art. Instead, the discussion of any subject matter in the Background section should be treated as part of the inventor's approach to the particular problem, which in and of itself may also be inventive.