Artificial intelligence is used in a variety of contexts. For example, facial recognition (or face recognition) oftentimes makes use of artificial intelligence techniques. Computerized facial recognition technology has many useful applications including, for instance, automatic identification of potential security threats in airports and other locations, in mobile and other computing platforms in connection with biometric authentication techniques, etc. Facial recognition is even becoming popular in more commercial applications such as, for example, payment authentication.
With recent advancements in the computer science field of deep learning, state-of-the-art face recognition techniques have been able to take advantage of metric learning. This approach typically involves learning a discriminative large-margin facial feature vector representation with a convolutional neural network, and using this vector representation to compare the facial features of a given image with faces within databases. In the authentication context, for example, in contrast with more traditional username and password based authentication where the system only compares the password provided by a known user (e.g., because the username is specified and thus the system “knows” who the user is or is purporting to be), face recognition systems do not know who the user is or is purporting to be. As a result, such systems typically try to find the best vector match in a database and determine whether the similarity exceeds a threshold. It will be appreciated, of course, that such vector matching is used in a variety of contexts beyond simply the authentication context, e.g., as noted above.
In a nutshell, many state-of-the-art face recognition algorithms represent facial features as a vector, such that the same person has a similar representation and the vectors of different persons are highly discriminative. Although the dimensions of vectors oftentimes are restricted by carefully designed convolutional neural networks and dimensionality reduction techniques like Principal Component Analysis, the final feature vectors are still high dimensional. For example, the “SphereFace: Deep Hypersphere Embedding for Face Recognition” paper describes extracting facial features from the output of the first fully connected layer of the convolutional neural network, which is a 512-dimensional vector. The final representation is obtained by concatenating the original facial features and horizontally flipped features. The cosine distance metric and a nearest neighbor classifier are used for face identification.
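By way of illustration only, the concatenate-then-compare approach described above can be sketched as follows. This is a minimal sketch and not the implementation from the cited paper: the function names, the tiny two-dimensional feature vectors, and the gallery identities are hypothetical, and the feature extraction network itself is assumed to exist elsewhere.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def concat_representation(features, flipped_features):
    """Final representation: original features followed by flipped-image features."""
    return list(features) + list(flipped_features)

def nearest_neighbor(query, gallery):
    """Linear scan returning the gallery identity with the highest cosine similarity."""
    best_id, best_score = None, float("-inf")
    for identity, vector in gallery.items():
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_id, best_score = identity, score
    return best_id, best_score

# Hypothetical gallery; real feature vectors would have hundreds of dimensions.
gallery = {
    "alice": concat_representation([1.0, 0.0], [0.9, 0.1]),
    "bob": concat_representation([0.0, 1.0], [0.1, 0.9]),
}
query = concat_representation([0.8, 0.2], [0.7, 0.3])
match_id, match_score = nearest_neighbor(query, gallery)
```

Note that the linear scan visits every gallery entry, which is the cost that becomes problematic for large databases.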
As another example approach, “Deep 3D Face Identification” involves a facial feature representation that is a 4096-dimensional vector that is normalized by taking the square root of each element. Then, Principal Component Analysis is performed to reduce the dimensions, facilitating a comparison of the vectors with linear searching and cosine similarity.
Somewhat similarly, “NormFace: L2 Hypersphere Embedding for Face Verification” involves extracting features from both the frontal face and its mirror image, and merging the two features by element-wise summation. Principal Component Analysis is then applied, and a similarity score is computed using the cosine distance of two features.
As still another example, in “Face Recognition via Centralized Coordinate Learning,” the trained network generates two 374-dimensional feature vectors from a pre-aligned image and its corresponding flipped image. These two vectors are averaged and then normalized into a unit-length vector. The similarity score between two face images is computed by the cosine distance of feature vectors.
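For purposes of illustration, the merging steps mentioned in the two preceding examples (element-wise summation in one case, averaging followed by unit-length normalization in the other) can be sketched as shown below. The helper names and sample values are hypothetical, and the Principal Component Analysis step is omitted. One observation that follows directly from the arithmetic is that, once the merged vector is normalized to unit length, summation and averaging produce the same direction, and the cosine similarity of two unit vectors reduces to a dot product.

```python
import math

def merge_sum(a, b):
    """Element-wise summation of two feature vectors."""
    return [x + y for x, y in zip(a, b)]

def merge_mean(a, b):
    """Element-wise average of two feature vectors."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]

def unit_normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical features from an aligned image and its horizontally flipped copy.
original = [0.2, 0.5, 0.1]
flipped = [0.4, 0.3, 0.1]

from_sum = unit_normalize(merge_sum(original, flipped))
from_mean = unit_normalize(merge_mean(original, flipped))
similarity = dot(from_sum, from_mean)
```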
Conventional vector matching algorithms like those that use KD-trees work well with low-dimensional vectors. Yet facial feature vectors commonly have hundreds or many hundreds (and sometimes thousands) of dimensions, as shown above. Conventional algorithms and data structures in general cannot support this dimensionality, and it simply is or becomes “too much” for many of them. Moreover, a linear or brute force search oftentimes will be too slow for large databases.
In a related regard, many conventional vector matching algorithms arrange vectors in a particular way, which can be a convenient reference for later searching. Although such approaches typically are much faster than linear searches, they unfortunately still suffer from the so-called curse of dimensionality, which makes them generally unsuitable for high dimensional vector matching. As is known, the curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings and, in general, relates to the fact that, when the dimensionality of a space increases, the volume of the space increases so fast that the available data becomes sparse, presenting problems in terms of data organization, etc.
For instance, an often-used way to cluster multi-dimensional data is to use KD-trees. Those kinds of trees take a value on one dimension and separate all data into categories that are either higher or lower than the separation value. As a result, the data are separated based on one dimension. For each separation, this process is repeated with a different dimension. This will be done with all dimensions. However, KD-trees do not work well with higher dimensions, especially when the number of data points is low (one normally wants to have a minimum of 2^k data points, where k is the number of dimensions). In the case of face recognition, feature vectors usually have many hundreds of dimensions; thus, even for a 100-dimensional vector, more than 1e30 (2^100) data values are needed. Therefore, these conventional approaches generally are not applicable for this type of application.
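The KD-tree construction and the data-point rule of thumb described above can be sketched as follows. This is a simplified illustration: splitting at the median and cycling through the dimensions in order are common choices assumed here, and the function names and sample points are hypothetical.

```python
def build_kd_tree(points, depth=0):
    """Recursively split points on one dimension per level, cycling through dimensions."""
    if not points:
        return None
    k = len(points[0])               # number of dimensions
    axis = depth % k                 # dimension used for this split
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2          # assumption: median is the separation value
    return {
        "point": ordered[mid],
        "axis": axis,
        "lower": build_kd_tree(ordered[:mid], depth + 1),
        "higher": build_kd_tree(ordered[mid + 1:], depth + 1),
    }

def min_points_rule_of_thumb(k):
    """Rule of thumb from the text: at least 2^k data points for k dimensions."""
    return 2 ** k

tree = build_kd_tree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
needed_for_100_dims = min_points_rule_of_thumb(100)
```

For k = 100, the rule of thumb yields 2^100, which is roughly 1.27e30 and far exceeds the size of any practical face database.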
It will be appreciated that it would be desirable to improve upon current vector matching techniques, especially in the facial recognition context. For example, it will be appreciated that it would be desirable to reduce the search space, thereby speeding up the facial recognition and/or improving its accuracy.
One aspect of certain example embodiments relates to addressing the above-described and/or other issues. For example, one aspect of certain example embodiments relates to an approach to accelerating facial feature vector matching, e.g., to speed up the facial recognition and/or improve its accuracy. Certain example embodiments leverage the special properties of facial feature vectors and supervised machine learning techniques. In certain example embodiments, latency associated with the facial feature vector matching can be greatly reduced, potentially leaving only one-quarter (or a similar fraction) of the space to be searched (e.g., depending on how many classifiers are used, as explained in greater detail below), without having to rely on (but still allowing for) a reduction of the vector size.
Certain example embodiments advantageously improve computerized facial recognition techniques by combining hard and soft parameter sharing techniques in connection with two different neural networks that are a part of the same overall system. Certain example embodiments advantageously use small size facial feature vectors as neural network inputs and reduce the search space by implementing multi-task learning classifiers.
In certain example embodiments, a facial recognition system is provided. An electronic interface is configured to receive images including faces. A data store stores reference facial feature vectors and associated classification features. Processing resources including at least one hardware processor and a memory coupled thereto are configured to support: a face recognition neural network configured to generate facial feature vectors for faces in received images; and a multi-task learning classifier network configured to receive facial feature vectors generated by the face recognition neural network and output a plurality of classifications based on the facial feature vectors for the corresponding images. The classifications are built by respective sub-networks each comprising a plurality of classification layers arranged in levels. The classification layers in at least some of the levels receive as input intermediate outputs from an immediately upstream classification layer in the same sub-network and an immediately upstream classification layer in each other sub-network. The processing resources are configured to control at least a part of the facial recognition system to at least: receive from the face recognition neural network a resultant facial feature vector corresponding to an input image received over the electronic interface; receive from the multi-task learning classifier network a plurality of resultant classifiers for the input image and based at least in part on the resultant facial feature vector; search the data store for a subset of reference facial feature vectors, the subset of reference facial feature vectors having associated classification features matching the resultant classifiers; and perform vector matching to identify which one(s) of the reference facial feature vectors in the subset of reference facial feature vectors most closely match(es) the resultant facial feature vector.
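By way of example and without limitation, the search-then-match flow described above can be sketched as follows. The sketch assumes that the classifier outputs are available as a tuple of labels and that the data store is an in-memory list of records; the record layout, the exact-equality label filter, and all names are hypothetical and chosen only for illustration. The neural networks that would produce the vector and the labels are not shown.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def filtered_match(query_vector, query_labels, data_store):
    """Restrict the search to records with matching labels, then match vectors."""
    # Step 1: keep only reference records whose classification features match.
    subset = [rec for rec in data_store if rec["labels"] == query_labels]
    # Step 2: vector matching within the reduced subset only.
    best_record, best_score = None, float("-inf")
    for rec in subset:
        score = cosine_similarity(query_vector, rec["vector"])
        if score > best_score:
            best_record, best_score = rec, score
    return best_record, best_score, len(subset)

# Hypothetical data store; labels stand in for two binary classifier outputs.
data_store = [
    {"id": "u1", "labels": (0, 0), "vector": [1.0, 0.0, 0.0]},
    {"id": "u2", "labels": (0, 1), "vector": [0.0, 1.0, 0.0]},
    {"id": "u3", "labels": (1, 0), "vector": [0.0, 0.0, 1.0]},
    {"id": "u4", "labels": (0, 1), "vector": [0.6, 0.8, 0.0]},
]
record, score, searched = filtered_match([0.5, 0.9, 0.0], (0, 1), data_store)
```

In this toy example only two of the four stored vectors are compared. With two binary classifiers and evenly distributed labels, roughly one-quarter of the data store would be searched.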
In certain example embodiments, a facial recognition system is provided. An electronic interface is configured to receive images including faces. A data store stores reference facial feature vectors and associated classification features. Processing resources including at least one hardware processor and a memory coupled thereto are configured to support: a face recognition neural network configured to generate facial feature vectors for faces in received images, the generated facial feature vectors having fewer than 512 dimensions; and a multi-task learning classifier neural network configured to receive facial feature vectors generated by the face recognition neural network and output a plurality of classifications based on the facial feature vectors for the corresponding images. The classifications are built by respective sub-networks each comprising a plurality of classification layers arranged in levels. The classification layers in at least some of the levels receive as input intermediate outputs from an immediately upstream classification layer in the same sub-network and an immediately upstream classification layer in each other sub-network. The face recognition neural network and the multi-task learning classifier neural network are separately trained. The multi-task learning classifier neural network is trained with data from a subset of training images used to train the face recognition neural network, the subset including classification labels corresponding to classifications that can be output by the multi-task learning classifier neural network.
In certain example embodiments, a facial recognition system is provided. A camera is configured to receive images including faces. A database stores reference facial feature vectors and associated classification features. Processing resources including at least one hardware processor and a memory coupled thereto are configured to support: a face recognition neural network configured to generate facial feature vectors for faces in received images, the generated facial feature vectors having fewer than 512 dimensions; and a multi-task learning classifier neural network configured to receive facial feature vectors generated by the face recognition neural network and output a plurality of classifications based on the facial feature vectors for the corresponding images. The classifications are built by respective sub-networks that implement soft parameter sharing therebetween. The processing resources are configured to control at least a part of the facial recognition system to at least: receive from the face recognition neural network a resultant facial feature vector corresponding to an input image received from the camera; receive from the multi-task learning classifier neural network a plurality of resultant classifiers for the input image and based at least in part on the resultant facial feature vector; search the database for a subset of reference facial feature vectors, the subset of reference facial feature vectors having associated classification features matching the resultant classifiers; and identify which one(s) of the reference facial feature vectors in the subset of reference facial feature vectors most closely match(es) the resultant facial feature vector. The face recognition neural network and the multi-task learning classifier neural network are separately trained. 
The multi-task learning classifier neural network is trained with data from a subset of training images used to train the face recognition neural network, with the subset including classification labels corresponding to classifications that can be output by the multi-task learning classifier neural network.
According to certain example embodiments, the generated facial feature vectors have fewer than 512 dimensions, e.g., 100-300 dimensions.
According to certain example embodiments, the sub-networks may be binary and/or other classifiers, and there may be two, three, or more different sub-networks in different example embodiments. According to certain example embodiments, there may be two, three, four, or more levels in the various different sub-networks.
According to certain example embodiments, the processing resources help facilitate hard parameter sharing between the face recognition neural network and the multi-task learning classifier network, and soft parameter sharing between sub-networks in the multi-task learning classifier network. For example, cross-stitching may enable the facial recognition techniques to implement both shared and non-shared layers.
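As one possible illustration of soft parameter sharing between sub-networks, a cross-stitch style combination can be sketched as a weighted mix of the activations coming from each sub-network at a given level. This is a simplified sketch under stated assumptions rather than a description of any particular embodiment: the mixing weights in `alpha` are hypothetical values that would ordinarily be learned, and the activations are plain lists standing in for layer outputs.

```python
def cross_stitch(activations, alpha):
    """Mix same-level activations across sub-networks.

    activations[j] is the activation vector from sub-network j.
    alpha[i][j] weights how much of sub-network j's activation is fed
    to the next layer of sub-network i.
    """
    count = len(activations)
    width = len(activations[0])
    return [
        [sum(alpha[i][j] * activations[j][d] for j in range(count))
         for d in range(width)]
        for i in range(count)
    ]

# Two hypothetical sub-networks with two-unit activations at one level.
acts = [[1.0, 0.0], [0.0, 1.0]]

# Mostly task-specific, with a small shared contribution from the other task.
mixed = cross_stitch(acts, [[0.9, 0.1], [0.1, 0.9]])

# An identity matrix corresponds to no sharing at this level.
unshared = cross_stitch(acts, [[1.0, 0.0], [0.0, 1.0]])
```

Weights near the identity matrix keep a level effectively non-shared, while more uniform weights make the level behave more like a shared one.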
According to certain example embodiments, the input image may be received in connection with an authentication request; the reference facial feature vectors and associated classification features in the data store may correspond to known users; and/or the processing resources may be further configured to control at least a part of the facial recognition system to at least determine whether the one(s) of the reference facial feature vectors in the subset of reference facial feature vectors most closely match(es) the resultant facial feature vector exceeds a defined threshold and, conditioned upon that determination being satisfied, issue an authentication command.
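The threshold determination described above can be sketched as follows, purely for illustration. The sketch assumes that similarity scores for the candidate reference vectors have already been computed; the function name, the user identifiers, and the threshold value of 0.9 are hypothetical.

```python
def authenticate(scored_candidates, threshold):
    """Return the best-matching user if its similarity exceeds the threshold.

    scored_candidates is a list of (user_id, similarity) pairs.
    Returns None when no candidate qualifies, i.e. no authentication command issues.
    """
    if not scored_candidates:
        return None
    user_id, score = max(scored_candidates, key=lambda pair: pair[1])
    return user_id if score > threshold else None

accepted = authenticate([("u2", 0.87), ("u4", 0.99)], threshold=0.9)
rejected = authenticate([("u2", 0.50)], threshold=0.9)
```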
According to certain example embodiments, the vector matching may be performed in connection with a cosine similarity calculation; and the processing resources may be further configured to control at least a part of the facial recognition system to at least determine whether the one(s) of the reference facial feature vectors in the subset of reference facial feature vectors most closely match(es) the resultant facial feature vector exceeds a defined threshold and, conditioned upon that determination being satisfied, issue a predefined command or instruction.
According to certain example embodiments, the face recognition neural network and the multi-task learning classifier network may be separately trained. For example, the processing resources may be further configured to control at least a part of the facial recognition system to at least normalize training images used to train the face recognition neural network and/or the multi-task learning classifier network (e.g., by detecting and cutting face areas for the training images, and selectively resizing the cut face areas so that the cut face areas are of a common size, etc.); to train the multi-task learning classifier network with data from a subset of the training images that includes classification labels corresponding to classifications that can be output by the multi-task learning classifier network; etc.
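The training-image normalization mentioned above (cutting the face area and resizing it to a common size) can be sketched as follows. This minimal sketch assumes that a face detector has already supplied a bounding box, represents an image as a list of pixel rows, and uses nearest-neighbor resampling as one simple resizing choice; the function names and the box layout are hypothetical.

```python
def cut_face_area(image, box):
    """Cut out the region given by box = (top, left, height, width)."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]

def resize_nearest(image, out_height, out_width):
    """Resize a 2-D image to a common size using nearest-neighbor sampling."""
    in_height, in_width = len(image), len(image[0])
    return [
        [image[r * in_height // out_height][c * in_width // out_width]
         for c in range(out_width)]
        for r in range(out_height)
    ]

# Hypothetical 4x4 grayscale image whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
face = cut_face_area(image, (1, 1, 2, 2))
normalized = resize_nearest(face, 4, 4)
```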
According to certain example embodiments, the face recognition neural network may be a convolutional neural network trainable in connection with a softmax loss function, or variant thereof.
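For reference, the basic softmax loss (softmax followed by the negative log-likelihood of the correct class) can be sketched as follows. This shows only the standard, unmodified form; the variants mentioned above are not shown, and the function names and sample logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])

# Hypothetical scores over four identities; index 2 is the correct identity.
confident_loss = softmax_loss([0.1, 0.2, 5.0, 0.3], target_index=2)
uniform_loss = softmax_loss([0.0, 0.0, 0.0, 0.0], target_index=2)
```

With four equally scored classes, the loss equals ln(4); the loss shrinks toward zero as the score for the correct identity dominates.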
In addition to the features of the previous paragraphs, counterpart methods, non-transitory computer readable storage media tangibly storing instructions for performing such methods, executable computer programs, and the like, are contemplated herein, as well. Similarly, servers, client devices, and the like, usable in connection with the systems laid out in the previous paragraphs, also are contemplated herein.
These features, aspects, advantages, and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.