A relational database (RDB) makes it possible to manage data in a structured manner. For this reason, data stored in an RDB is sometimes referred to as “structured data”. In contrast, text data produced by speech-to-text conversion of a telephone conversation or a meeting, and audio data produced by recording such events, are referred to as “unstructured data”. Other examples of unstructured data include image data and sensor data output from various types of sensors. In recent years, attention has focused on technologies that make effective use of unstructured data.
Similarity search technologies for unstructured data are useful when putting large amounts of unstructured data to use. As one example, when unstructured data that is similar to specified unstructured data collected in the past can be extracted from a large amount of present unstructured data, it becomes possible to perform time series analysis on unstructured data with a given feature. Similarity search technologies for unstructured data are also used to match patterns, such as fingerprints and veins, during personal authentication, to cluster and classify unstructured data, and to detect unauthorized access to an information system.
An authentication system that provides a personal authentication service using fingerprint data performs authentication by searching a vast amount of fingerprint data collected from a large number of registered users for fingerprint data that is similar to inputted fingerprint data.
In a similarity search for unstructured data, a feature vector expressing a feature of the unstructured data is used. As one example, an authentication system calculates, as the degree of dissimilarity between fingerprint data, the Euclidean distance between pairs of feature vectors in a feature space and searches for fingerprint data corresponding to feature vectors with the smallest degree of dissimilarity.
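The search described above can be sketched as follows. This is a minimal illustration only, assuming feature vectors are plain lists of floats; the function names and the record IDs (`user_a` and so on) are hypothetical and do not come from any particular authentication system.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query, registered):
    """Return (record_id, distance) for the registered vector closest to query.

    `registered` maps a hypothetical record ID to its feature vector.
    A smaller distance means a smaller degree of dissimilarity.
    """
    return min(
        ((rid, euclidean_distance(query, vec)) for rid, vec in registered.items()),
        key=lambda pair: pair[1],
    )


# Toy 3-dimensional vectors; real feature vectors have far more dimensions.
registered = {
    "user_a": [0.0, 0.0, 0.0],
    "user_b": [1.0, 2.0, 2.0],
    "user_c": [3.0, 4.0, 0.0],
}
best_id, best_dist = most_similar([1.0, 2.0, 1.0], registered)
```

Because every registered vector is compared against the query, the cost of this exhaustive search grows with both the number of registered records and the number of dimensions.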
The feature vectors generated from unstructured data such as fingerprint data are high-dimensional vectors with anywhere from about ten to about one thousand dimensions. Out of the processing relating to authentication, the processing that specifies feature vectors with a small degree of dissimilarity has an especially high load. To reduce this load, a method has been proposed that converts each feature vector to binary data of a specified length (a bit string of a predetermined length) and narrows the search to feature vectors with a small degree of dissimilarity based on the Hamming distance between the binary data strings.
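The narrowing step based on the Hamming distance might look like the sketch below. It assumes each bit string is held as a Python integer and that candidates are kept when their Hamming distance to the query code is at most a threshold; the threshold `max_distance`, the function names, and the record IDs are illustrative assumptions.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two equal-length codes differ."""
    return bin(a ^ b).count("1")


def narrow_candidates(query_code, codes, max_distance):
    """Keep only the records whose code is within max_distance of the query.

    `codes` maps a hypothetical record ID to its bit string (as an int).
    The survivors would then be compared using the full feature vectors.
    """
    return [
        rid
        for rid, code in codes.items()
        if hamming_distance(query_code, code) <= max_distance
    ]


# Toy 8-bit codes.
codes = {
    "user_a": 0b10110010,
    "user_b": 0b10110110,
    "user_c": 0b01001101,
}
candidates = narrow_candidates(0b10110011, codes, max_distance=2)
```

The XOR and bit count are far cheaper than a distance computation over a high-dimensional real vector, which is why the binary codes are suited to a first-pass filter.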
One method of converting a feature vector to binary data uses hyperplanes to bisect the feature space and decides each bit value according to which of the two half-spaces produced by a hyperplane the feature vector lies in. When N hyperplanes are used, N bits of binary data are obtained from one feature vector. Note that N is set sufficiently smaller than the number of dimensions of the feature vector.
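As a rough sketch of this conversion, each hyperplane can be represented by a normal vector and an offset, with the bit given by the sign of the resulting expression. The representation, the convention that a point exactly on the hyperplane maps to 1, and the hand-picked hyperplanes below are all assumptions made for illustration; how the hyperplanes are chosen is not specified here.

```python
def binarize(vector, hyperplanes):
    """Convert a feature vector to a list of N bits using N hyperplanes.

    Each hyperplane is a (normal, offset) pair describing the set of points x
    with dot(normal, x) + offset == 0. The bit is 1 when the vector lies on
    the non-negative side and 0 otherwise.
    """
    bits = []
    for normal, offset in hyperplanes:
        side = sum(w * x for w, x in zip(normal, vector)) + offset
        bits.append(1 if side >= 0 else 0)
    return bits


# Two hand-picked hyperplanes in a 4-dimensional feature space (N = 2).
hyperplanes = [
    ([1.0, -1.0, 0.0, 0.0], 0.0),
    ([0.0, 0.0, 1.0, 1.0], -1.0),
]
code_1 = binarize([0.9, 0.2, 0.3, 0.4], hyperplanes)
code_2 = binarize([0.1, 0.5, 0.8, 0.6], hyperplanes)
```

Here the two vectors fall on opposite sides of both hyperplanes, so their codes differ in both bits.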
As another technique that uses binary data converted from feature vectors, a method has been proposed that searches for neighborhood data (i.e., unstructured data with a small degree of dissimilarity) for query data (i.e., the unstructured data used as a search key). To improve the search precision, a method has also been proposed that searches for neighborhood data using a symbol string produced by inserting a wildcard symbol (i.e., a symbol that is determined to match whatever symbol is present at the same position in the data being compared) into the binary data.
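The effect of a wildcard symbol on the comparison can be illustrated with a small sketch. It assumes symbol strings over the characters `0`, `1`, and `*`, with `*` as the wildcard; the choice of symbol and the function name are hypothetical, and the rule for deciding where wildcards are inserted is outside the scope of this example.

```python
WILDCARD = "*"  # assumed wildcard symbol for this illustration


def wildcard_hamming(a: str, b: str) -> int:
    """Hamming-style distance in which a wildcard matches any symbol.

    A position contributes to the distance only when the two symbols differ
    and neither of them is the wildcard.
    """
    if len(a) != len(b):
        raise ValueError("symbol strings must have the same length")
    return sum(
        1 for x, y in zip(a, b) if x != y and WILDCARD not in (x, y)
    )


# The third position is ignored because the query holds a wildcard there.
d_with_wildcard = wildcard_hamming("10*1", "1101")
d_without = wildcard_hamming("1011", "1101")
```

Positions covered by a wildcard no longer contribute to the distance, so a bit that is unreliable for a given item of data does not push otherwise similar data out of the neighborhood.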
A method has also been proposed that generates a feature value, which expresses digital data as a real number vector with D dimensions (where D>0), and generates a hash function based on relative geometric relationships in the proximity of the feature value.
Note that regarding the symmetry of a similarity-based relationship, a method that selects a favorable predictor based on an asymmetric similarity-based relationship has been proposed. With this method, feature representations of training clusters and a transformation matrix used to transform the feature representations are used to select a predictor. The transformation matrix maximizes, for a pair of training clusters, the asymmetric degree of similarity between the feature representation of one training cluster and the feature representation of the other training cluster after transformation. With this method, Kullback-Leibler Divergence (KLD) is used as the asymmetric degree of similarity.
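The asymmetry of KLD can be seen in a short numerical sketch. The example below assumes discrete probability distributions given as lists of strictly positive probabilities and uses the natural logarithm; it illustrates only the divergence itself, not the predictor selection or the transformation matrix described above.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions.

    Assumes p and q have the same length and that every q[i] is positive
    wherever p[i] is positive. Terms with p[i] == 0 contribute nothing.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)  # divergence of q from p
d_qp = kl_divergence(q, p)  # divergence of p from q
```

The two values differ, so exchanging the arguments changes the result. This is the sense in which KLD serves as an asymmetric degree of similarity, in contrast to the Euclidean and Hamming distances.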
See, for example, the following documents.
Japanese Laid-open Patent Publication No. 2013-206187
Japanese Laid-open Patent Publication No. 2012-173793
Japanese Laid-open Patent Publication No. 2015-079101
A. Torralba, R. Fergus, Y. Weiss, “Small codes and large image databases for recognition”, 2008
By converting a high-dimensional feature vector to binary data and using the Hamming distance between binary data strings to narrow the selection of feature vectors to be searched and matched against a query, it is possible to speed up a similarity search for unstructured data. It is also possible to apply this technology to a similarity search for unstructured data in a variety of situations where high-speed processing is demanded, such as an authentication system that uses biometric data (which is unstructured data) like fingerprints, veins, or voiceprints.
However, when handling unstructured data that is easily affected by the collection environment, such as biometric data, environment-based effects sometimes appear as errors in the similarity search. For example, fingerprint data changes depending on how dry the environment is, while voiceprint data changes depending on factors such as ambient noise, humidity, and the state of the throat. When performing biometric authentication using image data, such as facial recognition or iris recognition, the image data is affected by changes in the skin due to physical condition, changes in expression due to mood, and lighting conditions.
Although it would be possible to suppress the effects of environmental changes by using feature vectors that do not include elements affected by such changes, it is currently difficult to find a suitable feature vector. Since it must therefore be assumed that elements in a feature vector will be affected by environmental changes, it would be desirable to develop a technology that suppresses the effect that environmental changes have on search precision.