The exemplary embodiment relates to matching text to images and finds particular application in a system and method for performing one-directional matching which preserves the privacy of information in the image.
In “text-to-image” matching, the aim is to retrieve images in a dataset that match an input text query. One example of this is in license plate matching. In this case, given a license plate number in text format and a dataset of images of license plates, the aim is to find the images of license plates in the set that match the text query.
One method for performing such a matching involves embedding the text and images in a common vector space and learning a compatibility function between them (J. A. Rodríguez Serrano and F. Perronnin, “Label Embedding for Text Recognition,” BMVC, pp. 5.1-5.12 (2013) and U.S. Pub. No. 20140219563 to Rodríguez Serrano, et al., hereinafter, collectively referred to as “Rodríguez Serrano 2013”). Although the compatibility function is trained with the objective of maximizing the “text-to-image” matching accuracy, it also performs well in “image-to-text” tasks, i.e., given an image signature, the image's transcription (contained text) can be retrieved. While this property can be useful in some applications, in other cases it can be problematic. For example, if there were to be a security breach and the signatures of the images were leaked, someone with knowledge of how the system works could exploit this property to find the text matches for the image signatures. If the signatures encode nominal (personal) data, this could reveal information to unauthorized parties.
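By way of illustration only, the two-directional behavior of a shared compatibility function can be sketched as follows. The sketch assumes a bilinear compatibility function of the form F(x, y) = xᵀWy, which is one common choice and is not necessarily the exact form used in Rodríguez Serrano 2013. The function names, the character-count text embedding, the identity matrix W, and the stand-in image signatures are all hypothetical and are chosen only to keep the example small.

```python
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def embed_text(text):
    """Hypothetical text embedding: L2-normalized character counts."""
    counts = Counter(text.upper())
    vec = [float(counts.get(ch, 0)) for ch in ALPHABET]
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]


def compatibility(image_sig, text_emb, W):
    """Assumed bilinear score F(x, y) = x^T W y."""
    return sum(
        x_i * sum(w_ij * y_j for w_ij, y_j in zip(row, text_emb))
        for x_i, row in zip(image_sig, W)
    )


def text_to_image(query, image_sigs, W):
    """Index of the stored signature that best matches a text query."""
    y = embed_text(query)
    return max(range(len(image_sigs)),
               key=lambda i: compatibility(image_sigs[i], y, W))


def image_to_text(image_sig, candidates, W):
    """Candidate text that best matches a signature (the privacy risk)."""
    return max(candidates,
               key=lambda t: compatibility(image_sig, embed_text(t), W))


# Toy data: W is the identity, and the "image signatures" are stand-ins
# built from the plate text itself; real signatures come from image features.
n = len(ALPHABET)
W = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
plates = ["ABC123", "XYZ789", "LMN456"]
image_sigs = [embed_text(p) for p in plates]

best_image = text_to_image("XYZ789", image_sigs, W)
recovered_text = image_to_text(image_sigs[2], plates, W)
```

In this toy setting the same function that ranks signatures for a text query also ranks candidate texts for a leaked signature, which is the property that raises the privacy concern noted above.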
Other approaches for text-to-image matching include transcribing all the dataset images and performing a text-to-text matching using a text distance such as the Levenshtein distance (V. Frinken, et al., “A novel word spotting method based on recurrent neural networks,” IEEE Trans. on Pattern Analysis and Mach. Intel. (TPAMI), 34:12, pp. 211-224 (2012)). This approach assumes that either the actual dataset images (not just the signatures) or the transcriptions are available on the system, and that privacy is not an issue. Because the approach depends on transcription, making the image-to-text matching difficult would render the system useless. In another approach, synthesized images are generated from the text queries and an image-to-image matching is performed (J. A. Rodríguez Serrano, et al., “Handwritten word-image retrieval with synthesized typed queries,” Intl Conf. on Document Analysis and Recognition (ICDAR), pp. 351-355 (2009)). This method does not require storing the images or the transcriptions of the dataset on the system, only their signatures. While synthesizing image queries from the text offers some privacy guarantees, if the synthesizing technique were to be leaked, an attacker could synthesize image queries from its own text and match them against the stored signatures. Also, since the synthetic images need to match the style of the dataset images, which may be unknown, this approach may lead to lower accuracy in some cases.
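By way of illustration only, the transcription-based approach can be sketched as follows, assuming the dataset images have already been transcribed by some recognizer. The function names and the sample transcriptions are hypothetical; the distance is the standard Levenshtein edit distance with unit costs for insertion, deletion, and substitution.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (unit-cost operations)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_by_transcription(query, transcriptions):
    """Indices of dataset images, ordered from best to worst text match."""
    return sorted(range(len(transcriptions)),
                  key=lambda i: levenshtein(query, transcriptions[i]))


# Hypothetical transcriptions of three dataset images.
transcriptions = ["XYZ789", "ABC128", "ABC123"]
ranking = rank_by_transcription("ABC123", transcriptions)
```

Note that this sketch operates directly on the stored transcriptions, which is why the approach offers no privacy protection for the text contained in the images.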
There remains a need for a method which allows high accuracy “text-to-image” matching but which limits the ability to perform “image-to-text” matching on the data.