With the rapid development of the Internet, multimedia data of different types, such as images, text, video, and audio, has grown rapidly, and data of several types often appears together to describe the same thing. Information in different modalities reflects different attributes of a thing, and people draw on information in different modalities to describe things in different forms. For example, given an image, we may want to find a text description associated with it; given a piece of text, we may want to find an image or a video that matches its semantics. Meeting these needs requires technologies for cross-media search.
Most existing search systems, including search engines such as Google and Baidu, are based on single-modality text information. Their functions for searching images, audio, and video essentially match the query against a meta-database composed of text information, so this type of search still belongs to traditional keyword-based search technology. Although keywords can accurately describe the details of a concept, they can hardly convey a picture or a piece of video completely and vividly, and a text description may carry the subjectivity of the person who labeled it. Because of this inherent flaw, many scholars have turned to content-based search technologies, which enable computers to understand the content of multimedia information more accurately by fully mining the semantic associations of multimedia data. However, content-based search generally focuses only on the low-level features of the media and is usually limited to media objects of a single modality: the query and the search results must be in the same modality, and a comprehensive search across media types is not possible. The concept of cross-media search was therefore proposed. Cross-media search does not rely on a single modality and supports mutual search between media of any modality. By inputting information of any media type, a user can obtain related information in other media types and can more quickly find results that meet the requirements within a huge amount of multi-modal data.
Existing cross-media search methods mainly address three key issues: cross-media metrics, cross-media indexing, and cross-media sorting. Typical methods for these three issues are, respectively, cross-media metrics methods based on matching models, cross-media indexing methods based on hash learning, and cross-media sorting methods based on sorting learning (learning to rank), as follows:
First, a cross-media metrics method based on a matching model. The matching model is trained on training data of known category to mine the internal relationships between different types of data; the similarity between cross-media data is then calculated, and the search results with the highest correlation are returned. There are two kinds of matching methods: one is correlation matching, for example using Canonical Correlation Analysis (CCA); the other is Semantic Matching (SM), for example using multi-class logistic regression for semantic classification.
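As an illustration of the semantic matching variant, the following minimal Python sketch maps the features of each modality to class posterior vectors with a multi-class logistic regression model and ranks database images against a text query by cosine similarity in that shared semantic space. The weight matrices W_IMG and W_TXT, the biases, the feature values, and all function names are hypothetical and chosen by hand for illustration; in practice the weights would be learned from training data of known category.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def posterior(x, weights, biases):
    """Class posterior of a multi-class logistic regression model."""
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical, hand-picked parameters for two semantic classes.
W_IMG = [[2.0, -2.0], [-2.0, 2.0]]            # image features are 2-D
B_IMG = [0.0, 0.0]
W_TXT = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]  # text features are 3-D
B_TXT = [0.0, 0.0]

IMAGES = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.4]]  # toy image database
TEXT_QUERY = [0.0, 0.3, 1.0]                   # toy text query

def search(text_query, images, k):
    """Return indices of the k images most similar to the text query."""
    q = posterior(text_query, W_TXT, B_TXT)
    sims = [cosine(q, posterior(img, W_IMG, B_IMG)) for img in images]
    order = sorted(range(len(images)), key=lambda i: sims[i], reverse=True)
    return order[:k]

print(search(TEXT_QUERY, IMAGES, k=2))
```

Because both modalities are projected onto the same set of semantic classes, features of different dimensionality become directly comparable. A correlation matching method such as CCA would instead learn a pair of linear projections that maximize the correlation between paired image and text features.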
Second, a cross-media indexing method based on hash learning. Given the massive amount of data on the Internet, users demand higher search speed. Hash indexing is an effective way to speed up Approximate Nearest Neighbor (ANN) search. The method converts the original feature data into binary hash codes through a learned hash model while preserving, as far as possible, the neighbor relationships of the original space, that is, preserving the correlation.
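The indexing step can be sketched in Python as follows. This is a minimal illustration, not a learned hash model: each modality is hashed by the sign of its projection onto a few hyperplanes, and the hyperplanes H_IMG and H_TXT, the feature values, and the function names are hypothetical and fixed by hand. In a hash learning method these projections would be learned so that semantically related images and texts receive similar codes.

```python
def hash_code(x, hyperplanes):
    """One bit per hyperplane: 1 if the projection is non-negative, else 0."""
    return [1 if sum(w * v for w, v in zip(h, x)) >= 0 else 0
            for h in hyperplanes]

def hamming(a, b):
    """Number of positions in which two binary codes differ."""
    return sum(p != q for p, q in zip(a, b))

def ann_search(query_code, db_codes, k):
    """Indices of the k database codes closest to the query in Hamming distance."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    return order[:k]

# Hypothetical modality-specific hash functions (3-bit codes).
H_IMG = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]                   # 2-D image features
H_TXT = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, -1.0, 0.0]]    # 3-D text features

IMAGE_DB = [[0.9, 0.8], [-0.7, 0.6], [-0.5, -0.9]]
TEXT_QUERY = [-0.2, 0.5, 0.1]

db_codes = [hash_code(img, H_IMG) for img in IMAGE_DB]
query_code = hash_code(TEXT_QUERY, H_TXT)

print(db_codes)                              # [[1, 1, 1], [0, 1, 0], [0, 0, 1]]
print(query_code)                            # [0, 1, 0]
print(ann_search(query_code, db_codes, 2))   # [1, 0]
```

Comparing short binary codes by Hamming distance is much cheaper than comparing the original real-valued features, which is where the speed-up for ANN search comes from.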
Third, a cross-media sorting method based on sorting learning (learning to rank). The purpose of cross-media sorting is to learn a sorting model between different modalities that is based on semantic similarity. Specifically, after semantically related cross-media data has been retrieved, the search results are re-sorted so that more relevant data is ranked higher, and the optimization process is iterated until convergence to obtain the optimal search results.
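One simple way to picture such an iterated optimization is a pairwise linear ranker, sketched below in Python. It is an assumption-laden illustration rather than any specific published method: each retrieved item is described by a hypothetical two-dimensional feature vector, the training data consists of hand-made (more relevant, less relevant) pairs, and the weights are updated with a margin-based perceptron-style rule until no pair violates the margin.

```python
def score(w, x):
    """Linear relevance score of a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise_ranker(pairs, dim, lr=0.1, margin=1.0, max_epochs=1000):
    """Iterate over preference pairs until every pair is ranked with the margin."""
    w = [0.0] * dim
    for _ in range(max_epochs):
        updated = False
        for better, worse in pairs:
            # Update only when the more relevant item does not win by the margin.
            if score(w, better) - score(w, worse) < margin:
                for j in range(dim):
                    w[j] += lr * (better[j] - worse[j])
                updated = True
        if not updated:  # converged: no violated pair in a full pass
            break
    return w

def rerank(w, candidates):
    """Sort candidate names by descending learned score."""
    return sorted(candidates, key=lambda name: score(w, candidates[name]),
                  reverse=True)

# Hypothetical training pairs: (features of more relevant, features of less relevant).
PAIRS = [
    ([0.9, 0.2], [0.3, 0.4]),
    ([0.8, 0.5], [0.2, 0.6]),
    ([0.7, 0.1], [0.1, 0.3]),
]
# Hypothetical retrieved results to be re-sorted.
CANDIDATES = {"a": [0.3, 0.4], "b": [0.9, 0.2], "c": [0.7, 0.1]}

w = train_pairwise_ranker(PAIRS, dim=2)
print(rerank(w, CANDIDATES))
```

The loop stops as soon as a full pass over the pairs requires no update, which corresponds to the convergence condition described above; `max_epochs` only bounds the run time when the pairs cannot all be satisfied.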
In the above methods, almost all of the image and text features used are traditional hand-crafted features, such as SIFT features. Even as computer processing performance and computing capacity have continued to improve, these traditional hand-crafted features have held back further gains in cross-media search performance. In the past year, attention has turned to combining deep learning techniques with cross-media search, and it has been shown that the effective application of deep learning can often bring breakthroughs in search effectiveness.