The importance of image recognition in our society is growing day by day, as computers and the virtual sphere take root.
The field of application of visual search engines and computer vision, object and pattern recognition technologies is broad, and has spread to a wide range of different uses and sectors, such as: industrial and machine vision, navigation, process control, homeland security, e-commerce, medical diagnosis, biological research, people identification and biometrics, marketing, social networks, etc.
In particular, the use of visual search for identification and similarity is a field of broad interest, whose commercial applications have been developed over the past decades owing to the growth in digital images and video and to the use of the Internet on the latest smartphones, tablets, etc., which include increasingly advanced built-in cameras.
A first approach to solving the visual search problem was “text-based retrieval”, where images are indexed using keywords, tags and classification codes, or subject headings. The limitations of such related art technologies are two-fold: first, images need to be indexed and labeled, which entails a great deal of time and resources, and second, the method is not standardized, as each user can subjectively interpret, define and describe images in a different way.
An alternative to text-based retrieval is the Content-Based Image Retrieval (CBIR) technique, which retrieves semantically relevant images from an image database based on automatically derived image features.
Image processing is rather complex; apart from the volume it takes up, there is a real challenge in efficiently translating high-level perceptions into low-level image features, and solving the well-known semantic gap. These technologies may seek to address the following:
- Decreasing response time
- Increasing accuracy
- Simplifying queries for image retrieval
- Increasing robustness and invariance to different environments, image capture conditions, and viewpoint changes
- Scalability to volume, time, and image nature; to large databases that change and increase in real-time, and flexibility and extendibility to other types of objects, images, and/or patterns.
One of the crucial points for CBIR systems to work properly is the definition and extraction of the image features, i.e. the selection of optimal and appropriate vectors, also called feature descriptors, describing, as completely and accurately as possible, the image or region of interest's visual information, with the minimum amount of necessary data. The purpose of this is to recognize, identify, sort and classify the query image or object of interest, with those identical and similar to it, through efficient search and comparison methods, applied over large image databases.
The technologies developed in this field so far are commonly based on direct 1:1 comparisons, pattern matching, or correlation methods applied to entire images/objects or to partial image windows/regions of interest (ROI). Such approaches are accurate and well suited to recognizing the global structure of specific, previously known objects within a limited, trained database, but they cannot cope well with partial occlusion, significant changes in viewpoint, or deformable transformations (K. Grauman and B. Leibe, “Chapter 3: Local Features: Detection and Description,” Visual Object Recognition, Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool (2011)). Furthermore, they are usually not robust to illumination changes or to noise from neighboring elements, which makes the scalability and flexibility of these systems very costly and, therefore, their applicability to CBIR quite questionable.
Another key factor to define the right CBIR descriptors is that they should be invariant, meaning that they should not be affected by parameters that are sensitive to different image or object capturing conditions and environments, such as illumination, rotation, scale, reversion, translation, affine transformations, and other effects.
Alternatively, there are efforts to develop CBIR systems implementing invariant low-level feature based descriptors to, on one hand, robustly describe images or objects in different capture contexts and conditions, and, on the other, to avoid the use and analysis of high-level features, which are more complex and costly, both in terms of implementation and necessary energy consumption and processing.
The use of these low-level feature vectors consists of indexing visual properties, using numerical values to describe these features, so that the image or object is represented as a point in an N-dimensional space. The process consists of extracting the feature vector of the query image or object and applying metrics and classification methods to analyze its similarity to the entries in the database.
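The retrieval step described above can be illustrated with a minimal sketch. It assumes Euclidean distance as the similarity metric and uses a hypothetical three-entry database of 3-dimensional descriptors; the names `euclidean`, `rank_by_similarity`, and the sample values are illustrative only and are not taken from any particular CBIR system.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(query, database, top_k=3):
    """Return the top_k (image_id, distance) pairs closest to the query."""
    scored = [(image_id, euclidean(query, vec)) for image_id, vec in database.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance = most similar
    return scored[:top_k]

# Hypothetical descriptors: each image is a point in a 3-dimensional space
# (for example, mean R, G, B values scaled to [0, 1]).
database = {
    "red_shirt": [0.9, 0.1, 0.1],
    "orange_shirt": [0.9, 0.5, 0.1],
    "blue_jeans": [0.1, 0.2, 0.8],
}
query = [0.85, 0.15, 0.1]

results = rank_by_similarity(query, database, top_k=2)
for image_id, distance in results:
    print(image_id, round(distance, 4))
```

An exhaustive scan like this one grows linearly with the size of the database, which is one reason indexing structures matter for large collections.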
Currently there are algorithmic methods for extracting this type of invariant feature from images, such as Scale-Invariant Feature Transform (SIFT), Generalized Robust Invariant Feature (G-RIF), Speeded-Up Robust Features (SURF), PCA-SIFT, GLOH, etc. However, these methods describe the concrete, local appearance of objects or of specific image regions by selecting a set of points of interest, usually obtained with machine learning and training methods applied to previously known, limited databases, meaning that they are not extendable to other objects and categories without corresponding prior training.
In this context, challenges include specifying indexing structures that speed up image retrieval through flexible and scalable methods.
Thus, another alternative to low-level features is the use of descriptors of features such as color, shape, texture, etc., for developing generic vectors applicable to various sorts of images and objects. The purpose of the methods used to optimize these vectors/descriptors is to obtain the maximum information while including the minimum number of parameters or variables. To this end, selection methods are used to determine the most important features and combinations thereof, in order to describe and query items in large databases, reducing the complexity (in terms of both time and computer processing) of search and retrieval while attempting to maintain high accuracy. Moreover, this helps end users by automatically associating the right features and measurements with a given database (I. Guyon and A. Elisseeff, “An Introduction to Variable and Feature Selection,” Journal of Machine Learning Research 3 (2003), 1157-1182). These methods can be divided into two groups:
- Feature transform methods, such as the principal component analysis (PCA) statistical procedure and the independent component analysis (ICA) computational method, which map the original feature space into a lower-dimensional space and construct new feature vectors. The problem with feature transform algorithms is their sensitivity to noise, and that the resulting features are meaningless to the user.
- Feature selection schemes, which are robust against noise and yield highly interpretable features. The objective of feature selection is to choose a subset of features that reduces feature vector length while losing the least amount of information. Feature selection schemes, according to their subset evaluation methods, are in turn classified into two groups:
    - Filtering methods, where the features are evaluated based on their intrinsic effect and natural separation into classes or clusters.
    - Wrapper methods, which take advantage of learning method accuracy to evaluate feature subsets.
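As an illustration of the filtering approach, the sketch below ranks each feature by how well it separates labeled classes and keeps the best ones. The Fisher-style score (between-class variance divided by within-class variance) is an assumption chosen for illustration, not a criterion named in the text, and the sample data and function names are hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(values, labels):
    """Between-class variance over within-class variance for one feature."""
    overall = mean(values)
    between = 0.0
    within = 0.0
    for cls in sorted(set(labels)):
        group = [v for v, label in zip(values, labels) if label == cls]
        between += len(group) * (mean(group) - overall) ** 2
        within += len(group) * pvariance(group)
    return between / within if within > 0 else float("inf")

def select_features(samples, labels, keep):
    """Return (indices of the `keep` best feature columns, all scores)."""
    n_features = len(samples[0])
    scores = [
        fisher_score([sample[j] for sample in samples], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep]), scores

# Hypothetical data: feature 0 separates the classes strongly,
# feature 1 is mostly noise, feature 2 separates them more weakly.
samples = [
    [0.10, 0.50, 0.30],
    [0.20, 0.10, 0.40],
    [0.15, 0.90, 0.35],
    [0.90, 0.40, 0.60],
    [0.80, 0.20, 0.50],
    [0.85, 0.80, 0.55],
]
labels = ["a", "a", "a", "b", "b", "b"]

selected, scores = select_features(samples, labels, keep=2)
print(selected)
```

Because the score is computed per feature and without training a classifier, this is a filter rather than a wrapper. A wrapper would instead evaluate candidate subsets by the accuracy of a learning method trained on them.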
Feature selection in CBIR systems has so far been achieved with different approaches based on machine learning and training methods, which optimize accuracy and results for specific, tailored training cases and database samples. These approaches are therefore not generally extendable to other or new cases and database samples that were not initially considered and trained, or to different sorts of image and object categories.
Of all these generic feature vectors, color and texture are two of the most relevant descriptors, most commonly used in image and video retrieval. As a result, companies and researchers have gone to great lengths to improve them and base their CBIR systems on them.
A color descriptor, or color feature, is a global feature that describes the surface properties of the scene, in terms of images or of regions or objects thereof. The different ways to extract color features are explained in Lulu Fan, Zhonghu Yuan, Xiaowei Han, Wenwu Hua, “Overview of Content-Based Image Feature Extraction Methods,” International Conference on Computer, Networks and Communication Engineering (2013).
Different color spaces are widely known for their application in CBIR and their advantages in identifying perceptual colors. No color space can be considered universal, because color can be interpreted and modeled in different ways. With a wide variety of available color spaces (e.g. RGB, CMY, YIQ, YUV, XYZ, rg, CIE Lab, Luv, HSV, etc.) and a wide variety of descriptors for defining the colors of images and objects, it is not obvious which color space and which features should be measured in order to describe an image and to identify those identical and most similar to it. In this context, a question that arises is how to select the color model that offers the best results for a specific computer vision task. These difficulties are explained in detail in H. Stokman and T. Gevers, “Selection and Fusion of Color Models for Image Feature Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, March 2007, where the authors suggest a generic (invariant) selection model or models.
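To make the effect of the color space choice concrete, the following sketch converts pixels from RGB to HSV with Python's standard `colorsys` module and builds a coarse hue histogram. The helper names, the six-bin quantization, and the sample pixels are assumptions made for illustration. The sketch shows that a bright red and a dark red, which are far apart in RGB, share the same hue and saturation in HSV and differ only in value.

```python
import colorsys

def to_hsv(rgb):
    """Convert an (r, g, b) pixel with 0..255 channels to (h, s, v) in 0..1."""
    r, g, b = (channel / 255 for channel in rgb)
    return colorsys.rgb_to_hsv(r, g, b)

def hue_histogram(pixels, bins=6):
    """Normalized histogram of hue values, quantized into `bins` buckets."""
    counts = [0] * bins
    for pixel in pixels:
        hue, _, _ = to_hsv(pixel)
        # hue lies in [0, 1]; clamp so that hue == 1.0 falls in the last bin
        counts[min(int(hue * bins), bins - 1)] += 1
    return [count / len(pixels) for count in counts]

# Hypothetical sample pixels.
bright_red = (255, 0, 0)
dark_red = (128, 0, 0)
sky_blue = (0, 128, 255)

print(to_hsv(bright_red))
print(to_hsv(dark_red))
print(hue_histogram([bright_red, dark_red, sky_blue]))
```

A hue-based descriptor therefore groups the two reds together regardless of brightness, whereas a descriptor built directly on RGB channels would not. Which behavior is preferable depends on the task.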
Most descriptors of this kind developed to date have multiple limitations, as reflected in the recent publication by Lulu Fan, Zhonghu Yuan, Xiaowei Han, Wenwu Hua, “Overview of Content-Based Image Feature Extraction Methods,” International Conference on Computer, Networks and Communication Engineering (2013). The existing color descriptors are not usually able to describe local distributions, spatial localization and region changes in the image and, in short, are insufficient for unequivocally interpreting, recognizing, classifying and identifying specific complex objects or images, specific high-level patterns, image regions and details, or for finding others that are close or semantically similar. Shape and texture descriptors need complex computational processes, or specific models with prior training.
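The spatial limitation of global color descriptors can be seen in a small sketch. It assumes images are represented as rows of color labels and uses two hypothetical 4x2 images; the band-wise histogram is included only as a contrast, to show what information the global histogram discards.

```python
from collections import Counter

def global_histogram(image):
    """Count each color over the whole image (a list of pixel rows)."""
    return Counter(pixel for row in image for pixel in row)

def band_histograms(image, band_rows):
    """Histograms computed separately for consecutive horizontal bands."""
    return [
        global_histogram(image[start:start + band_rows])
        for start in range(0, len(image), band_rows)
    ]

# Two hypothetical images with the same pixels arranged differently.
red_over_blue = [["red", "red"], ["red", "red"], ["blue", "blue"], ["blue", "blue"]]
striped = [["red", "red"], ["blue", "blue"], ["red", "red"], ["blue", "blue"]]

same_global = global_histogram(red_over_blue) == global_histogram(striped)
same_bands = band_histograms(red_over_blue, 2) == band_histograms(striped, 2)
print(same_global, same_bands)  # prints: True False
```

The two images are indistinguishable to the global histogram even though their layouts differ, which is the kind of ambiguity the paragraph above describes.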
In summary, there is a key dilemma when it comes to the goals pursued in descriptor selection and extraction for CBIR systems. When robustness, invariance, flexibility and scalability are sought, accuracy loses out. When accuracy is achieved, what is lost is robustness, flexibility and extendibility to other types of images, products or categories.
As a solution to, and evolution of, these feature descriptors, the so-called high-level semantic descriptors have arisen. They attempt to interpret visual information in the way closest to subjective human perception, in order to achieve descriptors that are simultaneously optimal in terms of accuracy, invariance, robustness, flexibility, and scalability, as the human brain is when interpreting the visual world. However, these descriptors, which aim to get even closer to human intelligence, face barriers due to their algorithmic, computational and storage complexity.