1. Field of the Invention
The present invention relates to database systems. More specifically, the invention is directed to a system for improving the performance of retrieving visual information from a database.
2. Description of the Related Technology
The emerging field of Visual Information Retrieval (VIR) develops techniques and systems for extracting and accessing the content of visual media, including images and videos. VIR systems are intended to let users search repositories of visual information much as they would a textual database, but without requiring that the visual media objects be pre-annotated with textual descriptors.
An important class of query posed to a VIR system is the similarity query, in which the user presents a clue to the system as a description of what he or she is looking for. The clue may be posed as:
a visual example, such as a similar picture;
a sketch drawn to depict the search item;
a set of keywords that describe the search item; or
a set of additional criteria, such as whether the user is more interested in the color of the visual media object, the relative sizes and spatial arrangements of interesting regions within the visual media object, and so on.
Capturing the intent behind such a clue is difficult because:
two users may use the same clue with different intents;
a user may use different clues with the same intent; and
the system cannot determine which aspect of the clue contains the real intent of the user.
1. Current VIR Systems: There are several types of VIR systems currently available as research prototypes or commercial systems. FIG. 1 shows a block diagram of a computer-implemented system 100 with VIR capabilities. The system includes a computer 102, such as an Indy available from Silicon Graphics, Inc. Other computers, such as those based on the Alpha processor from Digital Equipment Corp., the Pentium family of processors from Intel, or the SPARC processor from Sun Microsystems, could alternatively be used. A visual display 104 is connected to the computer 102. User input 106 is received through any of a variety of input devices, such as a keyboard, a mouse, or other pointing devices (not shown). A visual information retrieval system 110, a database management system (DBMS) 112 and an operating system (OS) 114 are also part of the computer-implemented system 100. A database 108 interfaces with the DBMS 112 and the OS 114. The DBMS 112 is available from many vendors; examples include the Universal Server marketed by Informix Software Inc., the Oracle8 Universal Data Server marketed by Oracle Corp., the DB2 system available from IBM, and the Sybase DBMS. The OS may be any of a variety of available operating systems, such as Unix, Microsoft Windows NT, or Solaris.
2. Text Retrieval Systems: Many text retrieval systems use the notion of "query expansion", in which an original query term is replaced by a list of closely related terms. These related terms may be generated by adding suffixes using grammatical rules, by a set of synonyms found through a thesaurus lookup, by a relational knowledge representation structure such as a frame network, or by a natural language processing technique that helps the system disambiguate the word sense. Since these methods have been useful for complex queries in text retrieval systems, equivalent methods have been developed for use in VIR systems. However, because of the very different characteristics of text and visual information, a direct implementation of these methods for visual media is not meaningful. Explicit visual thesauri, although alluded to by some researchers, such as D. Romer, presentation at the seminar "Towards a research agenda for cultural heritage in information networks", The Getty Art History Information Program, 1995, are not yet a reality, because the specification, implementation and use of such a mechanism have not been demonstrated. In fact, the only work directly cited as a visual thesaurus by Romer, namely G. L. Lohse, K. Biolsi, N. Walker and H. H. Rueter, "A Classification of Visual Representations", Communications of the ACM, Vol. 37(12), pp. 36-49, 1994, investigates a taxonomy of user perceptions of graphical entities such as charts, graphs, icons, and pictures, and has none of the "synonym"- or "antonym"-like functionality that the notion of a "thesaurus" implies.
3. Current Computer Vision and Pattern Recognition Research: Recently, researchers in computer vision and pattern recognition have approached the problem of visual information retrieval from a semantic object modeling and retrieval point of view.
These research prototypes use one or more feature extraction techniques to model semantic concepts such as fire, people, and sky from user-provided examples, and then use these models for retrieval. For example, some of the prototypes segment images, combine color, texture and shape information into composite groupings, and learn a mapping from user-specified labels to one cluster per concept.
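A minimal sketch of the one-cluster-per-concept modeling described above, with hypothetical feature values standing in for real color/texture measurements: each concept is modeled as the centroid of its user-provided example feature vectors, and a new region is assigned to the nearest centroid.

```python
# Sketch of "one cluster per concept" modeling: each labeled example
# region contributes a feature vector; a concept is modeled as the
# centroid of its examples, and a new region is classified by the
# nearest centroid. The feature values are toy numbers, not real
# color/texture measurements.

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_concept(feature, models):
    """Return the concept whose centroid is closest (Euclidean) to `feature`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(models, key=lambda c: dist(models[c], feature))

# User-provided examples: concept label -> [hue, saturation, smoothness] vectors.
examples = {
    "sky":  [[0.60, 0.50, 0.95], [0.58, 0.40, 0.90]],
    "fire": [[0.05, 0.90, 0.30], [0.08, 0.85, 0.25]],
}
models = {concept: centroid(vecs) for concept, vecs in examples.items()}

print(nearest_concept([0.59, 0.45, 0.92], models))  # a blue, smooth region -> sky
```

As the critique later in this section notes, a single cluster per concept is brittle: a red sky at dusk would fall far from the "sky" centroid modeled from blue-sky examples.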
In all of these cases, the user expects the system to locate those visual media instances in a database that match the clue. To the user, the performance of a VIR system is judged primarily by the degree to which the retrieved results correspond to the intent of the user-specified clue. However, it is very difficult to capture the intent of the user in a generic VIR system, for the reasons noted above.
It has been generally argued, both in the database literature, e.g., S. Abad-Mota and C. A. Kulikowski, "Semantic queries on image databases: the IMTKAS model", Proc. Basque International Workshop on Info. Tech., San Sebastian, Spain, pp. 20-28, 1995, and in the information retrieval literature, e.g., W. B. Croft and R. H. Thompson, "I3R: a new approach to the design of a document retrieval system", J. Amer. Soc. Info. Sci., vol. 38, pp. 389-404, 1987, that retrieval performance improves if the system is knowledge-based. Knowledge here refers to a body of facts about the application domain that is independent of the media data. A VIR system is knowledge-based if some mechanism of the system utilizes this body of facts for extraction and/or retrieval of visual information. The problem addressed is: what form of knowledge is suitable for a VIR system, and how can this knowledge be encoded into a generic VIR system so that the knowledge-encoding module is flexible, customizable, and easily changed if the user bias or application domain changes in a predictable manner? Performance here refers not to the reduction in search time, the criterion used by database researchers, but to the quality of results as measured by precision (the ratio of the number of correct results to the total number of results retrieved by the system), recall (the ratio of the number of correct results retrieved by the system to the total number of correct results that should have been retrieved), and fallout (the fraction of false positives retrieved by the system), as used by the Information Retrieval community.
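Under the definitions above, the three metrics follow directly from the set of retrieved items, the set of relevant items, and the collection size; a brief sketch:

```python
def retrieval_metrics(retrieved, relevant, collection_size):
    """Precision, recall, and fallout as defined above.

    retrieved       -- set of item IDs returned by the system
    relevant        -- set of item IDs that should have been returned
    collection_size -- total number of items in the database
    """
    true_pos = len(retrieved & relevant)
    false_pos = len(retrieved - relevant)
    non_relevant = collection_size - len(relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    fallout = false_pos / non_relevant if non_relevant else 0.0
    return precision, recall, fallout

# 10-item database; the system returns 4 items, 3 of which are among
# the 5 items it should have returned.
p, r, f = retrieval_metrics({1, 2, 3, 9}, {1, 2, 3, 4, 5}, 10)
print(p, r, f)  # 0.75 0.6 0.2
```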
Current VIR systems (discussed further hereinbelow) do not have such a flexible knowledge-encoding module to improve an accepted effectiveness metric such as the well-known Van Rijsbergen formula F(x)=[(x.sup.2 +1)*P*R]/[x.sup.2 *P+R], where * indicates multiplication, P and R denote precision and recall, respectively, and x is a free parameter quantifying the relative importance of precision and recall.
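As a sketch, the effectiveness formula above can be computed directly; with x = 1 it reduces to the familiar harmonic mean of precision and recall:

```python
def f_measure(precision, recall, x=1.0):
    """Van Rijsbergen's F(x) = ((x^2 + 1) * P * R) / (x^2 * P + R).

    Larger x weights recall more heavily relative to precision.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return ((x * x + 1.0) * precision * recall) / (x * x * precision + recall)

print(f_measure(0.75, 0.6))       # x = 1: harmonic mean 2PR/(P+R), about 0.667
print(f_measure(0.75, 0.6, 2.0))  # x = 2: recall emphasized
```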
What is needed is a visual knowledge encoding mechanism that solves the above-described problem. It would be desirable for such a mechanism to be utilized as a plug-in component in a VIR architecture.
There are three categories of work related to the current invention: current VIR systems, text retrieval systems, and computer vision and pattern recognition. The majority of this work relates to recently emerging technologies. The first category, current VIR systems, is now discussed in further detail.
FIG. 2 shows the general structure of the VIR system 110. The VIR system 110 preferably includes a user interface 130, an insertion module 132, a database interface 134 and a query module 136. The user interface 130 receives user input and queries 106 (FIG. 1) through one or more input devices. The user interface 130 sends an image to be inserted into the DBMS 112 (FIG. 1), together with metadata, to the insertion module 132. Metadata is information about the image, particularly information that is useful for query purposes; examples include the date a photographic image was created and the name of the photographer. The insertion module 132 generates a feature vector for the image and sends the feature vector with the metadata to the database interface 134 for storage by the DBMS 112. The user interface 130 also sends user queries to the query module 136. The query module 136 processes each query to generate a database query that is sent through the database interface 134 to the DBMS 112. The database interface 134 receives results back from the DBMS 112 and sends the results to the user interface 130 through the query module 136.
For the current VIR system category, there are four main types of systems: keywords only, generic image analysis, domain-specific image analysis, and keywords together with image properties. These four system types are described as follows. Such keyword-based methods have also been used for video retrieval, as described by T-S. Chua and L-Q. Ruan, "A video retrieval and sequencing system", ACM Trans. Info. Syst., vol. 13(4), pp. 373-407, 1995.
FIG. 3 shows a block diagram of a keyword-based image/video retrieval system 150. A query interface 152 receives keywords supplied by the user of the system 150. For this system 150, the query consists of a set of terms, i.e., keywords or phrases, connected together by operators, such as AND, OR, NOT and NEAR, as permitted by the underlying text database.
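The boolean keyword matching just described can be sketched as follows; the annotation data and the nested-tuple query representation are illustrative assumptions, not the format of any actual system (a real text database would use an inverted index and also support NEAR):

```python
# Toy boolean keyword matching over annotated images, covering only
# AND / OR / NOT over per-image keyword sets.

annotations = {                      # image ID -> annotated keywords (illustrative)
    "img1": {"sunset", "beach"},
    "img2": {"san", "francisco", "bridge"},
    "img3": {"sunset", "city"},
}

def matches(query, keywords):
    """Evaluate a nested query such as ("AND", t1, ("NOT", t2)) against a keyword set."""
    if isinstance(query, str):       # a bare term matches if it is annotated
        return query in keywords
    op, *args = query
    if op == "AND":
        return all(matches(a, keywords) for a in args)
    if op == "OR":
        return any(matches(a, keywords) for a in args)
    if op == "NOT":
        return not matches(args[0], keywords)
    raise ValueError("unknown operator: " + op)

query = ("AND", "sunset", ("NOT", "city"))
print(sorted(i for i, kw in annotations.items() if matches(query, kw)))  # ['img1']
```

Note that img2, which may well show a sunset in San Francisco, can never match a "sunset" query here; this is exactly the annotation dependence criticized below.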
The query is parsed by a query parser 154 to generate a query in text database format. The query is then processed by a text database 156, which returns a ranked list of asset identifiers (IDs) of those images that satisfy the query conditions to a result processor 158. The result processor 158 then fetches the corresponding images (or their thumbnail representations) or video frames from an image repository 160 and presents the results to the user through a results interface 162.
The problem with these keyword-only systems is that they are extremely dependent on the annotation. Hence, a user searching for "sunset" will never retrieve an image annotated only with the keyword "San Francisco", even though the image may actually show a sunset in San Francisco. What is desired is to use keywords if available, but to retrieve a visual media instance if it matches the same concept as intended by the query keyword, even though the instance retrieved may have different keywords, or no keywords, associated with it.
While the computer vision and pattern recognition methods described above also attempt to capture user intent by associating features with concepts, they are targeted at objectives different from those of the current invention. First, the mapping of one cluster to a concept is not a practical database solution for a VIR system. It is impossible to devise a clustering algorithm that will put a blue sky and a red sky at dusk in the same cluster without also grouping in many "non-sky" regions having smooth textures. Second, while the emphasis of these groups is on model building through the application of suitable learning techniques for grouping features, what is desired is exemplar collection, not model building. Although model building by learning is one way to create a victionary, the crucial point is to have a number of example cases, which would trigger a number of sufficiently diverse "expanded sub-queries" that form disjunctive descriptors of the user intent.
Third, the focus of this invention is on query processing in VIR systems and on the functional requirement specification of the victionary that aids the query process, not on any specific implementation that may realize the specifications covered herein. For the purposes of this invention, therefore, the existing research ideas are potentially deployable tools, but not conceptual building blocks.
a) Systems that use keywords only: Many commercial systems associate keywords with images. The user places a search request by typing in some query words, and the system retrieves the images annotated with the same words. A variation of this is to use a text thesaurus, in which case the system searches for the keywords and their synonyms. Hence, a query for "car" will match an image record annotated as "automobile".
b) Systems that perform generic image analysis: Many research systems, such as the projects at Columbia University, the University of Chicago, the University of Massachusetts and the University of California-Santa Barbara, and a few commercial systems, such as the Virage VIR Engine, Excalibur Visual Retrievalware and IBM QBIC, use image processing techniques to extract certain properties from the image. Among the most common properties are the color, texture and shape of user-designated image regions. These abstracted properties are stored in a composite data structure called a feature vector. Applicant's VIR system for image analysis is described in copending application Ser. No. 08/829,791, filed Mar. 28, 1997, for "SIMILARITY ENGINE FOR CONTENT-BASED RETRIEVAL OF OBJECTS" to Jain, et al., which is herein incorporated by reference.
FIG. 4 shows a block diagram that illustrates the structure and function of a typical VIR system 200 that performs image analysis. The database 112 (FIG. 1) stores a set of feature vectors for the images stored therein. The user presents a query through an interface 202. The query specifies an example image and a set of weights, which indicate the user's relative preferences among individual properties. A query parser 204 converts the user's query into an internal form (query structure) that a query processor 206 can recognize. This may involve analyzing the example image to extract its feature vector.
The query processor 206 executes the query by retrieving feature vectors from the database (using a database interface 208), comparing them against the example feature vector, and creating a score sheet indicating the similarity of each retrieved image to the example image. A result processor 210 uses this score sheet and the user-defined weights to compute a final ranked list. Images (or their thumbnail representations) corresponding to the ranked list are retrieved through the database interface 208 and presented to the user through the results interface 212. The VIR system 200 may include a query refinement interface 214 that sends refinement parameters, based on user input, to a query refinement processor 216 for modification of a query. The query refinement processor 216 may send a new query structure to the query processor 206 to produce a new set of query results, and/or may send a refinement request to the result processor 210 to refine the prior set of query results. One type of query refinement is weight manipulation.
Although these systems work well for query-by-example cases, they are very dependent upon the properties of the example that is used as the clue. Hence, for these systems, if the query image shows a white bird flying in a blue sky above some trees, the system may never retrieve (or rank highly) an image of a bird silhouette flying against a dark red sunset over a sea, because the characteristics of these two images are very different. If the user's intent was to find a flying bird, the heavy dependence on the query example will lead to unsatisfactory performance.
c) Systems that perform domain-specific image analysis: Some very domain-specific systems, such as the face-recognition system of Pentland (U.S. Pat. No. 5,164,992), use prior training to enable the query engine to find similar images that are intuitively meaningful.
However, these systems work only within the scope of the intended domain, are not easily extensible, and are not suitable for a database with heterogeneous content.
d) Systems using keywords and image properties: A few recent systems, such as the Photodisc system, have started using textual image descriptors together with computed image properties. These systems accept both textual query terms and an image example as input. Typically, they first issue a text query, as in the keyword-only systems described above, and get back a set of image identifiers that satisfy the query. Next, they take this resulting set of images and perform an image similarity query, such as is performed by the generic image analysis systems described above. These systems work very well for properly annotated databases, but suffer the same defects as the keyword-only systems and the generic image analysis systems.
An interesting variant of this theme was used in the Japanese patent application of Akio Hamano (Kokai Patent Application No. HEI 2-51775 [1990]) of Nippon Telegram and Telephone Public Corporation. The Hamano application starts with the assumption that each image shows a number of objects, and that for each object, the name and spatial properties (size, position, and attitude) are stored by the system. Further, the system stores a "schematic diagram" for each image to depict the relative positions (e.g., the "fore and aft relationship") of these objects. The user initiates a query by selecting an image and inputting a keyword, such as "apple". The system retrieves a default property list for the object having the keyword. It also presents the schematic diagram data corresponding to this keyword for the current image. The user is allowed to interactively modify either the property list (e.g., make the apple bigger) or the relationships depicted in the schematic diagram (e.g., move the apple from the center of the desk to a corner) through a graphical interface.
When all objects pertaining to a desired image have been thus modified, the system computes a "similarity criterion" from the edited image to search the database. For example, if the position of the apple was modified from p1 to p2, the system creates a rectangular region around p2 in which to look for an apple in the database. It is important to note that the system is still looking for an apple using keyword annotations of the images in the database. The characteristic of the Hamano application is that when it finds a candidate image in the database containing an apple by keyword search, it screens the image by comparing the similarity criteria set by the user with the schematic diagram and properties of the apple in the candidate image.
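The rectangular-region screening described above amounts to a simple geometric test; a sketch, with illustrative coordinates and an assumed tolerance:

```python
def screens(candidate_pos, target_pos, tolerance=20):
    """True if the candidate object's stored (x, y) position lies within
    a rectangular region of +/- `tolerance` around the user-edited position."""
    (cx, cy), (tx, ty) = candidate_pos, target_pos
    return abs(cx - tx) <= tolerance and abs(cy - ty) <= tolerance

# Candidate images found to contain an "apple" by keyword search, with the
# stored position of the apple in each (illustrative data).
candidates = {"imgA": (110, 95), "imgB": (300, 40)}
p2 = (100, 100)  # position of the apple after the user's edit

print([img for img, pos in candidates.items() if screens(pos, p2)])  # ['imgA']
```

Keyword search still does the retrieval; the geometric test merely filters the candidates, which is the characteristic of the Hamano approach noted above.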