The methods in this invention stem from five principal backgrounds. Image segmentation is the algorithmic or user-controlled identification of a region or regions of one or more images. Image structuring involves automatically or semi-automatically recognizing and structuring the contents of an image. Computer Aided Design (CAD) and Computer Aided Manufacture (CAM) are processes by which tools are used to design and construct 2D and 3D environments. Content Based Image Retrieval (CBIR) uses image measures, including segmentation, to classify and retrieve image data. Finally, lighting removal is a means by which the effects of light and shadow can be detected and removed from an image.
Image Segmentation
Image segmentation algorithms have the goal of partitioning an image into regions that are usually non-overlapping. At their most basic, they fall into primary categories of thresholding/clustering, edge detection and area-growth based segmentation, or some combination thereof. Clustering and thresholding involve taking measures of the whole image, and seeking some form of grouping of these measures, and then applying that to group the image itself. Edges are roughly continuous thin portions of an image that exhibit some peak discrepancy between image pixels on either side of the boundary. Area based segmentation involves applying criteria to groups of pixels, that if satisfied result in them being classified as part of an area. Adobe Photoshop appears to use area growth as a means for selecting complex regions in an image using the magic wand brush.
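By way of illustration only, area growth of the kind described above may be sketched as follows. The function name grow_region, the 4-connected neighborhood, and the fixed intensity tolerance measured against the seed pixel are assumptions made for this example; the sketch is not represented as the method used by any particular product.

```python
from collections import deque


def grow_region(image, seed, tolerance):
    """Return the set of (row, col) pixels connected to `seed` whose
    intensity lies within `tolerance` of the seed intensity.

    `image` is a list of rows of numeric intensities (an assumed format).
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        # Visit the four direct neighbors (assumed connectivity).
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                # Area criterion: accept pixels similar to the seed.
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    frontier.append((nr, nc))
    return region


# A small synthetic image: a dark area (values near 10) and a bright area.
image = [
    [10, 12, 11, 90],
    [11, 13, 95, 92],
    [12, 94, 93, 91],
]
dark = grow_region(image, seed=(0, 0), tolerance=5)
```

In this example the six dark pixels are collected and growth stops at the bright area. A looser tolerance would allow the region to spread across the boundary, which is one reason the termination criterion is difficult to choose.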
The advantage of clustering and thresholding is that the entire image acts as a sample. If the sample is large, new information can be discerned. This is also its disadvantage, in that it fails to take into account the spatial information of an image. In addition, clustering algorithms such as k-means require as input an expected number of means, whereas images may have a varying number of actual distinct groupings of data. Boundary and edge detection excels at finding region borders, but can be overwhelmed by noise or heavily textured areas. Finally, region growth can overcome noise, but it may be difficult to determine when to terminate growth, and as the measure is area based, the sample size can blur sharp boundaries.
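The dependence of k-means on a supplied number of means may be illustrated with a minimal sketch on scalar pixel intensities. The function name kmeans_1d, the initialization from the sorted data, and the fixed iteration count are assumptions chosen for brevity rather than features of any cited method.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on scalar values; the caller must choose k."""
    ordered = sorted(values)
    # Spread the initial centres across the sorted data (an arbitrary choice).
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest centre.
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster; keep it if empty.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres


# Nine pixel intensities forming three distinct groups.
pixels = [10, 12, 11, 50, 52, 51, 90, 92, 91]
centres_k3 = kmeans_1d(pixels, k=3)
centres_k2 = kmeans_1d(pixels, k=2)
```

With k=3 the three groups are recovered separately. With k=2 the two darker groups are merged under a single centre, showing how an incorrect guess for k distorts the grouping. Note also that the pixel positions play no part in the computation.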
More recent techniques have concentrated on combinations of these approaches. Malik (2001) introduces the scheme of graph cuts to partition the space of a large number of filter results, in combination with an area-based measure for detecting boundaries. This still requires an input guess for the k-means algorithm, where for this algorithm k=25 is recommended.
Another approach to segmentation is to limit the set of initial conditions. U.S. Pat. No. 5,642,443 to Goodwin uses color and (lack of) texture to indicate pixels associated with sky in the image. In particular, Goodwin utilizes partitioning by chromaticity domain into sectors. Pixels within sampling zones along the two long sides of a non-oriented image are examined. If an asymmetric distribution of sky colors is found, the orientation of the image is estimated. This may work well in specific instances, but sky regions tend to be easily identifiable in contrast with ground. The field of our invention is the identification of more complex regions, so this approach is inappropriate.
U.S. Published Patent Application 20030053686 to Jeibo details a system whereby red, green, blue (RGB) pixels within an image are assigned per-pixel belief values rating their belonging to subject matter based on color and texture. These values are then thresholded to find candidate regions, and the regions are further analyzed using unique attributes of each region to generate a resulting map of regions. This allows usage of global measures similar to those applied in Malik (2001), with the addition of a local pass. To be effective, the color and texture must be known or a wide range calculated. The goal of that application is detection of sky data, and hence the range of measures may be limited. In addition, a classifier is still required to partition the various measures into classes. 20030053686 to Jeibo proposes a neural network, which is trained on a series of sky images. Detection of regions in general images would require a significant amount of training, so this method appears best suited to specific, well-known and highly defined tasks.
Computer Aided Design
Within the domain of 3D CAD/CAM (Computer Aided Design, Computer Aided Manufacture), a 2D or 3D space containing data representations is presented to the user, with the objective of fashioning more 2D or 3D data in accordance with a set of user commands. The data is normally entered by the user, or derived from image data. In many current embodiments, the derivation of the data from an image is a separate process from that of altering and interacting with the data. For example, RealViz Image Modeller may be used to identify objects and geometry within an image. However, the objects and geometry are then exported to an application such as Alias Systems Maya for manipulation.
Shape Recognition and Extraction
The process of recovering three-dimensional structure from two-dimensional images is an ongoing field of research in Photogrammetry and Computer Vision. No general-purpose techniques exist to completely accomplish this task, but there are many specific techniques for doing so. One means of classifying such techniques is into two major categories, namely those that provide results either with or without user assistance. For the purpose of this disclosure, applications requiring user assistance will be the focus of the background material as they contribute to the field of the invention.
A large body of work exists toward automatic extraction and recognition of shapes, objects, and groups from images. The primary goal of such work is toward computer vision, where an autonomous agent may recognize and manipulate a location through automated conversion of images into object models.
An early approach to manual extraction of objects is detailed in Lin and Nevatia (1995). They describe a process whereby multiple objects are extracted from a single oblique image. A hierarchical perceptual grouping process is used to generate 2D roof hypotheses from fragmented linear features of the input image, which are then verified. A 3D description of the building may then be derived. This is helpful for identifying simple planar objects, but they do not detail an application to more complex shapes.
Ylä-Jääski and Ade (1996) detail a methodology for shape recognition for industrial robotics using either monocular or multiple images. Edge information is determined within an image. The edges are grouped by use of symmetry into linear segment pairs (LSP). Multiple LSP are grouped to make more complex shapes that are considered to be objects, which may then be compared with a known object database. For example, the tines on a fork are recognized as multiple parallel structures that are grouped together, and then matched against descriptions of forks. This method works under idealized conditions, where the object is easily extracted from a background, but would be far more difficult to apply in arbitrary environments.
Debevec et al (1996) describe an approach for extracting three-dimensional shape from multiple images by a human operator. The user applies geometric primitive operations to views of the objects, fitting shapes such as edges and cubes to remain consistent between multiple views. This is easily applicable to simple geometric objects, where many of the constituent planes are easily identified and are oriented at common angles—for example 90 degrees. It is difficult to apply in the presence of more complex and soft objects, such as trees or pillows, and also requires multiple images.
Liebowitz et al (1999) detail the construction of objects from single or multiple images. A perspective function is identified for the image using detected vanishing points and lines, and then objects within it may be identified. This is similar to Debevec et al (1996), but works for single images. In addition, images can be improved upon. In the example of La Flagellazione di Cristo, the floor is touched up by use of the symmetrical pattern upon it, and the ceiling is reconstructed from a partially visible segment. As with Debevec et al (1996), it is difficult to apply in the presence of more complex and soft objects.
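The geometric basis of a vanishing point may be sketched as the intersection of two image lines that are parallel in the scene, computed here with homogeneous coordinates. The function names and the two example lines are assumptions made for illustration. The sketch shows only this elementary step and is not represented as the procedure of the cited work.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(line_a, line_b):
    """Intersection of two homogeneous lines, or None if the lines are
    also parallel in the image (intersection at infinity)."""
    x, y, w = cross(line_a, line_b)
    if abs(w) < 1e-12:
        return None
    return (x / w, y / w)


# Two edges of a receding floor pattern, as they might appear in an image.
left_edge = line_through((0, 0), (4, 8))
right_edge = line_through((10, 0), (6, 8))
vp = vanishing_point(left_edge, right_edge)
```

Here the two edges converge at the image point (5, 10). In practice many detected line segments would be combined, and a robust estimate would be preferred over a single pair.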
Oh (2001) details an approach to manually extracting three-dimensional shape from images. Their paper claims a time saving over existing techniques and applicability to single images, primarily by using an image editing or painting metaphor. This compares with Debevec et al (1996), where multiple images are a requirement, and geometric extraction is paramount. However, the approach detailed in Oh (2001) is still very intensive, requiring a large number of user-applied operations to extract shape outlines and apply depth in their example. The church example required 10 hours of labor.
A similar technique is introduced in Sturm and Maybank (1999), whereby interactive 3D reconstruction from a single image is achieved via piecewise planar objects. Camera calibration and 3D reconstruction are done using geometrical constraints provided by the user. The user identifies points within an image that are co-planar. Line segments are used to outline segments of the image, which are mapped to 3D planes. Finally, VRML 3D objects are automatically generated from the segments, and textured using the image segment bitmap data.
Decoupling Illumination From Texture
Editing of images is often complicated by the illumination of the content. Consider attempting to replace an existing material in an image with a new material, or placing a new object in the image. The application GrowIt Gold Garden & Landscape 2, by Innovative Thinking Software, is a classic example of the issue. Plants may be placed on top of a photographic image of an existing landscape. However, it is obvious in example images that the plants belong in a different photograph, as their brightness and direction of illumination differ markedly from those of the base image.
Oh (2001) details an approach to dealing with the replacement of materials. They make the assumption that large-scale luminance variations are due to the lighting, while small-scale details are due to the texture. They use a Gaussian filter to separate the illumination component from the surface texture. They do require that the user manually identify the feature scale, orientation and perspective effects of each surface. This can take significant time. They do not claim that these are the true texture and illuminance, instead providing a way of factoring low frequency variations. Fine shadows are not preserved.
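The low-frequency factoring described above may be illustrated with a one-dimensional sketch. A Gaussian blur of a luminance signal is taken as the illumination estimate, and the ratio of the signal to that estimate is taken as the texture. The use of a ratio, the single fixed sigma, the border clamping, and all function names are assumptions made for this example. The manual per-surface scale, orientation and perspective handling of the cited approach is not modeled.

```python
import math


def gaussian_kernel(sigma):
    """Normalized Gaussian weights truncated at about three sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma):
    """Gaussian low-pass filter with clamped borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            k = min(max(i + j - radius, 0), n - 1)  # repeat the edge samples
            acc += w * signal[k]
        out.append(acc)
    return out


def decouple(luminance, sigma):
    """Split strictly positive luminance into a smooth illumination
    estimate and a residual texture factor (assumed multiplicative)."""
    illumination = blur(luminance, sigma)
    texture = [l / i for l, i in zip(luminance, illumination)]
    return illumination, texture


# A slow lighting ramp multiplied by a fine alternating texture.
luminance = [(1 + 0.05 * i) * (1.1 if i % 2 else 0.9) for i in range(40)]
illumination, texture = decouple(luminance, sigma=3.0)
```

Because the filter removes all fine detail, any shadow narrower than the chosen scale is assigned to the texture factor rather than to the illumination, consistent with the observation that fine shadows are not preserved.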
Content Based Image Retrieval
In the field of Content Based Image Retrieval (CBIR), a variety of approaches have been detailed to specify and extract information from images such that useful information is more easily accessible from a large amount of image data. Content comparison is an important issue in CBIR. In general, such comparisons are performed either globally using measures applied to the entire image or locally based on the structure or composition of regions/objects gained from image segmentation. A large body of work exists, and due to limited space we review those works most related to our own. Region based search is covered in greater depth, as it more directly relates to the utilization of the region segmentation process we propose.
One example of a global technique is the Photobook project (Pentland et al 1994) at the M.I.T. Media Lab. Global compositional methods are applied, such as foreground extraction. Images are classified at load time as having “face”, “shape”, or “texture” properties, and such encoded information is searched during queries. Such a classification system is limited by the accuracy of the classifier methods applied.
The QBIC project described in Niblack (1993) uses image analysis to process queries for an image database. Color, shape, and texture measures are applied to match images in the database to a user's query. The user may specify queries by sketching shapes, or by selecting from a list of textures or a color wheel. The system then seeks content matching the user's input. Selecting from a finite list introduces problems, as many desirable classes of data may not be present. A large range of options may solve this. However, this large range can overwhelm the user with the sheer number of choices. The choices may be grouped for easier user access, but the user would have to be familiar with, or be trained in, the classification methodology.
Carson et al (1996) describe a method that allows selection from lists of attributes such as color. The user still has to find the attributes in lists, and may either not find a good match or have to browse a huge range of options. It would be much better to point to an object or material/color in an existing image and specify that this is what should be searched for.
U.S. Published Patent Application 20030179213 to Lui teaches a method for automatic retrieval of similar patterns in image databases. The method is global in approach, and is primarily aimed at video image retrieval. Color is measured through a global histogram, and feature content is measured through wavelets. The results of both are combined and weighted for a global measure. The histogram technique is illumination invariant, and the wavelet approach is robust under rotation. This approach allows a comparison of overall image similarity. Segmentation is not involved, and hence no search for a sub-part of an image matching a query is possible.
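The limitation of a purely global measure may be illustrated with a simple histogram comparison. The bin count, the normalization, and the use of histogram intersection as the similarity score are assumptions made for this example, as are the function names. The sketch does not reproduce the illumination-invariant histogram or the wavelet measure of the cited application.

```python
def global_histogram(pixels, bins=4, max_value=256):
    """Normalized histogram over all pixels of an image, ignoring position."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    total = len(pixels)
    return [c / total for c in counts]


def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Two images with the same content in different places, and a third
# image with different content (pixels listed in scan order).
dark_on_left = [20, 30, 200, 210]
dark_on_right = [200, 210, 20, 30]
mid_grey = [100, 110, 120, 130]

same_content = intersection(global_histogram(dark_on_left),
                            global_histogram(dark_on_right))
different_content = intersection(global_histogram(dark_on_left),
                                 global_histogram(mid_grey))
```

The first two images score as identical although their layouts differ, and nothing in the measure indicates which part of an image produced a match. A query for a sub-part of an image therefore cannot be answered from such a measure alone.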
A major drawback of the global search lies in its sensitivity to intensity variations, color distortions, and cropping. In addition, we desire to find potentially local phenomena, such as matching a specific fabric in one image with similar fabrics in other images, where the pieces of the image exhibiting the desired characteristics may be small. Hence, region based search is more desirable. Some region based search approaches are described below.
Wang et al. (2001) detail a system known as SIMPLIcity (Semantics-sensitive Integrated Matching for Picture Libraries), an integrated region-based image retrieval system. It uses semantics classification based on the region representations of the image fragments. A similarity measure between image fragments is computed based on multiple region representations. It incorporates the properties of all the segmented regions so that information about a fragment can be fully used, and to overcome errors apparent in segmentation algorithms. They cite an example error or ambiguity of a dog being segmented as one region, or as legs, body, head and other regions. However, this improvement in accuracy does not solve the issue of searching for specific components of an image, and the benefit of their technique lies in avoiding these types of queries, as the algorithm works best when multiple region results form the basis of a query and search. Their similarity measure is used to compare the similarity of multiple regions, rather than to identify the regions themselves. This “soft-similarity” is more robust in the presence of individual region segmentation errors by considering multiple regions, but sacrifices the ability to use individual region or even sub-region information to specifically classify and compare images.
Chen and Wang (2002) describe a method for searching for regions to characterize an image. They use a k-means segmentation technique, which requires prior judgment of the value k or some resulting inaccuracy if a more general value of k is used. They avoid the issue of poor segmentation by applying fuzzy logic to match regions identified, and their method is aimed at that of matching entire images. Many sample regions are required to overcome the inherent inaccuracies in identifying regions, so that matching of a small portion of an image to another small portion is highly error prone. Hence, their technique is limited to entire images ideally, or significant areas of an image if greater inaccuracy is tolerated.
Software known as Blobworld is detailed within Carson et al (2002). It allows selection of specific features to match on, with segmentation utilizing k-means techniques. They modify k-means to account for photographic composition, with the assumption of a primary object or subject exhibiting a limited set of arrangements. To query an image, a user is provided with the segmented regions of the image and is required to select the regions to be matched and also attributes, e.g., color and texture, of the regions to be used for evaluating similarity. However, selection of regions does not necessarily match user expectations, especially if the segmentation is error prone. The k-means composition assumption limits the range of images this may be applied to, as they propose composition based on a simple set of pre-defined arrangements. This assumption would break if other forms of composition were applied. In the field of our invention, there may be no defined main subject or section, or arbitrary numbers of subjects may be present, as the image may be composed of a variety of furnishings or plants.
U.S. Pat. No. 6,584,221 to Moghaddam allows the identification of images in a database based on a user specified region of interest. This is achieved by division of images into blocks based on color and texture, and then use of the combined distributions of each of those blocks as a means of comparison. Segmentation is rudimentary, as the image is partitioned into blocks rather than irregular segments. In the preferred implementation the blocks are of size 16×16 pixels. The problem with this approach is that the blocks may overlap boundaries of image regions, resulting in incorrect image measures. In addition, the nature of the block size implies that if they are to encapsulate a given feature scale for a given region, the block size will prove wasteful or too small for accurate neighboring region classification if the scales differ between these regions.
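The boundary problem with fixed blocks may be illustrated with a small sketch. An image is partitioned into 16×16 blocks and a single measure is computed per block. Mean intensity is used here only as a stand-in for the color and texture distributions of the cited patent, and the function name and synthetic image are assumptions made for this example.

```python
def block_means(image, block=16):
    """Mean intensity of each block, keyed by (block_row, block_col)."""
    rows, cols = len(image), len(image[0])
    means = {}
    for top in range(0, rows, block):
        for left in range(0, cols, block):
            values = [image[r][c]
                      for r in range(top, min(top + block, rows))
                      for c in range(left, min(left + block, cols))]
            means[(top // block, left // block)] = sum(values) / len(values)
    return means


# Synthetic 16x32 image: a dark region (0) and a bright region (100)
# whose boundary at column 24 falls in the middle of the second block.
image = [[0 if c < 24 else 100 for c in range(32)] for _ in range(16)]
means = block_means(image)
```

The first block lies wholly inside the dark region and reports its value correctly. The second block straddles the boundary and reports a value of 50, which matches neither region.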
Problems
Existing techniques that match on regions tend to either provide potentially inaccurate results due to the inaccuracy of the segmentation, or apply measures of multiple segments to reduce the inaccuracy, at the expense of specificity. This precludes the possibility of queries based on a single region, unless by chance multiple segments in both source and destination contain the characteristics of that region. The accuracy of matching of specific regions would be improved by the application of better segmentation techniques, designed to facilitate this goal.
In addition, inaccurate results from segmentation techniques lead to problematic usage in image editing applications. User identification of a desired segment is not certain to achieve the desired result, as segments may cross the boundaries of what the user considers a discrete region, or the technique may over-segment images into a confusing patchwork which the user must merge to make useful. It is difficult to create a segmentation technique that would match user criteria, but a technique that instead identified salient sub-regions may have greater utility.