The present invention pertains to a method of object recognition and, more particularly, to a method of object recognition that uses a human-like language, based on the vocabularies used in photointerpretation, to write solution algorithms.
In the art of object recognition, one usually extracts an object with image and/or map data by using one of three major methods: (1) a manual method in which an analyst extracts an object by using the human visual system, (2) an automated method in which the analyst relies totally on a machine system to perform the task, and (3) an interactive method in which the analyst determines a final decision, while a machine plays an assistant role.
Using a manual mode, an analyst does not need to employ a computer to do anything except to display an image.
In employing an automated system, once the data is entered into the machine, a machine system extracts the intended object. The analyst is merely a receiver of the data-processing results. In the event that the analyst is dissatisfied with the performance of the machine system, necessary changes can be made to the solution algorithms. In this automated mode, the analyst still has nothing to do with either the machine or the means by which objects are extracted.
In a conventional, interactive mode of information processing, the level of interaction between an analyst and a machine system can vary greatly. The least amount of interaction occurs when a machine system provides a set of solutions to the analyst, and the analyst selects or rejects one or more of the proffered solutions. On the other hand, the analyst can interact intensively with the machine by: (1) pre-processing image data by using a set of functions provided by the machine system; (2) analyzing the content of the scene by using a set of functions provided by the machine system; (3) performing a set of object extraction options by utilizing the information provided in the aforementioned steps; and (4) evaluating each result and then selecting or rejecting it.
In a conventional system utilizing intense interaction, an analyst is still either a mere operator or, at best, an effective, efficient user of the machine system. In other words, under these conditions, no matter how good the analyst is in the extraction or recognition of an object, conceptualized "algorithms" cannot be converted into a computer-workable program.
Conventional feature and object extraction for mapping purposes is based on high resolution panchromatic images supplemented by color imagery. A feature boundary, such as the contact zone between a forested area and a cultivated field, is determined by using standard photo-interpretation principles, such as a Tone principle, a Texture principle, a Size principle, a Shape principle, and so on, based on one single image. The use of multiple images, such as a system with three graytone images representing the near infrared spectrum, the red spectrum, and the green spectrum, in determining an object boundary can be very confusing and time-consuming. Therefore, to be helpful, these multispectral imagery data sets must be converted to a single-band scene serving as a base image map (IM) for manually-based feature extraction.
In the past 30 years or more, image processing and pattern recognition have centered on extracting objects, using simple and complex algorithms, within an image of appropriate dimensions, such as 128×128, 256×256, 512×512 and 1024×1024 pixels. It is extremely rare for a complex algorithm to extract an object from a scene larger than 2048×2048 pixels, because historically even a workstation has had limited memory capacity for handling large images.
From the above discussion, it is clear that there exists a gap in the concept of scale in a physical space, and a gap in information processing between the mapping community and pattern recognition scientists. In essence, cartographers deal with space in degrees of longitude and latitude, whereas image processing scientists think in terms of objects in a scene of 512×512 pixels. Among other objects of this invention, this conceptual and information processing gap is to be bridged.
The present invention is an innovative object-recognition system that divides objects into two broad categories: first, those for which an analyst can articulate, after examining the scene content, how he or she would extract an object; and, second, those for which an analyst cannot articulate how to discriminate an object against other competing object descriptors, after examining the scene or a set of object descriptors (e.g., a spectral signature or a boundary contour).
In the first case, where an analyst is able to articulate the extraction of the objects, the proposed solution is to employ a pseudo-human language, including, but not limited to, pseudo-English as a programming language. The analyst can communicate with a machine system by using this pseudo-human language, and then inform the machine how he or she would extract a candidate object without having to rely on a "third-party" programmer.
In the second case, where an analyst is unable to articulate the extraction of an object, the proposed solution is to use an appropriate matcher with a matching library to extract the candidate object, and then pass it over to processors employed in the first-category sphere. Once an extracted object is passed over to the first environment, this object becomes describable by using the proposed pseudo-human language. Thus, it can be combined with other "existing objects" to extract still further objects. The final result, then, is the extraction of a set of complex objects or compound objects.
In the past 50 years, photointerpreters have been taught to use the principles governing the following aspects in recognizing an object: (1) tone or spectrum principles; (2) texture (spatial variation of tones); (3) size; (4) shape; (5) shadow (detection of vertical objects); (6) pattern (geometry and density); (7) associated features (contextual information); and (8) stereoscopic characteristics (height), if available.
From these principles based on the human visual system, an object can exist in a number of forms, as shown below in Table I.
From Table I data, it can be inferred that a spectral-matching-alone system can extract only one of the seven object types, i.e., (a). A shape-alone system can extract only two of the seven object types, i.e., (b) and (d).
The proposed system of this invention is intended for extracting all seven types of objects by using image and map data, such as synthetic aperture radar (SAR), and multi-spectral and other types of sensory data, with the assumption that appropriate matching libraries are available. For example, the following libraries are readily available: (1) a hyperspectral library of various material types; (2) a ground vehicle library for FLIR (forward-looking infrared) applications; and (3) a ground vehicle library for LADAR (laser radar) applications.
Using these libraries, the method of this invention first extracts single-pixel and single-region-based objects, and then "glues" them together to form multi-object-based object complexes.
Table II below illustrates this two-stage, object-extraction approach.
The uniqueness of this inventive method lies in using a pseudo-human language (such as a pseudo-English-based programming language), compatible with an interpreters' language, to perform this "object-gluing" process. For example, to extract a complex object having two subparts, such as an engine and a body, the following Table can be utilized.
In line 1 of Equation 1, both Part_1 and Part_2 are extracted by using a rule-based system in the inventive object-recognition system. These two objects can also be extracted by using a matching library; in that case, Lines 2 and 3 in Equation 1 are not necessary, but the two objects must still be extracted before Equation 1 is executed.
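The "gluing" of two already-extracted parts can be illustrated with a minimal Python sketch. It is not a reproduction of Equation 1, which is not shown here; the pixel-set representation, the 4-neighbour definition of touching, and the names `touches`, `glue`, `engine` and `body` are all assumptions made for illustration.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]   # (row, col)
Region = Set[Pixel]       # an extracted object, stored as a set of pixels


def touches(a: Region, b: Region) -> bool:
    """Return True if some pixel of `a` is 4-adjacent to some pixel of `b`."""
    for (r, c) in a:
        for neighbour in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if neighbour in b:
                return True
    return False


def glue(a: Region, b: Region) -> Region:
    """Form a compound object from two touching parts; empty set otherwise."""
    return a | b if touches(a, b) else set()


# Hypothetical parts: an "engine" block directly to the left of a "body" block.
engine = {(0, 0), (0, 1), (1, 0), (1, 1)}
body = {(0, 2), (0, 3), (1, 2), (1, 3)}

vehicle = glue(engine, body)
print(len(vehicle))  # number of pixels in the compound object
```

Whether a part came from a rule-based extractor or from a matching library makes no difference to `glue`; it needs only the two pixel sets, which is one way to picture the integration of the two classifier types.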
Another innovative feature of the system, therefore, is that it allows one to integrate a matcher-based classifier with a rule-based classifier within one image-exploitation environment.
When asked how a conclusion is derived for a given image complex in an area of interest (AOI) that may contain a target (or is, in fact, a target of a certain type), a photointerpreter would most likely give an answer in terms of a combination of these photointerpretation keys listed in the second and third equation, column II Table III below, or as the following:
1. The area is a small region: a size criterion.
2. The area contains a bright spot: a tone criterion.
3. It is not a vegetated region: an associated feature principle.
4. It is close to a trail or a road: an associated feature principle.
(Equation 2)
Equation 2 indicates that the photointerpreter is capable of articulating a target-extraction process in terms of two levels of sophistication: (1) how a target complex is different from its background; and (2) how a given target is different from the other targets.
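The four criteria of Equation 2 can be read as a conjunction of simple tests on a candidate region. The following Python sketch shows one possible reading; the field names and all numeric thresholds are hypothetical, since the text gives no values for "small", "bright" or "close".

```python
from dataclasses import dataclass


@dataclass
class RegionFeatures:
    area_px: int            # number of pixels in the region
    max_tone: int           # brightest gray level inside the region (0-255)
    vegetated: bool         # outcome of some earlier vegetation test
    dist_to_road_px: float  # distance to the nearest trail or road, in pixels


# Hypothetical thresholds, chosen only for illustration.
MAX_AREA_PX = 50
BRIGHT_TONE = 200
MAX_ROAD_DIST_PX = 10.0


def is_candidate_target(r: RegionFeatures) -> bool:
    """Apply the four photointerpretation criteria as one rule."""
    small = r.area_px <= MAX_AREA_PX                   # size criterion
    bright = r.max_tone >= BRIGHT_TONE                 # tone criterion
    not_vegetated = not r.vegetated                    # associated feature
    near_road = r.dist_to_road_px <= MAX_ROAD_DIST_PX  # associated feature
    return small and bright and not_vegetated and near_road


print(is_candidate_target(RegionFeatures(20, 230, False, 4.0)))
```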
This group of targets is denotable as "describable or articulatable by using a photointerpreter-based human language." (Equation 3)
In many cases, it is difficult to articulate by using a human language how one spectral curve is different from another, or how to match one observed boundary contour with a set of contours that is stored in a shape library. However, one can obtain a correct match by using a matching algorithm that is executed by a computer.
This group of targets can be denoted as "cannot be articulated with a human language, but extractable by using a computer-based, matching algorithm". (Equation 4)
In certain cases, a target complex becomes describable via a human language, after a computer-based matcher has identified the internal features (parts) of the target. For example, if a camouflage net comprises three kinds of materials (e.g., green-colored material, tan-colored and yellow-colored), one can identify each pixel by its material type and output the results in terms of a three-color decision map. The target complex thus becomes a describable object, such as:
(1) The area of interest contains three types of material, viz., green-, tan- and yellow-color based; this is a tone principle.
(2) The three colors touch one another; this is an associated-feature principle, as well as a texture principle.
(3) The sum of these pixels is in an interval of 15-to-20 pixels; this is a size principle.
This group of targets can be denoted as "describable or articulatable, after a classification process is completed with a matching library". (Equation 5)
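The camouflage-net description above can be checked mechanically once a matcher has produced a decision map. The Python sketch below assumes a small integer-coded map (the codes 1, 2, 3 and the helper names are hypothetical) and tests the three stated conditions: all three materials are present, the materials touch one another, and the total pixel count lies between 15 and 20.

```python
from itertools import combinations
from typing import Dict, List

GREEN, TAN, YELLOW = 1, 2, 3   # hypothetical material codes; 0 means "other"
MATERIALS = (GREEN, TAN, YELLOW)


def count_materials(dmap: List[List[int]]) -> Dict[int, int]:
    """Count the pixels assigned to each of the three materials."""
    counts = {m: 0 for m in MATERIALS}
    for row in dmap:
        for value in row:
            if value in counts:
                counts[value] += 1
    return counts


def materials_touch(dmap: List[List[int]], m1: int, m2: int) -> bool:
    """True if some pixel of material m1 is 4-adjacent to a pixel of m2."""
    rows, cols = len(dmap), len(dmap[0])
    for r in range(rows):
        for c in range(cols):
            if dmap[r][c] != m1:
                continue
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= rr < rows and 0 <= cc < cols and dmap[rr][cc] == m2:
                    return True
    return False


def is_camouflage_net(dmap: List[List[int]], lo: int = 15, hi: int = 20) -> bool:
    counts = count_materials(dmap)
    all_present = all(n > 0 for n in counts.values())        # tone principle
    all_touch = all(materials_touch(dmap, a, b)
                    for a, b in combinations(MATERIALS, 2))  # associated feature
    right_size = lo <= sum(counts.values()) <= hi            # size principle
    return all_present and all_touch and right_size


decision_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 1, 3, 3, 2, 0],
    [0, 1, 3, 3, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
]
print(is_camouflage_net(decision_map))
```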
To extract objects and features for mapping purposes, it is a common practice by cartographers to use high resolution panchromatic orthophotos (imagery with a constant scale over the entire scene) as the source. These digital orthophotos usually cover a large geographic region that uses 7.5 minutes in longitude and latitude directions as one basic spatial unit. In this spatial framework, it is not uncommon to find an orthophoto that covers a spatial extent of one-half degree in both longitude and latitude directions. How can such a geographic dimension be translated into image sizes?
Consider an approximate length for one degree on the earth coordinate system: 110 kilometers. One half of a degree is approximately 55 km. If a panchromatic image has a linear spatial resolution of 5 meters per pixel, a square region of one half of a degree is equivalent to 11,000×11,000 pixels. A one-degree region on the earth is covered by a scene of 22,000×22,000 pixels at the linear resolution of 5 meters per pixel. It is not unusual for a cartographer to extract features from a one-degree scene. In digital image processing, on the other hand, a unit of analysis is usually set at the level of 512 by 512 pixels or 1024 by 1024 pixels. In other words, it is rare that a sophisticated feature extraction algorithm is applied to a scene of 2048 by 2048 pixels.
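The conversion from geographic extent to image size is simple arithmetic, shown here as a short Python sketch. The function name `scene_side_pixels` is hypothetical; the 110 km per degree figure and the 5 m resolution are those used in the text.

```python
KM_PER_DEGREE = 110.0  # approximate length of one degree, as stated in the text


def scene_side_pixels(extent_deg: float, metres_per_pixel: float) -> int:
    """Side length, in pixels, of a square scene spanning `extent_deg` degrees."""
    side_m = extent_deg * KM_PER_DEGREE * 1000.0
    return round(side_m / metres_per_pixel)


half_degree = scene_side_pixels(0.5, 5.0)  # 11000
one_degree = scene_side_pixels(1.0, 5.0)   # 22000
print(half_degree, one_degree)

# How many times larger a one-degree scene is than a 1024 x 1024 analysis unit.
print((one_degree * one_degree) / (1024 * 1024))
```

A one-degree scene therefore holds several hundred times as many pixels as a typical 1024 by 1024 unit of analysis, which is the scale gap the text describes.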
Using the human visual system as a means for object extraction, a cartographer can handle a one-degree scene without much difficulty. Since an object like a runway has a distinct spatial base and dimension, to extract it from an image, the cartographer must destroy a tremendous amount of information, while at the same time creating additional information that does not exist in the image. This invention simultaneously creates and destroys information for object extraction in images by using a digital computer, just as a sculptor creates a work of art while simultaneously removing undesired material.
The present invention proposes a communication means between an analyst and a computer, that is, a human-computer interface, in the form of a pseudo-human-based programming language, with which a photointerpreter can extract the two types of target complexes.
In addition to serving as an interface module between an analyst and a computer, this language functions in two significant ways: (1) it is a vehicle for one to capture and preserve the knowledge of the human analysts; and (2) it is an environment in which an analyst can organize his or her image-exploitation knowledge into computer-compilable programs, i.e., it is an environment for knowledge engineering of automatic object-recognition processes.
Table IV summarizes the above-discussed target extraction process with a pseudo-English language as a programming language.
The inventive system is based upon the model shown in Table IV. The target extraction philosophy of this invention can also be summarized in Table V by using the model of Table IV as its base.
Schutzer (1985), in his article entitled "The Tools and Techniques of Applied Artificial Intelligence" in Andriole (1985 ed.), listed LISP and PROLOG as applicable artificial-intelligence (AI) languages. The inventive language differs from these.
First (as noted by Schutzer), LISP, a short form of "List Processor", is primarily designed as a "symbol-manipulation language". While it is a powerful language, it differs significantly from the inventive language form illustrated in Equation 1 in that each system has its own, distinct vocabulary and syntax.
The second major AI language discussed by Schutzer, PROLOG, denotes "programming in logic". The most distinct feature of PROLOG is that in solving a problem therewith, the user states the problem, but not the procedures by which the problem is to be solved. In the inventive system, the analyst must specify the exact procedures as to how a specific object is to be defined (as illustrated in Equation 1, in which a third object is extracted because the first object "touches" the second object).
Conventional computing languages also include assembler, FORTRAN, PASCAL, C, C++, etc. All of these languages are machine-oriented, rather than human oriented. Thus, these languages are rarely used for man-machine interface purposes. In contrast, as discussed earlier, the inventive language is a bona fide man-machine interface module.
The conventional man-machine interface means is based on a so-called graphic user interface (GUI). A GUI is generally characterized by a process with which one "points and clicks" a graphic icon to initiate a specific data-processing task. A simple GUI allows one to execute a program by "pointing and clicking" one item at a time. A sophisticated GUI allows one to build a system of processing modules, using a graphic representation, by connecting multiple sub-modules. This process is similar to using C-Shell to link a set of processors.
The inventive human computer interface differs from a conventional GUI in three significant ways:
(1) No matter how sophisticated a conventional GUI is, it does not create additional information; in contrast, the inventive system creates information by integrating multiple sets of input sources.
(2) In the inventive programming-language system, an analyst creates a solution algorithm at the micro level, in addition to the system level; in contrast, with a conventional GUI, the analyst can manipulate only at the module level.
(3) Lastly, a conventional GUI is not designed for knowledge engineering; in contrast, the inventive system is designed primarily for knowledge engineering and knowledge capture.
The advantages of one solution system over another depend largely on the degree of difficulty of a problem. For example, if the task is simple enough, any solution algorithm can solve the problem. In this situation, one cannot detect the advantage of one system over the other. However, if a problem is extremely complex, the advantage of a solution system, if any, over its competitor will loom large. Since object extraction with image- and/or map-data is extremely complex, the advantages of the inventive system over the other systems are significant.
Using a linguistic approach to solve a spatial-analysis problem is not new. For example, Andriole (1985) showed how researchers have used natural language for applications in artificial intelligence. Additional expert-systems-based examples can be obtained from Hayes-Roth, Waterman and Lenat (1983).
Indeed, "the conceptualization of space and its reflection in language" is a critical research agenda item for NCGIA (National Center for Geographic Information and Analysis), particularly with respect to Initiative 2 (Mark et al., 1989; Mark, 1992; Egenhofer and Frank, 1990; Egenhofer, 1994).
The papers by NCGIA-associated researchers show that a large number of spatial-analysis problems can definitely be articulated by using certain English words that establish spatial relationships among objects. For example, Egenhofer (1994) has developed mathematical rules for establishing nine spatial relationships between two spatially-contiguous objects; these relationships have their counterparts in English, such as "meet", "inside", "covers", "covered by". To articulate a spatial-analysis problem by employing these spatial-relationship terms, an example follows.
One of the famous objects in the Washington, D.C., region is the Pentagon. A goal is to extract this object, using LANDSAT™ data. It is well-known that the Pentagon has a grassy region at the center of the building, called "Ground Zero". Suppose that this grass-based region is extracted by using a greenness-transformed band derived from the TM data. Denote this object as "Courtyard". Secondly, using TM's thermal band (#6) data, one can extract the Pentagon in terms of a "hot building". Therefore, one can define the Pentagon in terms of a unique, spatial relationship between the "Courtyard" and the "hot building" as follows:
Equation 7 introduces three key words and/or phrases: "is", "outside" and "within 15". Using these three key words and/or phrases, one can articulate that, in order to extract "Pentagon", one needs to associate a hot building with a grassy courtyard, that one object is inside or outside of the other, etc.
In a problem-solving setting, one usually writes a solution algorithm in pseudo-code first, and then has a programmer convert the pseudo-code into a compilable program that is written in a standard computer language, such as C or C++. The pseudo-code comprises a set of English words or phrases that specify the process by which a particular problem is to be solved. The solution algorithm is generally referred to as a rule set. For the inventive system to work, the vocabularies in the rule set must be "callable" functions. Using Equation 7 as an example, "is", "outside" and "within x" are callable functions.
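To make the idea of "callable" vocabulary concrete, the Python sketch below backs the words "outside" and "within x" with functions over pixel sets. It does not reproduce Equation 7 or the inventive language; the pixel-set representation, the distance definition, the toy geometry, and the choice to return the hot building as the defined object are all assumptions for illustration.

```python
import math
from typing import Set, Tuple

Region = Set[Tuple[int, int]]  # an extracted object as a set of (row, col) pixels


def outside(a: Region, b: Region) -> bool:
    """True if region `a` shares no pixel with region `b`."""
    return a.isdisjoint(b)


def within(a: Region, b: Region, x: float) -> bool:
    """True if some pixel of `a` lies at most `x` pixels from some pixel of `b`."""
    return any(math.dist(p, q) <= x for p in a for q in b)


def define_pentagon(hot_building: Region, courtyard: Region) -> Region:
    """Play the role of "is": name the building only if both relations hold."""
    if outside(hot_building, courtyard) and within(hot_building, courtyard, 15):
        return hot_building
    return set()


# Toy geometry: a square ring (the hot building) around a central courtyard.
courtyard = {(r, c) for r in range(4, 7) for c in range(4, 7)}
hot_building = ({(r, c) for r in range(11) for c in range(11)}
                - {(r, c) for r in range(3, 8) for c in range(3, 8)})

pentagon = define_pentagon(hot_building, courtyard)
print(len(pentagon) > 0)
```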
In many spatial-analysis scenarios (such as locating a facility that meets certain spatial constraints), one can easily write a conceptual algorithm or rule set for it. However, executing this rule set by using a computer can either be impossible or take many man-hours to code in a standard computer language. Therefore, for the inventive system to be workable, the rule set one composes must follow a certain "legal" format so that it is compilable. For example, Equation 7 above is legal in terms of the syntax and key words; however, it is not compilable.
Table VI (Equation 8) is a compilable program, because it meets all of the requirements therefor. Line 7 of Equation 8 is exactly Equation 7. Therefore, Equation 7 is equivalent to a subroutine in Equation 8.
The last requirement of the inventive approach is that a software environment that is capable of accepting Equation 8 as a computer program must exist. Otherwise, Equation 8 is merely a conceptual algorithm, instead of a [...] researchers to think integration among GIS, remote sensing and geography in two levels: (1) technological integration; and (2) conceptual integration. Dobson suggested that conceptual integration is much more difficult than technical integration.
While Dobson (1993, p. 1495) predicted that "technical integration will remain an illusive target not likely to be achieved for decades," the present invention proposes that, by using pseudo-English as a programming language, one can shorten this predicted timetable from decades to years, and make "technological integration" an integral part of "conceptual integration".
International Publication No. WO 93/22762, by William Gibbens REDMANN et al., discloses a system for tracking movement within a field of view, so that a layman can conduct the performance of a prerecorded music score by means of image processing. By analyzing change in centers of movement between the pixels of the current image and those of previous images, tempo and volume are derived.
The REDMANN reference is concerned only with pixels. The change of pixels denotes movement, and is therefore dispositive of the orchestration or baton response. REDMANN does not seek to recognize one or more objects within an image. Moreover, REDMANN requires movement of features in an image in order to perform its function. Pixels are compared with one another to discover movement of a baton. However, this simple change in pixel orientation, denoting movement of the baton, is not a sophisticated analysis of an image for purposes of image recognition.
In fact, the invention can be presented with, and can analyze, an image having no movement whatsoever: a stationary orchestral leader or a battlefield terrain, for example. In such a case, the REDMANN system would not provide any information about the image whatsoever. The inventive method, in contrast, moves across boundaries, in that images can be analyzed for stationary objects, areas (especially large regions), portions, color, texture, background, infra-red analysis, and movement. By contrast, the REDMANN system can consider only movement of pixels.
The inventive method takes diverse bits of information such as pixel information and "glues" that information onto totally alien information, such as infra-red analysis. In other words, objects such as the turret, gun barrel, or engine of a tank are determined through infra-red analysis, color, texture, pixels, shape, location within the object, etc. This is in contrast to the REDMANN system, which simply sums the coordinates of vertical and horizontal pixel movements, by analyzing vectors. In short, "summing" is not "gluing."
Moreover, the inventive process extracts both simple and complex objects, using a rule-based approach, with image- and/or map-data as inputs, as opposed to REDMANN, which does not use map-data and does not extract simple and complex objects.
In short, this invention yields a much simpler, more effective and direct human-machine-interface-based, object-recognition environment, one in which the programming language is a human-like language. In addition, the invention achieves integration between a rule-based recognition and a match-filter based recognition system, despite the fact that, until now, these methods have been treated as mutually exclusive processes. The methodology of the invention seeks to define imagery with a highly complex, high-level, three-tiered analysis, which analysis provides data that is described in simple human-type language.
In accordance with the present invention, the fundamental concept of object recognition is to employ a human-like language that is based on the vocabularies used by photointerpreters in order to write solution algorithms. The present invention is an environment that allows these pseudo-English-based programs to be compiled, after which simple, as well as complex, objects can be extracted. Image and/or map data are used as inputs. A grayscale image primitive base map is created that can be directly converted to regions of pixels in a raw scene of a large area. Such a primitive base map is created by applying a data analysis method, such as simple thresholding (based on size, shape, texture, tone, shadow, or associated features), stable structure segmentation, transforms or hyperspectral analysis.
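Of the data analysis methods named above, simple thresholding is the easiest to illustrate. The Python sketch below thresholds a small grayscale image on tone and then groups the surviving pixels into regions. The grouping step (4-connected labelling) is an assumption added here to show how a thresholded map could yield "regions of pixels"; it is not taken from the text, and the function names and threshold value are hypothetical.

```python
from collections import deque
from typing import List, Tuple


def threshold(image: List[List[int]], t: int) -> List[List[int]]:
    """Binary mask: 1 where the gray level is at least `t`, else 0."""
    return [[1 if v >= t else 0 for v in row] for row in image]


def label_regions(mask: List[List[int]]) -> Tuple[List[List[int]], int]:
    """Label 4-connected groups of 1-pixels; return (label map, region count)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or labels[r][c] != 0:
                continue
            count += 1                      # start a new region here
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:                    # breadth-first flood fill
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and labels[ny][nx] == 0):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


image = [
    [200, 210, 10, 10],
    [205, 10, 10, 180],
    [10, 10, 10, 190],
]
base_map, n_regions = label_regions(threshold(image, 128))
print(n_regions)
for row in base_map:
    print(row)
```

Each labelled region could then be handed to size, shape or tone rules, which is where a rule set written in the pseudo-English vocabulary would take over.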
Based on this human-computer interface, in which pseudo-English is a programming language, the object-recognition system comprises three major logic modules: (1) the input-data module; (2) the information-processing module, coupled with the above-noted human-computer interface (HCI) module; and (3) the output module, which has a feedback mechanism back to the main information-processing module and the input-data module. Using this invented system, one uses three strategies to extract an object: (1) if one can articulate how the object can be extracted by using the human visual system, one uses a rule-based approach to extract the object; (2) if one cannot articulate how an object can be discerned against others, one uses a match-filter approach to recognize the object; and (3) after all the objects are extracted with the first-tier processors, one uses the inventive, human-language-based programming language of this invention to create compound objects by "gluing" together the already-extracted objects.
The invention provides a mechanism for generating feature primitives from various imagery types for object extraction generalizable to a climatic zone instead of a small image frame such as 512×512 or 1024×1024 pixels. The mechanism simultaneously destroys and creates information to generate a single-band image containing spatial feature primitives for object recognition from single-band, multispectral and multi-sensor imagery. Cartographers and image analysts are thus provided with single-band imagery for extracting objects and features manually and/or automatically by using expert rule sets.
It would be advantageous to provide a means (a base map) by which terrain features and objects are readily extractable without making object extraction decisions at a pixel analysis level, a tedious, costly and error prone process.
It would be further advantageous to provide a means for extracting features and objects that is generalizable to large geographic regions covering several degrees of the surface of the earth.
It would also be advantageous to provide a one-band scene from a multi-band source, such as a three-band near infrared (NIR) data set, to allow cartographers and/or an automated system to perform feature extraction.
It would further be advantageous to divide feature and object extraction in two stages, Stage 1 being the preparation of a common image map that is generalizable to a large geographic region by using a totally automated, parallel, distributed data processing mode, Stage 2 being the actual process of feature extraction to be performed by a cartographer and/or a machine system using expert knowledge and rule sets.
It is an object of this invention to provide a method of organizing the world of objects in terms of only two categories: those for which an analyst can articulate, with his or her own vocabularies, how to extract them, and those for which an analyst is better off using a matcher to perform the object-recognition task.
It is another object of this invention to provide a method of "gluing" together already-extracted objects to form new objects by using an abstract, three-dimensional, space-based, spatial-analysis system in which one can develop a solution algorithm with human-like vocabularies, such as "touches" and "surrounded", etc.
It is yet another object of the invention to provide a method of achieving an object-recognition task without requiring tedious, error-prone, difficult-to-understand formal-programming- and operating-system protocol.
It is a further object of this invention to provide a method of extracting additional information when complex objects are extracted, based on already-extracted, single-rule-based objects.
It is yet a further object of this invention to provide a method of preserving the knowledge of "experts".
It is still another object of the invention to provide a method of utilizing an environment in which an analyst's knowledge and object-extraction concepts can be engineered into a machine-compilable program.