The present invention relates generally to the field of automated image identification and, in particular, to the identification of objects depicted in one or more image frames of a segment of video. The present invention teaches methods for rapidly scrutinizing digitized image frames and classifying and cataloging objects of interest depicted in the video segment by filtering said image frames for various differentiable characteristics of said objects and extracting relevant data about said objects while ignoring other features of each image frame.
Prior art devices described in the relevant patent literature for capturing one or more objects in a scene typically include a camera device of known location or trajectory, a scene including one or more calibrated target objects, and at least one object of interest (see U.S. Pat. No. 5,699,444 to Sythonics Incorporated). Most prior art devices used for capture of video data regarding an object operate in a controlled setting, oftentimes in studios or sound stages, and are articulated along a known or preselected path (circular or linear). Thus, the information recorded by the device can be more easily interpreted and displayed given the strong correlation between the perspective of the camera and the known objects in the scene.
To capture data regarding objects present in a scene, a number of techniques have been successfully practiced. For example, U.S. Pat. No. 5,633,944 entitled "Method and Apparatus for Automatic Optical Recognition of Road Signs" issued May 27, 1997 to Guibert et al. and assigned to Automobiles Peugeot discloses a system wherein a laser beam, or other source of coherent radiation, is used to scan the roadside in an attempt to recognize the presence of signs.
Additionally, U.S. Pat. No. 5,790,691 entitled "Method and Apparatus for Robust Shape Detection Using a Hit/Miss Transform" issued Aug. 4, 1998 to Narayanswamy et al. and assigned to the Regents of the University of Colorado (Boulder, Colo.) discloses a system for detecting abnormal cells in a cervical Pap-smear. In this system a detection unit inspects a region of interest present in two dimensional input images and morphologically detects structure elements preset by a system user. By further including a thresholding feature, the shapes and/or features recorded in the input images can deviate from structuring elements and still be detected as a region of interest. This reference clearly uses extremely controlled conditions, known presence of objects of interest, and continually fine-tuned filtering techniques to achieve reasonable performance. Similarly, U.S. Pat. No. 5,627,915 entitled "Pattern Recognition System Employing Unlike Templates to Detect Objects Having Distinctive Features in a Video Field" issued May 6, 1997 to Rosser et al. and assigned to Princeton Video Image, Inc. of Princeton, N.J. discloses a method for rapidly and efficiently identifying landmarks and objects using a plurality of templates that are sequentially created and inserted into live video fields and compared to a prior template(s) in order to successively identify possible distinctive feature candidates of a live video scene and also eliminate falsely identified features. The process disclosed by Rosser et al. is repeated in order to preliminarily identify two or three landmarks of the target object; the locations of these "landmarks" are then compared to a geometric model to further verify, by process of elimination, whether the object has been correctly identified.
The methodology lends itself to laboratory verification against pre-recorded videotape to ascertain accuracy before applying said system to actual targeting of said live objects. This system also requires specific templates of real world features and does not operate on unknown video data with its inherent variability of lighting, scene composition, weather effects, and placement variation from said templates to actual conditions in the field.
Further prior art includes U.S. Pat. No. 5,465,308 entitled "Pattern Recognition System" issued Nov. 7, 1995 to Hutcheson et al. and assigned to Datron/Transoc, Inc. of Simi Valley, Calif., which discloses a method and apparatus under software control that uses a neural network to recognize two dimensional input images which are sufficiently similar to a database of previously stored two dimensional images. The images are processed and subjected to a Fourier transform (which yields a power spectrum), and then an in-class/out-of-class sort is performed. A feature vector consisting of the most discriminatory magnitude information from the power spectrum is then created and input to a neural network preferably having two hidden layers, an input dimensionality equal to the number of elements of the feature vector, and an output dimensionality equal to the number of data elements stored in the database. Unique identifier numbers are preferably stored along with the feature vector. Applying a query feature vector to the neural network results in an output vector which is subjected to statistical analysis to determine whether a threshold level of confidence exists before indicating successful identification has occurred. Where a successful identification has occurred, a unique identifier number for the identified object may be displayed to the end user to indicate the identification. However, Fourier transforms are subject to large variations in frequency such as those brought on by shading, or by other temporary or partial obscuring of objects, from things like leaves and branches from nearby trees, scratches, and bullet holes (especially if used for recognizing road signs). Moreover, commercial signage, windshields, and other reflecting surfaces (e.g., windows) all have very similar characteristics to road signs in the frequency domain.
In summary, the inventors have found that in the prior art related to the problem of accurately identifying and classifying objects appearing in videodata, almost all efforts utilize complex processing, illuminated scenes, continual tuning of a single filter and/or systematic comparison of aspects of an unknown object with a variety of shapes stored in memory. The inventors propose a system that efficiently and accurately retrieves and catalogs information distilled from vast amounts of videodata so that object classification type(s), locations, and bitmaps depicting the actual condition of the objects (when originally recorded) are available to an operator for review, comparison, or further processing to reveal even more detail about each object and relationships among objects.
The present invention thus finds utility over this variety of prior art methods and devices and solves a long-standing need in the art for a simple apparatus for quickly and accurately recognizing, classifying, and locating each of a variety of objects of interest appearing in a videostream, including determining that an object is the "same" object from a distinct image frame.
The present invention addresses an urgent need for virtually automatic processing of vast amounts of video data, which possibly depict one or more desired objects, so as to precisely recognize, accurately locate, extract desired characteristics of, and, optionally, archive bitmap images of each said recognized object. Processing such video information via computer is preferred over all other forms of data interrogation, and the inventors suggest that such processing can accurately and efficiently complete a task such as identifying and cataloguing huge numbers of objects of interest to many public works departments and utilities; namely, traffic signs, traffic lights, manholes, power poles and the like disposed in urban, suburban, residential, and commercial settings among various types of natural terrain and changing lighting conditions (i.e., the sun).
The exemplary embodiment described, enabled, and taught herein is directed to the task of building a database of road signs by type, location, orientation, and condition by processing vast amounts of video image frame data. The image frame data depict roadside scenes as recorded from a vehicle navigating said road. By utilizing differentiable characteristics, the portions of the image frame that depict a road sign are stored as highly compressed bitmapped files, each linked to a discrete data structure containing one or more of the following memory fields: sign type, relative or absolute location of each sign, reference value for the recording camera, and reference value for the original recorded frame number for the bitmap of each recognized sign. The location data is derived from at least two depictions of a single sign using techniques of triangulation, correlation, or estimation. Thus, output signal sets resulting from application of the present method to a segment of image frames can include a compendium of data about each sign and bitmap records of each sign as recorded by a camera. Records are thereby created for image-portions that possess (and exhibit) detectable unique differentiable characteristics versus the majority of other image-portions of a digitized image frame. In the exemplary sign-finding embodiment herein these differentiable characteristics are coined "sign-ness." Based on said differentiable characteristics, or sign-ness, information regarding the type, classification, condition (linked bitmap image portion) and/or location of road signs (and image-portions depicting said road signs) is rapidly extracted from image frames. Those image frames that do not contain an appreciable level of sign-ness are immediately discarded.
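As an illustration only, the record linked to each stored bitmap might be represented as follows. The class name, field names, and sample values are hypothetical and are not taken from the disclosure; the sketch merely shows one bitmap file linked to the memory fields listed above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SignRecord:
    """Hypothetical record linked to one compressed bitmap of a sign."""
    sign_type: str                            # e.g. "stop"
    location: Optional[Tuple[float, float]]   # relative or absolute position, if known
    camera_id: int                            # reference value for the recording camera
    frame_number: int                         # original recorded frame number
    bitmap_path: str                          # path to the compressed bitmap file


# Illustrative values only; the location stays empty until at least
# two depictions of the same sign allow it to be derived.
record = SignRecord("stop", None, camera_id=1, frame_number=1042,
                    bitmap_path="signs/000001.bmp")
```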
Differentiable characteristics of said objects include convexity/symmetry, lack of 3D volume, number of sides, angles formed at corners of signs, luminescence or lumina values, which represent illumination tolerant response in the L*u*v* or LCH color spaces (typically following a transforming step from a first color space like RGB); relationship of edges extracted from portions of image frames, shape, texture, and/or other differentiable characteristics of one or more objects of interest versus background objects. The differentiable characteristics are preferably tuned with respect to the recording device and actual or anticipated recording conditions, as taught more fully hereinbelow.
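The transforming step from RGB to the L*u*v* and LCH color spaces can be sketched as below. This is a minimal sketch, not the inventors' implementation: it assumes 8-bit sRGB input and a D65 white point, neither of which is stated in the disclosure, and all function names are hypothetical. It uses the standard CIE 1976 formulas, with LCH taken as the polar form of L*u*v*.

```python
import math


def _linear(c8):
    """Undo the sRGB transfer curve for one 8-bit channel (assumed encoding)."""
    c = c8 / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _xyz(r, g, b):
    """Convert 8-bit sRGB to CIE XYZ (D65 white point assumed)."""
    rl, gl, bl = _linear(r), _linear(g), _linear(b)
    return (0.4124 * rl + 0.3576 * gl + 0.1805 * bl,
            0.2126 * rl + 0.7152 * gl + 0.0722 * bl,
            0.0193 * rl + 0.1192 * gl + 0.9505 * bl)


def _uv_prime(x, y, z):
    """CIE 1976 u', v' chromaticity; (0, 0) for pure black."""
    d = x + 15.0 * y + 3.0 * z
    return (4.0 * x / d, 9.0 * y / d) if d > 0 else (0.0, 0.0)


_WHITE = _xyz(255, 255, 255)        # reference white derived from the matrix
_UN, _VN = _uv_prime(*_WHITE)


def rgb_to_luv(r, g, b):
    """Return (L*, u*, v*) for an 8-bit RGB triple."""
    x, y, z = _xyz(r, g, b)
    yr = y / _WHITE[1]
    if yr > (6.0 / 29.0) ** 3:
        lightness = 116.0 * yr ** (1.0 / 3.0) - 16.0
    else:
        lightness = (29.0 / 3.0) ** 3 * yr
    up, vp = _uv_prime(x, y, z)
    return lightness, 13.0 * lightness * (up - _UN), 13.0 * lightness * (vp - _VN)


def rgb_to_lch(r, g, b):
    """Return (L*, chroma, hue in degrees), the polar form of L*u*v*."""
    lightness, u, v = rgb_to_luv(r, g, b)
    return lightness, math.hypot(u, v), math.degrees(math.atan2(v, u)) % 360.0
```

In this sketch a bright red and a dark red share the same hue angle and differ mainly in L*, which is one way to read the "illumination tolerant" response mentioned above.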
The method and apparatus of the present invention rapidly identify, locate, and store images of objects depicted in digitized image frames based upon one or more differentiable characteristics of the objects (e.g., versus non-objects and other detected background noise). The present invention may be implemented in a single microprocessor apparatus, within a single computer having multiple processors, among several locally-networked processors (i.e., an intranet), or via a global network of processors (i.e., the internet and similar). Portions of individual image frames exhibiting an appreciable level of pre-selected differentiable characteristics of desired objects are extracted from a sequence of video data, and said portions of the individual frames (and correlating data thereto) are used to confirm that a set of several "images" in fact represent a single "object" of a class of objects. These preselected differentiable characteristic criteria are chosen from among a wide variety of detectable characteristics including color characteristics (color-pairs and color set memberships), edge characteristics, symmetry, convexity, lack of 3D volume, number and orientation of side edges, characteristic corner angles, frequency, and texture characteristics displayed by the 2-dimensional (2D) images so that said objects can be rapidly and accurately recognized. Preferably, the differentiable characteristics are chosen with regard to anticipated camera direction relative to anticipated object orientation so that needless processing overhead is avoided in attempting to extract features and characteristics likely not present in a given image frame set from a known camera orientation.
Similarly, in the event that a scanning recording device, or devices, are utilized to record objects populating a landscape, area, or other space the extraction devices can be preferably applied only to those frames that likely will exhibit appreciable levels of an extracted feature or characteristic.
In a preferred embodiment, the inventive system taught herein is applied to image frames and, unless at least one output signal from an extraction filter preselected to capture or highlight a differentiable characteristic of an object of interest exceeds a threshold value, the then-present image frame is discarded. For those image frames not discarded, an output signal set of location, type, condition, and classification of each identified sign is produced and linked to at least one bitmap image of said sign. The output signal set and bitmap record(s) are thus available for later scrutiny, evaluation, processing, and archiving. Of course, prefiltering or conditioning the image frames may increase the viability of practicing the present invention. Some examples include color calibration, color density considerations, video filtering during image capture, etc.
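The discard rule just described amounts to a logical OR over per-filter threshold tests. A minimal sketch follows; the filter names, threshold values, and sample outputs are hypothetical placeholders rather than values from the disclosure.

```python
def keep_frame(filter_outputs, thresholds):
    """Return True if any filter output exceeds its own threshold (logical OR)."""
    return any(filter_outputs.get(name, 0.0) > limit
               for name, limit in thresholds.items())


# Hypothetical filters and thresholds, for illustration only.
thresholds = {"color_pair": 0.30, "edges": 0.50}
frames = [
    {"color_pair": 0.05, "edges": 0.10},   # nothing exceeds a threshold: discard
    {"color_pair": 0.45, "edges": 0.20},   # color-pair filter fires: keep
    {"color_pair": 0.10, "edges": 0.80},   # edge filter fires: keep
]
kept = [i for i, outputs in enumerate(frames) if keep_frame(outputs, thresholds)]
```

Because `any` stops at the first filter that fires, the test for a retained frame can end early, while a discarded frame costs one comparison per filter.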
In a general embodiment of the present invention, differentiable characteristics present in just two (2) images of a given object are used to confirm that the images in fact represent a single object without any further information regarding the location, direction, or focal length of an image acquisition apparatus (e.g., digital camera) that recorded the initial at least two image frames. However, if the location of the digital camera or vehicle conveying said digital camera (and the actual size of the object to be found) are known, just a single (1) image of an object provides all the data required to recognize and locate the object.
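One way to see why a single image can suffice when the camera pose and the true object size are known is the pinhole camera relation, distance = focal length × real height / image height. The sketch below assumes an ideal pinhole camera, a focal length expressed in pixels, and a flat two-dimensional ground plane; these assumptions and all names are illustrative and are not drawn from the disclosure.

```python
import math


def range_and_bearing(focal_px, real_height_m, pixel_height, pixel_x, center_x):
    """Estimate distance (m) and bearing (rad, positive to the right of the
    optical axis) from the apparent size and horizontal position of an object."""
    distance = focal_px * real_height_m / pixel_height
    bearing = math.atan2(pixel_x - center_x, focal_px)
    return distance, bearing


def locate(camera_xy, heading_rad, distance, bearing):
    """Project the object onto the ground plane. Heading is measured
    counterclockwise from the +x axis, so a rightward bearing is subtracted."""
    angle = heading_rad - bearing
    return (camera_xy[0] + distance * math.cos(angle),
            camera_xy[1] + distance * math.sin(angle))


# Hypothetical numbers: a 0.75 m tall sign spanning 50 pixels, centered in
# the frame, seen by a camera with a 1000 pixel focal length.
dist, brg = range_and_bearing(1000.0, 0.75, 50.0, 320.0, 320.0)
position = locate((0.0, 0.0), 0.0, dist, brg)
```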
The present invention has been developed to identify traffic control, warning, and informational signs, "road signs" herein, that appear adjacent to a vehicle right-of-way, are visible from said right of way, and are not obscured by non-signs. These road signs typically follow certain rules and regulations relative to size, shape, color (and allowed color combinations), placement relative to vehicle pathways (orthogonal), and sequencing relative to other classes of road signs. While the term "road sign" is used throughout this written description of the present invention, a person of ordinary skill in the art to which the invention is directed will certainly realize applications of the present invention to other similar types of object recognition. For example, the present invention may be used to recognize, catalogue, and organize searchable data relative to signs adjacent a railroad right of way, nature trailways, recreational vehicle paths, commercial signage, utility poles, pipelines, billboards, manholes, and other objects of interest that are amenable to video capture techniques and that inherently possess differentiable characteristics relative to their local environment. Of course, the present invention may be practiced with imaging systems ranging from monochromatic visible wavelength camera/film combinations to full color spectrum visible wavelength camera/memory combinations to ultraviolet, near infrared, or infrared imaging systems, so long as basic criteria are present: object differentiability from its immediate milieu or range data.
Thus, the present invention transforms frames of digital video depicting roadside scenes using a set of filters that each operate quickly to capture a differentiable characteristic of one or more road signs of interest and that are logically combined together with OR gates, or combined algorithmically, with each output equally weighted. Frequency and spatial domain transformation, edge domain transformation (Hough space), color transformation typically from a 24 bit RGB color space to either a L*u*v* or LCH color space (using either fuzzy color set tuning or neural network tuning for objects displaying a differentiable color set), in addition to use of morphology (erosion/dilation), and a moment calculation applied to a previously segmented image frame are used to determine whether an area of interest that contains an object is actually a road sign. The aspect ratio and size of a potential object of interest (an "image" herein) can be used to confirm that an object is very likely a road sign. If none of the filters produces an output signal greater than a noise level signal, that particular image frame is immediately discarded. The inventors note that in their experience, if the recording device is operating in an urban setting with a recording vehicle operating at normal urban driving speeds and the recording device has a standard frame rate (e.g., thirty frames per second), only about twelve (12) frames per thousand (1.2%) have images, or portions of image frames, that potentially correlate to a single road sign of sufficiently detectable size. Typically only four (4) frames per thousand actually contain an object of interest, or road sign in the exemplary embodiment. Thus, a practical requirement for a successful object recognition method is the ability to rapidly cull the roughly ninety-eight percent (98%) of frames that do not assist the object recognition process.
In reality, more image frames contain some visible cue as to the presence of a sign in the image frame, but the amount of differentiable data is typically recorded by the best eight (8) or so images of each potential object of interest. The image frames are typically coded to correspond to a camera number (if multiple cameras are used) and camera location data (i.e., absolute location via GPS or inertial coordinates if INS is coupled to the camera or camera-carrying vehicle). If the location data comprises a time/position database directly related to frame number (and camera information in a multi-camera imaging system), extremely precise location information is preferably derived using triangulation of at least two of the related "images" of a confirmed object (road sign).
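Triangulation from two related images can be illustrated in two dimensions by intersecting the two sight lines. The sketch below assumes each image yields a camera position and a world-frame bearing toward the sign, measured counterclockwise from the +x axis; how those bearings are obtained is outside this example, and the function name is hypothetical.

```python
import math


def triangulate(p1, bearing1, p2, bearing2):
    """Intersect two sight lines given camera positions and world-frame
    bearings (radians). Returns None when the lines are parallel."""
    d1 = (math.cos(bearing1), math.sin(bearing1))
    d2 = (math.cos(bearing2), math.sin(bearing2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]          # cross(d1, d2)
    if abs(denom) < 1e-9:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom          # cross(p2 - p1, d2) / cross(d1, d2)
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])


# Hypothetical geometry: two camera positions 10 units apart, both sighting
# the same sign at 45 degrees and 135 degrees respectively.
sign_xy = triangulate((0.0, 0.0), math.radians(45.0),
                      (10.0, 0.0), math.radians(135.0))
```

A longer baseline between the two camera positions generally makes the intersection less sensitive to small bearing errors, which is one reason to choose well-separated frames of the same sign.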
The present invention successfully handles partially obscured signs, skewed signs, poorly illuminated signs, signs only partially present in an image frame, bent signs, and ignores all other information present in the stream of digital frame data (preferably even the posts that support the signs). One of skill in the art will quickly recognize that the exemplary system described herein with respect to traffic control road signs is readily adaptable to other similar identification of a large variety of man-made structures. For example, cataloging the location, direction the camera is facing, condition, orientation and other attributes of objects such as power poles, telephone poles, roadways, railways, and even landmarks to assist navigation of vehicles can be successfully completed by implementing the inventive method described herein upon a series of images of said objects. In a general embodiment, the present invention can quickly and accurately distill arbitrary/artificial objects disposed in natural settings and except for confirming at least one characteristic of the object (e.g., color, linear shape, aspect ratio, etc.), the invention operates successfully without benefit of pre-existing knowledge about the full shape, actual condition, or precise color of the actual object.
The present invention is best illustrated with reference to one or more preferred embodiments wherein a series of image frames (each containing a digital image of at least a portion of an object of interest) are received, at least two filters (or segmentation algorithms) are applied, and spectral data of the scene is scrutinized so that those discrete images that exceed at least one threshold of one filter during extraction processing become the subject of more focused filtering over an area defined by the periphery of the image. The periphery area of the image is found by applying common region growing and merging techniques to grow common-color areas appearing within an object. The fuzzy logic color filter screens for the color presence and may be implemented as a neural network. In either event, an image area exhibiting a peak value representative of a color set which strongly correlates to a road sign of interest is typically maintained for further processing. If and only if the color segmentation routine fails, a routine to determine the strength of the color pair output is then applied to each image frame that positively indicated presence of a color pair above the threshold noise level. Then further segmentation is done, possibly using color, edges, adaptive thresholding, color frequency signatures, or moment calculations. Preferably the image frame is segmented into an arbitrary number of rectangular elements (e.g., 32 or 64 segments). The area where the color pair was detected is preferably grown to include adjacent image segments that also exhibit an appreciable color-pair signal in equal numbered segments. This slight expansion of a search space during the moment routine does not appreciably reduce system throughput in view of the additional confirming data derived by expanding the space.
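Growing the detected area across adjacent rectangular segments can be sketched as a breadth-first flood fill over a grid of per-segment signal strengths. The grid values, the 4-connected neighborhood, and the threshold below are illustrative assumptions rather than details from the disclosure.

```python
from collections import deque


def grow_region(signal, seed, threshold):
    """Grow a region of grid segments from `seed`, adding 4-connected
    neighbors whose color-pair signal exceeds `threshold`."""
    rows, cols = len(signal), len(signal[0])
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and signal[nr][nc] > threshold):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Hypothetical 4x4 grid of color-pair signal strengths per segment.
signal = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.7, 0.0],
    [0.0, 0.6, 0.1, 0.0],
    [0.8, 0.0, 0.0, 0.0],
]
region = grow_region(signal, seed=(1, 1), threshold=0.5)
```

The strong segment at row 3, column 0 is left out because it does not touch the grown region, so an unrelated patch of similar color elsewhere in the frame does not enlarge the search space.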
Morphology techniques are then preferably used to grow and erode the area defined by the moment routine-segmented space until the grown representation either meets or fails to meet uniform criteria during the dilation and erosion of the now segmented image portion of the potential object ("image"). If the image area meets the morphological criteria, a final image periphery is calculated. Preferably this final image periphery includes less than the maximum, final grown image so that potential sources of error, such as non-uniform edges and other potentially complex pixel data, are avoided and the final grown representation of the image essentially includes only the actual colored "face" of the road sign. A second order calculation can be completed using the basic segmented moment space which determines the "texture" of the imaged area, although the inventors of the present invention typically do not routinely sample for texture.
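Erosion and dilation on a binary mask can be sketched in a few lines. The 3x3 structuring element and the treatment of pixels outside the mask as background are assumptions made for this example; the disclosure does not specify either.

```python
def _neighbors(mask, r, c):
    """Yield the 3x3 neighborhood of (r, c); outside the mask counts as 0."""
    rows, cols = len(mask), len(mask[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            nr, nc = r + dr, c + dc
            yield mask[nr][nc] if 0 <= nr < rows and 0 <= nc < cols else 0


def erode(mask):
    """Keep a pixel only if its whole 3x3 neighborhood is set."""
    return [[1 if all(_neighbors(mask, r, c)) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]


def dilate(mask):
    """Set a pixel if any pixel in its 3x3 neighborhood is set."""
    return [[1 if any(_neighbors(mask, r, c)) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]


# A 3x3 blob plus one isolated noise pixel at row 2, column 5.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
opened = dilate(erode(mask))   # erosion followed by dilation ("opening")
```

Here erosion followed by dilation removes the isolated pixel while restoring the blob to its original extent, which is the kind of cleanup that helps exclude non-uniform edge pixels from the final periphery.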
The face of the road sign can be either the colored front portion of a road sign or the typically unpainted back portion of a road sign (if not obscured by a sign mounting surface). For certain classes of road signs, the outline of the sign is all that is needed to accurately recognize the sign. One such class is the ubiquitous eight-sided stop sign. A "bounding box" is defined herein as a polygon which follows the principal axis of the object. Thus, rotation or skew of a camera or a sign, and bent signs, are not difficult to identify. The principal axis is a line through the center of mass and at least one edge having a minimum distance to all pixels of the object. In this way a bounding box will follow the outline of a sign without capturing non-sign image portions.
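A common way to obtain a principal axis of this kind is from second-order central moments: the axis passes through the center of mass at the angle that minimizes the summed squared perpendicular distance to the object's pixels. The sketch below uses that standard moment formula as an assumption; the disclosure does not state which computation the inventors use.

```python
import math


def principal_axis(pixels):
    """Return the centroid and the principal-axis angle (radians,
    counterclockwise from the +x axis) of a list of (x, y) pixels."""
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    mu20 = sum((x - cx) ** 2 for x, _ in pixels) / n
    mu02 = sum((y - cy) ** 2 for _, y in pixels) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in pixels) / n
    theta = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    return (cx, cy), theta


# A diagonal run of pixels, as from a sign edge skewed 45 degrees.
centroid, theta = principal_axis([(i, i) for i in range(5)])
```

A bounding box aligned to this angle rotates with the sign, so a skewed sign still fills its box rather than leaving background in the corners.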
Then, the aspect ratio of the finally grown image segments is calculated and compared against a threshold aspect ratio set (three are used herein, each corresponding to one or more classes of road signs), and if the value falls within preset limits, or meets other criteria such as a percentage of color (# of pixels), moments, number of corners, corner angles, etc., the image portion (road sign face) is saved in a descending ordered listing of all road signs of the same type (where the descending order corresponds to the magnitude or strength of other depictions of possible road signs). For a class of road signs where the sign appears only as a partial sign image, the inventors do not need special processing, since only three intersecting edges (extracted via a Hough space transformation), grown together if necessary, in addition to color-set data are required to recognize most every variety of road sign. The aspect ratio referred to above can be one of at least three types of bounding shape: a rectangular (or polygon) shape, an ellipse-type shape, or a shape that is mathematically related to a circularity-type shape. For less than four-sided signs the rectangular polygon shapes are used and for more than four sides the ellipse-type shapes are used.
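The aspect-ratio test and the descending ordered listing can be sketched as follows. The three ratio ranges, their names, and the strength values are hypothetical; the disclosure states only that three aspect-ratio thresholds are used and that candidates are ordered by strength.

```python
# Hypothetical height/width limits for three aspect-ratio classes.
ASPECT_LIMITS = {
    "tall": (1.25, 2.50),
    "square_like": (0.80, 1.25),
    "wide": (0.40, 0.80),
}


def classify_aspect(width, height):
    """Return the aspect class whose limits contain height/width, else None."""
    ratio = height / width
    for name, (low, high) in ASPECT_LIMITS.items():
        if low <= ratio < high:
            return name
    return None


def add_candidate(ranked, strength, label):
    """Insert a candidate and keep the list in descending order of strength."""
    ranked.append((strength, label))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


ranked = []
for strength, label in [(0.4, "frame_17"), (0.9, "frame_21"), (0.6, "frame_19")]:
    add_candidate(ranked, strength, label)
```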
The frame buffer is typically generated by a digital image capture device. However, the present invention may be practiced in a system directly coupled to a digital image capture apparatus that is recording live images, or a pre-recorded set of images, or a series of still images, or a digitized version of an original analog image sequence. Thus, the present invention may be practiced in real time, near real time, or long after initial image acquisition. If the initial image acquisition is analog, it must be first digitized prior to subjecting the image frames to analysis in accordance with the invention herein described, taught, enabled, and claimed. Also a monitor can be coupled to the processing equipment used to implement the present invention so that manual intervention and/or verification can be used to increase the accuracy of the ultimate output, a synchronized database of characteristic type(s), location(s), number(s), damaged and/or missing objects.
Thus the present invention creates at least a single output for each instance where an object of interest was identified. Further embodiments include an output comprising one or more of the following: orientation of the road sign image, location of each identified object, type of object located, entry of object data into an Intergraph GIS database, and bitmap image(s) of each said object available for human inspection (printed and/or displayed on a monitor), and/or archived, distributed, or subjected to further automatic or manual processing.
Given the case of identifying every traffic control sign in a certain jurisdiction, the present invention is applied to scrutinize standard videostream of all roadside scenes present in said jurisdiction. Most jurisdictions authorize road signs to be painted or fabricated only with specific discrete color-pairs, and in some cases color-sets (e.g., typically having between one and four colors), for use as traffic control signage. The present invention exploits this feature in an exemplary embodiment wherein these discrete color-sets form a differentiable criterion. Furthermore, in this embodiment a neural network is rapidly and efficiently trained to recognize regions in the image frames that contain these color-sets. Examples of said color sets presently useful in recognizing road signs in the U.S. include: red/white, white/black/red, green/white/blue, among several others easily cognizable by those of skill in the art.
Of course, certain characteristic colors themselves can assist the recognition of road signs from a scene. For example, a shade of yellow depicts road hazard warnings and advisories, white signs indicate speed and permitted lane change maneuver data, red signs indicate prohibited traffic activity, etc. Furthermore, since only a single font is approved for on-sign text messages in the U.S., character recognition techniques (e.g., OCR) can be applied to ensure accurate identification of traffic control signage as the objects of interest in a videostream. Therefore a neural network as taught herein, trained only on a few sets of image data including those visual characteristics of objects of interest such as color, reflectance, fluorescence, shape, and location with respect to a vehicle right of way, operates to accurately identify the scenes in an economical and rapid manner. In addition, known line extracting algorithms, line completion (or "growing") routines, and readily available morphology techniques may be used to enhance the recognition processing without adding significant additional processing overhead.
In a general application of the present invention, a conclusion may be drawn regarding whether object(s) appearing in a sequence of video data are fabricated by humans or naturally generated by other than manual processing. In this class of applications the present invention can be applied to enhance the success of search and rescue missions where personnel and vehicles (or portions of vehicles) may be randomly distributed throughout a large area of "natural materials". Likewise, the method taught in the present disclosure finds application in undersea, terrestrial, and extra-terrestrial investigations wherein certain "structured" foreign (artificial or man-made) materials present in a scene of interest might occur only very infrequently over a very large sample of videostream (or similar) data. The present invention also operates as an efficient graphic-based search engine. The task of identifying and locating specific objects in huge amounts of video data, such as searching for missile silos, tanks, or other potential threats depicted in images captured from remote sensing satellites or air vehicles, readily benefits from the automated image processing techniques taught, enabled, and disclosed herein.
A person of skill in the art will of course recognize myriad applications of the invention taught herein beyond the repetitive object identification, fabricated materials identification, and navigation examples recited above. These and other embodiments of the present invention shall be further described herein with reference to the drawings appended hereto.
The following figures are not drawn to scale and only detail a few representative embodiments of the present invention; more embodiments and equivalents of the representative embodiments depicted herein are easily ascertainable by persons of skill in the art.