1. Field of the Invention
The invention relates to capture and recording of images for analysis, storage and other purposes, and more particularly to directed image video capture and digital video recordation.
2. Background Art
Directed image video capture is a useful technology for identification of subjects which are targets of interest in captured images. Targeted image capture and analysis are particularly important for forensic, security or information storage purposes.
The term “video” is herein used in its broad sense. Thus, video relates to electronically captured picture information, as in a television system or monitor system, and may use equipment for the reception, recording or playback of a television or television-like picture. A video system may provide large-capacity video storage for on-demand retrieval, transmission, and other uses, and is generally capable of storage in digital media, such as disk drives or memory arrays. The term video thus emphasizes visual rather than audio aspects of the television or television-like medium. It may involve an animated series of images or a sequential series of still images, not necessarily animated. Video relates in any event to the technology of capturing and processing electronic signals representing pictures, by which a series of framed images are put together, one after another, whether smoothly or not, such as to show activity and/or motion or the presence or absence of objects, persons or areas of interest in an image field. Video can be captured as individual or sequential frames, and by digital technologies or by analog schemes such as, without limiting the generality of the foregoing, the well-known NTSC protocol involving interlaced frames. Thus, the present inventor, in employing the term video, refers in a general sense to any method or technology using video, video devices, video or electronic cameras, or television technology to produce an image. Video may involve slow-scanning or high-speed scanning of frames, wherein images are typically captured by CCD (charge-coupled device) sensors.
The present invention relates to capturing, analyzing and storing of images using video techniques, and relates most generally to visual information in an integrated system, such as a security system employing one or more video cameras, including cameras that capture images by electronic analog or digital means, whether continuously or intermittently, and including those using techniques involving digitization of images whether captured by either analog or digital electronic sensors.
The present invention, which takes an approach different from the known art, is particularly useful as an improvement of the system and methodology disclosed in a patent owned by the present applicant's assignee/intended assignee, entitled “System for Automated Screening of Security Cameras” (U.S. Pat. No. 6,940,998, issued Sep. 6, 2005), and herein incorporated by reference. The system disclosed in U.S. Pat. No. 6,940,998 is hereinafter referred to as the PERCEPTRAK system. The term PERCEPTRAK is a registered trademark (Regis. No. 2,863,225) of Cernium, Inc., applicant's assignee/intended assignee, to identify video surveillance security systems, comprised of computers; video processing equipment, namely a series of video cameras, a computer, and computer operating software; computer monitors and a centralized command center, comprised of a monitor, computer and a control panel.
The present invention also takes advantage of, and is particularly useful as an improvement of, the system and methodology disclosed in a copending allowed patent application owned by the present inventor's assignee/intended assignee, namely U.S. application Ser. No. 10/041,402, filed Jan. 8, 2002, entitled “Object Selective Video Recording”, hereinafter referred to as the OSR patent, and herein incorporated by reference. The system disclosed in the OSR patent is referred to as the OSR system. The OSR system is an object selective video analysis and recordation system in which one or more video cameras provide video output to be recorded in a useful form on recording media with reduction of the amount of the recording media, with preservation of intelligence content of the output. Spatial resolution and temporal resolution of objects in the scene are automatically varied in accordance with preset criteria based on predetermined interest in the object attributes while recording the background video and object video. A user of the OSR system may query recorded video images by specified symbolic content, enabling recall of recorded data according to such content. The term OSR is a trademark of Cernium, Inc., applicant's assignee/intended assignee, to identify an object selective video analysis and recordation system, namely as comprised of computers; provision for receiving the video output of video cameras, one or more computers, and computer operating software, computer monitors and a centralized command center in which one or more such video cameras provide output video to be recorded in a useful form on recording media with reduction of the amount of the recording media, yet with preservation of the content of such images.
OSR is a distributed recording system that does not require a command center as used in the Perceptrak system. Where used, an OSR command center may be comprised of a monitor, computer and a control panel, in which one or more video cameras provide output video to be recorded in a useful form on recording media with reduction of the amount of the recording media, yet with preservation of the content of such images.
There are various methods of video data analysis. An example method of real-time video analysis of video data is performed in the Perceptrak system. During the analysis, a single pass of a video frame produces a terrain map which contains elements, termed primitives, that are low-level features of the video. Based on the primitives of the terrain map, the system is able to make decisions about which camera an operator should view based on the presence and activity of vehicles and pedestrians and, furthermore, to discriminate vehicle traffic from pedestrian traffic. The Perceptrak system was implemented to enable automatic decisions to be made about which camera view should be displayed on a display monitor of the CCTV system, and thus watched by supervisory personnel, and which video camera views are ignored, all based on processor-implemented interpretation of the content of the video available from each of at least a group of video cameras within the CCTV system.
The Perceptrak system uses video analysis techniques which allow the system to make decisions automatically about which camera an operator should view based on the presence and activity of vehicles and pedestrians.
An existing implementation of the above-identified Perceptrak system relies on a fixed camera to maintain an adaptive background. The camera must stay fixed in order to segment targets by comparison with the background. Targets can be segmented, and tracked, as small as 10 pixels high by 10 pixels wide. In a 640×480 frame, that is 100/(640×480), or about 0.03 percent, of the scene; the target occupies the same small fraction of the scene regardless of analysis resolution. However, with so few pixels on the target, only the path of the object can be recorded. More pixels are required on the target for proper identification.
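The fraction-of-scene arithmetic above can be verified with a short calculation (a sketch only; the function name is illustrative, and the frame and target sizes are those cited in the text):

```python
# Fraction of a video frame occupied by a target of a given pixel size.
def target_fraction(target_w, target_h, frame_w, frame_h):
    return (target_w * target_h) / (frame_w * frame_h)

# A 10x10-pixel target in a 640x480 frame:
frac = target_fraction(10, 10, 640, 480)
print(f"{frac:.2%}")  # prints 0.03%
```

The fraction is dimensionless, which is why it is unchanged whether the analysis is run at high or low resolution: the target's pixel count scales with the frame's.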
All existing recording systems share the same limitation as to the number of pixels on the target required for forensic recognition. A widely reported recent crime was the reported kidnapping and killing of victim Carlie Jane Brucia. Even where the subjects are near the camera, as was the case during the abduction of Ms. Brucia, in which a digital camera captured what was considered a good image, persons in the captured image could not be positively recognized from the image.
FIG. 1 is a reproduction of what is believed to be a true image 100 captured during that abduction.
The original image 100 of the abduction in FIG. 1 is 250 pixels wide by 140 high. It shows the abductor 110 and the victim 105 walking on a light-colored pavement 120. The area of the abductor's face 115 in FIG. 1 has only 195 pixels (13×15).
FIG. 2 is a reproduction of what is believed to be a true image 115 of the face of the abductor captured during that abduction.
Even digitally enlarged as in FIG. 2, the face of the abductor cannot be recognized. The victim's abduction took place so close to the camera that the abductor's face occupied one half of one percent of the scene, and even so, the image could not be used for positive recognition.
A “mug shot” 300 of the alleged abductor, reported to be one Joseph Smith, is seen in FIG. 3, and is 115,710 pixels in size even after cropping to just the area of the face. This number of pixels is almost half of a 640×480 image. It is simply not possible to have enough 640×480 cameras monitoring any real scene to obtain an image of the same forensic value as FIG. 3.
Using fixed cameras, the only way to get more pixels on a target is to use more cameras or cameras with more pixels. The next generation of cameras could have HDTV (high definition television) resolution having 1,080 scan lines×1,920 pixels/line=2,073,600 pixels. If the camera at the car wash where FIG. 1 was captured had HDTV resolution, the face of the abductor would still have occupied only one half of one percent of the scene.
However, using a two megapixel sensor for video capture, that one half of one percent of scene would be 10,368 pixels.
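The HDTV figures quoted above follow directly from the frame geometry (a sketch; the variable names are illustrative):

```python
# HDTV frame: 1,080 scan lines x 1,920 pixels per line.
hdtv_pixels = 1080 * 1920      # total pixels per HDTV frame
face_fraction = 0.005          # face occupies one half of one percent of the scene
face_pixels = hdtv_pixels * face_fraction

print(hdtv_pixels, int(face_pixels))  # prints 2073600 10368
```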
FIG. 4 is the same “mug shot” image as image 300 in FIG. 3, but digitally reduced to an image 400 containing 10,179 pixels. It is noticeably less detailed than the 115,710-pixel image of FIG. 3, yet the image in FIG. 4 remains useful for forensic purposes. If the car wash camera where the video of the scene was taken had been of HDTV resolution, that image might have been usable to identify the abductor.
Yet, even an HDTV resolution camera cannot get enough pixels, for forensic purposes, of a person beyond the near field of FIG. 1.
FIG. 5 illustrates the difficulty. FIG. 5 is a crude depiction 500 of the abduction moved back by image-handling technique to the corner of the light colored pavement 120. At that location, when scaled to the pavement stripes, the people 110 and 105 in the scene are 40% as high due to the wide-angle lens used. At 40% of the height, the face of the abductor would have 0.4*0.4*10,368 pixels, that is, 1660 pixels.
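Because image area scales with the square of linear size, moving the subject back so that the people appear 40% as tall reduces the face's pixel count by a factor of 0.4 × 0.4. A sketch of that arithmetic (variable names are illustrative):

```python
# Pixels on the face after the subject is moved back so people appear
# 40% as tall; area scales with the square of the linear scale factor.
near_face_pixels = 10368   # face pixels at the near-field HDTV position
scale = 0.4                # linear scale factor at the far corner of the pavement
far_face_pixels = scale * scale * near_face_pixels

print(round(far_face_pixels))  # prints 1659 (about 1,660, as in the text)
```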
FIG. 6 is the abductor's “mug shot” digitally reduced to an image 600 containing 1764 pixels. The image 600 of FIG. 6 is approaching the limits of usefulness of a HDTV resolution camera. For the car wash where Ms. Brucia was abducted, the camera could only have covered the area of the light colored pavement.
While the camera in FIG. 1 covered a visual sector on only one side of a car wash, it is normal to have even longer views on surveillance cameras covering even larger areas. It should be evident that, if a HDTV resolution camera could not even cover all of one side of a car wash with forensic quality images, then a larger parking lot would require a different conceptual design of even greater capacity.
FIG. 7 is a wide view image of a scene 700 in a parking lot. The crosshairs 750 at the near end of the parking lot mark the center of the image that is 240 feet from the camera. The far end of the parking lot is 380 feet from the camera.
FIG. 8 is a zoom or area enlargement to the far end of that parking lot, providing an image of a person 810 standing 380 feet from the camera in the same scene 700 as FIG. 7 but with the camera zoomed in to 160 mm. Crosshairs 850 point to the center of the image. FIG. 8 has 307,200 (640×480) pixels on area 800 that has only 16×12 pixels (192 pixels) in FIG. 7 amounting to 0.0625 percent of the area shown in FIG. 7.
That degree of zoom or enlargement provides 1600 [calculated as 307200/192] times as many pixels on a target as in FIG. 7.
FIG. 9 is a digital enlargement of an area 815 shown in FIG. 8, 69 pixels wide, of the face of the person 810. In FIG. 8, the face is 69 pixels wide by 89 pixels high (6,141 pixels). To get the same 6,141 pixels on the face with the zoom factor of FIG. 7 would require 1600 times as many pixels in a frame, being (1600×640×480) or approximately 491,000,000 pixels. For a 24-bit color image that would amount to 1.47 gigabytes per frame. Recording at 30 FPS at that resolution and uncompressed would require one terabyte of storage every 22.6 seconds. Or, with 1000:1 compression, one terabyte would store 6.29 hours of video. Clearly, a brute-force higher-resolution sensor will overwhelm current storage devices.
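The storage arithmetic above can be reproduced step by step (a sketch; variable names are illustrative, and a terabyte is taken as 10^12 bytes):

```python
# Storage needed to put 6,141 pixels on the distant face without optical zoom.
zoom_gain = 307200 // 192                   # 1600x more pixels needed per frame
pixels_per_frame = zoom_gain * 640 * 480    # about 491.5 million pixels
bytes_per_frame = pixels_per_frame * 3      # 24-bit color: 3 bytes per pixel
fps = 30
terabyte = 1e12

seconds_per_tb = terabyte / (bytes_per_frame * fps)       # uncompressed
hours_per_tb_compressed = seconds_per_tb * 1000 / 3600    # at 1000:1 compression

# prints 22.6 6.28 (the text rounds the latter to 6.29 hours)
print(round(seconds_per_tb, 1), round(hours_per_tb_compressed, 2))
```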
Digital Video Recorders (DVRs) currently in use have no “knowledge” of the content of the video that is being recorded. The current state of the art is the use of motion detection to enable recording, so that storage space (usually disk drives) is not wasted when there is no motion in the camera scene. Motion detection saves storage space in a quiet scene but, when recording, still consumes most of the storage recording background information. This is because, without knowledge of the content of the scene, a DVR must record the entire scene in order to record the part of the scene that is of interest.
FIG. 10 is a video image 1010 that was taken during the well-known abduction of Ms. Brucia. Note that the DVR used the same resolution for the cracks in the pavement shown in the area 1030 as for the faces of the abductor and his victim. Detail of the cracks is enlarged in image 1035. The abductor's face, enclosed in the area 115, is enlarged in image 1025. This serves to emphasize that DVR recordation of an entire scene captured by a camera view necessarily records much useless information, such as the pixel content of extraneous image features like pavement cracks and other insignificant background features.
For efficient recordation by DVR of only those portions of video or other digital image content in such scenes as will serve useful forensic purposes, such as aiding in identification of targets (subjects) of interest, it is desired to store only such portions of the scene (“areas of interest” or “regions of interest”) as contain a target of interest, and at pixel resolution sufficient to be useful for such purposes.
It has been proposed, as in the OSR system discussed hereinabove, to provide an object selective video analysis and recordation system in which one or more video cameras provide output video to be recorded in a useful form on recording media with reduction of the amount of the recording media, yet with preservation of the content of such images. Thus, background scenes can be recorded only periodically or at least less frequently, and at a lower pixel resolution, enabling great savings of digital recording media, while allowing recordation of targets of interest more frequently and at higher resolution.
It would be desirable to provide a capability of combining recordation of images of overall or background scenes with recordation of a still higher resolution snapshot (still image) of an area of interest which includes a target of interest which has been electronically detected to exist within the overall or background scene. It would also be desirable to record, with such images, data to link the snapshots to the time and location in the overall scene from which the snapshot was acquired.
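One way to picture such linking data is a per-snapshot record carrying the capture time and the target's coordinates in the fixed-camera view. The following is a minimal, hypothetical sketch only; all field names and values are illustrative assumptions, not a schema disclosed in any of the referenced systems:

```python
from dataclasses import dataclass

# Hypothetical record linking a high-resolution snapshot to the time and
# the location in the overall (fixed-camera) scene from which it was taken.
@dataclass
class SnapshotRecord:
    timestamp: float   # capture time, e.g. seconds since epoch
    scene_x: int       # upper-left corner of the target in the wide view
    scene_y: int
    scene_w: int       # extent of the target region in the wide view
    scene_h: int
    image_path: str    # where the high-resolution snapshot is stored

# Example: the 16x12-pixel target region of FIG. 7, linked to its snapshot.
rec = SnapshotRecord(timestamp=1096.5, scene_x=320, scene_y=180,
                     scene_w=16, scene_h=12, image_path="snap_0001.jpg")
print(rec.scene_w * rec.scene_h)  # prints 192, the target's pixels in the wide view
```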
Existing State of the Art
The emerging field of intelligent video systems has enabled the long-sought feature of an “Automatic Tracking Camera.” For example, the Perceptrak system available from Cernium Inc. includes this feature. Prior to the creation of intelligent video systems with object tracking, all PTZ cameras either were on preset tours or were manually controlled by operators. In addition, technology is known that is said to monitor an area on a preset tour and, once motion is detected, to lock a camera automatically onto a subject; however, such image lock-on schemes are not here relevant.
Among commercially offered or available products, there is indication of need to automatically control pan-tilt-zoom (PTZ) cameras. Even so, such known or proposed products demonstrate that the data gathered by such automatic PTZ control of video cameras is still considered by the industry to consist of a video stream. Such commercial products claim to produce video by tracking motion. Yet, none combines the recording of a video camera with images from an automatically controlled PTZ camera and links the location of the acquired images with a fixed camera view.
So also, among U.S. patents, the following patents relate to the field of automatically controlled PTZ cameras: U.S. Pat. Nos. 6,771,306 (Trajkovic et al.); 6,628,887 (Rhodes et al.); 6,437,819 (Loveland); and 5,434,617 (Bianchi). Each such reference discloses treatment of the data from the tracking camera as video, and none addresses linking the acquired images from the tracking camera to the overall view of a fixed camera.
So far as has been determined, the state of the art is that available automated video tracking systems and proposals for them have employed a tracking camera, such as a PTZ camera, wherein data acquired from the tracking camera is treated as video but the tracking images are not linked to the overall view of a fixed camera.
There remains a continuing need for automatically tracking targets to provide higher resolution of a target of interest.