Visual surveillance of dynamic scenes is an active area of research in robotics and computer vision. Research efforts are primarily directed towards object detection, recognition, and tracking from a video stream. Intelligent visual surveillance has a wide spectrum of promising governmental and commercial applications. Some important applications are in the field of security and include access control, crowd control, human detection and recognition, traffic analysis, detection of suspicious behaviors, vehicular tracking, Unmanned Aerial Vehicle (UAV) operation, and detection of military targets. Many industrial applications in the automation field also exist, such as faulty product detection, quality assurance, and production line control.
Commercial surveillance systems are intended to report unusual patterns of motion of pedestrians and vehicles in outdoor environments. These semi-automatic systems are intended to assist, but not to replace, the end-user. In addition, electronics companies provide equipment suited to surveillance, such as active smart cameras and omnidirectional cameras. All of the above is evidence of the growing interest in visual surveillance, where, as in many image processing applications, there is a crucial need for high performance real-time systems. The bottleneck of these systems is primarily hardware-related and involves capability, scalability, requirements, power consumption, and the ability to interface with various video formats. In fact, memory overhead prevents many systems from achieving real-time performance, especially when general purpose processors are used. In these situations, the typical solutions are either to scale down the resolution of the video frames or to fall back on processing only smaller regions of interest within the frame, which is inadequate.
Although Digital Signal Processors (DSPs) improve on general purpose processors owing to the availability of optimized DSP libraries, DSPs still suffer from limited execution speeds and are thus insufficient for real-time applications. Field programmable gate array (FPGA) platforms, on the other hand, offer inherently parallel digital signal processing blocks, large numbers of embedded memory blocks and registers, and high speed memory and storage interfaces. These features make FPGAs an attractive platform for hardware realization of many image detection and object recognition algorithms. As a result, computationally expensive algorithms are usually implemented on an FPGA.
State-of-the-art developments in computer vision suggest that processing algorithms will make a substantial contribution to video analysis in the near future. Once commercialized, these algorithms may overcome most of the issues associated with demanding power and memory requirements. However, the challenge remains to devise, implement, and deploy automatic systems that use such algorithms to detect, track, and interpret moving objects in real time. The need for real-time applications is strongly felt worldwide by private companies and by governments seeking to fight terrorism and crime and to provide efficient management of public facilities.
Intelligent computer vision systems demand novel system architectures capable of integrating and combining computer vision algorithms into configurable, scalable, and transparent systems. Such systems inherently require high performance devices. However, many areas remain unaddressed. For example, only a single hardware implementation attempt has been reported for a Maximally Stable Extremal Regions (MSERs) detector, and that attempt had limited success. This is in spite of the fact that MSERs detectors were introduced as a research topic more than a decade ago, have been used in numerous software applications, and have been discussed in over 3,000 published papers. A major advantage of MSERs is affine invariance. In contrast, traditional scale invariant feature transform (SIFT) detectors and speeded up robust features (SURF) detectors are only scale and rotation invariant. A disadvantage of traditional MSERs extraction, however, is that the union-find algorithms used in labeling regions require two processing passes: one pass for labeling light regions and a second pass for labeling dark regions, or vice versa. This two-pass requirement is relatively inefficient.
What is needed is an architecture and a method for real-time extraction of MSERs in an efficient manner, such that union-find labeling can label both light and dark regions substantially simultaneously. In at least one embodiment, the architecture can be realized relatively easily with, for example, an FPGA, an application-specific integrated circuit (ASIC), or the like to form a System-on-Chip (SoC) that greatly increases processing speed in comparison to traditional systems that extract MSERs.
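As a rough software illustration only, and not the hardware architecture contemplated here, the following Python sketch contrasts the traditional two-pass union-find labeling with a single sweep that labels light and dark regions together. It operates on one thresholded image rather than a full MSERs threshold sweep. The names `DisjointSet`, `threshold`, `label_one_polarity`, and `label_both_polarities`, the fixed threshold, and the use of 4-connectivity are all assumptions made for the example.

```python
from typing import List

Image = List[List[int]]


class DisjointSet:
    """Union-find with path halving; each root is the smallest index in its set."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def threshold(img: Image, t: int) -> Image:
    """Binarize: 1 for light pixels (above t), 0 for dark pixels."""
    return [[1 if p > t else 0 for p in row] for row in img]


def label_one_polarity(binary: Image, value: int) -> Image:
    """Traditional style: label only regions equal to `value`; others get -1."""
    h, w = len(binary), len(binary[0])
    ds = DisjointSet(h * w)
    for y in range(h):
        for x in range(w):
            if binary[y][x] != value:
                continue
            if x > 0 and binary[y][x - 1] == value:
                ds.union(y * w + x, y * w + x - 1)
            if y > 0 and binary[y - 1][x] == value:
                ds.union(y * w + x, (y - 1) * w + x)
    return [[ds.find(y * w + x) if binary[y][x] == value else -1
             for x in range(w)] for y in range(h)]


def label_both_polarities(binary: Image) -> Image:
    """Single sweep: merge a pixel with any neighbor of the same polarity."""
    h, w = len(binary), len(binary[0])
    ds = DisjointSet(h * w)
    for y in range(h):
        for x in range(w):
            v = binary[y][x]
            if x > 0 and binary[y][x - 1] == v:
                ds.union(y * w + x, y * w + x - 1)
            if y > 0 and binary[y - 1][x] == v:
                ds.union(y * w + x, (y - 1) * w + x)
    return [[ds.find(y * w + x) for x in range(w)] for y in range(h)]


if __name__ == "__main__":
    img = [[200, 200, 10],
           [200, 10, 10],
           [10, 10, 200]]
    binary = threshold(img, 128)
    light = label_one_polarity(binary, 1)   # first pass
    dark = label_one_polarity(binary, 0)    # second pass
    both = label_both_polarities(binary)    # one sweep
    print(light, dark, both, sep="\n")
```

In this sketch the two separate passes and the single sweep produce the same region labels, but the single sweep visits each pixel once. A hardware realization would differ substantially in how it stores and merges labels.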