Visual surveillance of dynamic scenes is an active area of research in robotics and computer vision. The research efforts are primarily directed towards object detection, recognition, and tracking from a video stream. Intelligent visual surveillance has a wide spectrum of promising government and commercially oriented applications. Some important applications are in the field of security and include access control, crowd control, human detection and recognition, traffic analysis, detection of suspicious behaviors, vehicular tracking, Unmanned Aerial Vehicle (UAV) operation, and detection of military targets. Many other industrial applications also exist in the automation fields, such as faulty product detection, quality assurance, and production line control.
Commercial surveillance systems are intended to report unusual patterns of motion of pedestrians and vehicles in outdoor environments. These semi-automatic systems are further intended to assist, but not replace, the end-user. In addition, electronics companies provide suitable equipment for surveillance. Examples of such equipment include active smart cameras and omnidirectional cameras. All of the above provide evidence of the growing interest in visual surveillance, where, as in many image processing applications, there is a crucial need for high performance real-time systems. The bottlenecks of these systems are primarily hardware-related and include capability, scalability, requirements, power consumption, and the ability to interface with various video formats. In fact, memory overhead prevents many systems from achieving real-time performance, especially when general purpose processors are used. In these situations, the typical solutions are either to scale down the resolution of the video frames or to process only smaller regions of interest within the frame, neither of which is adequate.
Although Digital Signal Processors (DSPs) provide improvement over general purpose processors due to the availability of optimized DSP libraries, DSPs still suffer from limited execution speeds. Thus, DSPs are insufficient for real-time applications. Field programmable gate array (FPGA) platforms, on the other hand, offer an attractive solution for the hardware realization of many image detection and object recognition algorithms, owing to their inherently parallel digital signal processing blocks, large amounts of embedded memory and registers, high speed memory, and storage interfaces. As a result, computationally expensive algorithms are usually implemented on an FPGA.
State-of-the-art developments in computer vision confirm that processing algorithms will make a substantial contribution to video analysis in the near future. Once commercialized, the processing algorithms may overcome most of the issues associated with high power and memory demands. However, the challenge remains to devise, implement, and deploy automatic systems that use such algorithms to detect, track, and interpret moving objects in real-time. The need for real-time applications is strongly felt worldwide by private companies and governments seeking to fight terrorism and crime and to provide efficient management of public facilities.
Intelligent computer vision systems demand novel system architectures capable of integrating and combining computer vision algorithms into configurable, scalable, and transparent systems. Such systems inherently require high performance devices. However, many areas remain unaddressed. For example, only a single hardware implementation attempt has been reported for a Maximally Stable Extremal Regions (MSERs) detector, and that attempt met with limited success. This is despite the fact that MSERs detectors were introduced as a research topic more than a decade ago, have been used in numerous software applications, and have been discussed in over 3,000 published papers. The major advantage of MSERs is affine invariance. Traditional scale invariant feature transform (SIFT) detectors and speeded up robust features (SURF) detectors are only scale and rotation invariant.
Moreover, classical MSER and SIFT algorithms tend to be far more computationally complicated than a linear-time MSERs algorithm. For example, one of the preprocessing steps for SIFT detection is the construction of the scale space using a pyramid of Gaussians. In this step, multiple scaled versions of the input frame are stored to be used later for the SIFT detection. This requires additional memory space as compared to storing a single version of the input frame to be processed directly via the linear-time MSERs algorithm. Additionally, each of these scaled versions of the input frame is filtered (convolved) with a smoothing filter, as prescribed by the SIFT inventor, which means that extra processing (additions, multiplications, and memory read/write accesses) is required, and hence more power is consumed. In the case of linear-time MSERs, these extra processing steps are not necessary.
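The memory and processing comparison above can be illustrated with a rough back-of-the-envelope sketch. Every parameter below is an assumption chosen for illustration only and does not come from any particular SIFT implementation: four octaves, five smoothed images per octave, 8-bit pixels, and a separable 7-tap smoothing kernel. The function names are likewise hypothetical.

```python
def frame_bytes(width, height, bytes_per_pixel=1):
    """Memory needed to store one frame (assumed 8-bit grayscale by default)."""
    return width * height * bytes_per_pixel


def pyramid_bytes(width, height, octaves=4, scales_per_octave=5,
                  bytes_per_pixel=1):
    """Memory needed to store every smoothed image of a Gaussian pyramid.

    Assumption: each octave halves the width and height of the previous one
    and holds `scales_per_octave` smoothed copies of the frame.
    """
    total = 0
    w, h = width, height
    for _ in range(octaves):
        total += scales_per_octave * frame_bytes(w, h, bytes_per_pixel)
        w, h = max(1, w // 2), max(1, h // 2)
    return total


def pyramid_smoothing_macs(width, height, octaves=4, scales_per_octave=5,
                           kernel_taps=7):
    """Multiply-accumulate count for smoothing every image in the pyramid.

    Assumption: a separable filter, i.e. one horizontal and one vertical
    pass of `kernel_taps` multiply-accumulates per pixel.
    """
    pixels = pyramid_bytes(width, height, octaves, scales_per_octave,
                           bytes_per_pixel=1)
    return 2 * kernel_taps * pixels


if __name__ == "__main__":
    single = frame_bytes(640, 480)
    pyramid = pyramid_bytes(640, 480)
    print("single frame bytes:", single)
    print("pyramid bytes:", pyramid)
    print("storage ratio:", pyramid / single)
    print("smoothing MACs:", pyramid_smoothing_macs(640, 480))
```

Under these assumed parameters, a 640x480 frame needs roughly 6.6 times more storage for the pyramid than for the single frame that a linear-time MSERs detector processes directly, in addition to the smoothing arithmetic. Actual figures depend on the number of octaves, the scales per octave, and the pixel format used.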
What is needed is a hardware architecture for linear-time extraction of MSERs. The architecture can be easily realized with, e.g., an FPGA, an application specific integrated circuit (ASIC), or the like.