Today, machine vision is mostly based on conventional cameras and their associated frame-based image sensors. For some machine vision tasks, e.g., object recognition, these conventional frame-based cameras are well suited. However, for other tasks, e.g., tracking or position and motion estimation, the conventional image sensors have drawbacks.
The main drawback is that conventional cameras produce a significant amount of redundant and unnecessary data, which has to be captured, communicated and processed. This high data load slows down the reaction time by decreasing temporal resolution, results in increased power consumption, and increases the size and cost of machine vision systems. In addition, most image sensors suffer from limited dynamic range, poor low-light performance and motion blur.
These drawbacks arise from the fact that the data is captured as a sequence of still images (frames). Encoding dynamic scenes as still images is useful for producing beautiful images and movies, which matters little for many machine vision uses, but it is not optimal for data processing.
Conventional computer vision systems using conventional cameras typically compare features between sequential image frames for object recognition. To estimate the position and orientation of a mobile system and to infer a three-dimensional map of the surrounding world, two sequential images, which partially overlap but are taken at different times and from different poses, are compared. To infer the motion that occurred between the two frames, characteristic visual landmarks (key points or other visual features) have to be matched across the two images. Finding these pairs of points that correspond to each other in both images is known as solving the “correspondence problem.”
Solving the correspondence problem requires a significant amount of processing power. To detect landmarks, every pixel in an image may have to be searched for characteristic features (corners, blobs, edges, etc.). The pixels and their surrounding neighborhoods are then grouped to form so-called feature descriptors, which are used to match features between the frames and thereby establish pairs of corresponding points. This is computationally intensive. Direct approaches that compare pixel intensities directly are computationally even more complex.
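The cost of descriptor matching can be illustrated with a minimal sketch. It assumes descriptors are plain lists of numbers and uses brute-force nearest-neighbor matching with a sum of squared differences; the function name `match_descriptors` and the sample values are hypothetical and chosen only for illustration, not taken from any particular system.

```python
def match_descriptors(desc_a, desc_b):
    """Brute-force nearest-neighbor matching between two descriptor lists.

    Returns (matches, comparisons), where matches is a list of
    (index_in_a, index_in_b) pairs and comparisons counts how many
    descriptor-to-descriptor distances were evaluated.
    """
    matches = []
    comparisons = 0
    for i, da in enumerate(desc_a):
        best_j, best_dist = None, float("inf")
        for j, db in enumerate(desc_b):
            comparisons += 1
            # Sum of squared differences as a simple (assumed) distance.
            dist = sum((x - y) ** 2 for x, y in zip(da, db))
            if dist < best_dist:
                best_j, best_dist = j, dist
        matches.append((i, best_j))
    return matches, comparisons


# Hypothetical 3-element descriptors from two overlapping frames.
frame1 = [[0.0, 1.0, 2.0], [5.0, 5.0, 5.0], [9.0, 0.0, 1.0]]
frame2 = [[5.1, 4.9, 5.0], [8.8, 0.2, 1.1], [0.1, 1.0, 2.1]]

matches, comparisons = match_descriptors(frame1, frame2)
print(matches)      # pairs of corresponding landmarks
print(comparisons)  # grows with len(frame1) * len(frame2)
```

Even in this toy case every descriptor in one frame is compared against every descriptor in the other, so the work grows with the product of the two landmark counts, on top of the per-pixel search needed to find the landmarks in the first place.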
On the other hand, the so-called Dynamic Vision Sensor (DVS), or event-based change detection sensor, is a sensor that overcomes the limitations of frame-based encoding. By using in-pixel data compression, the data redundancy is removed, and high temporal resolution, low latency, low power consumption and high dynamic range with little motion blur are achieved. The DVS is thus especially well suited for solar or battery powered compressive sensing, or for mobile machine vision applications where the position of the system has to be estimated and where processing power is limited due to limited battery capacity.
The DVS pre-processes visual information locally. Instead of generating crisp images, the DVS produces smart data for computer applications. While conventional image sensors capture a movie as a series of still images, the DVS detects changes in a scene and transmits only the positions of those changes. It thus encodes the visual information much more efficiently than conventional cameras because it performs in-pixel data compression. Specifically, rather than encoding frames as image data, the DVS detects changes and encodes those changes as change events. This means that the data can be processed using fewer resources, less net power and with faster system reaction time. The high temporal resolution allows visual features to be tracked continuously, thereby overcoming the correspondence problem. Additionally, the architecture of the DVS allows for high dynamic range and good low-light performance.
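The idea of encoding changes as change events can be sketched in software. The following is only an illustrative simulation under stated assumptions: it derives events from a sequence of sampled frames (a real DVS pixel operates asynchronously and never produces frames), and it assumes a pixel emits an event when its log intensity has changed by at least a hypothetical `threshold` since that pixel's last event. The function name `frames_to_events` and the event layout `(x, y, t, polarity)` are likewise assumptions made for the example.

```python
import math


def frames_to_events(frames, timestamps, threshold=0.2):
    """Convert sampled intensity frames into a list of change events.

    Each event is (x, y, t, polarity), with polarity +1 for brighter
    and -1 for darker. The log-intensity model and the threshold value
    are assumptions for illustration only.
    """
    height, width = len(frames[0]), len(frames[0][0])
    # Per-pixel reference level, stored when the pixel last fired.
    ref = [[math.log(frames[0][y][x] + 1.0) for x in range(width)]
           for y in range(height)]
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        for y in range(height):
            for x in range(width):
                level = math.log(frame[y][x] + 1.0)
                diff = level - ref[y][x]
                if abs(diff) >= threshold:
                    events.append((x, y, t, 1 if diff > 0 else -1))
                    ref[y][x] = level  # reset the reference after firing
    return events


# Hypothetical 2x2 scene: one pixel brightens, stays, then darkens again.
frames = [
    [[10, 10], [10, 10]],
    [[10, 40], [10, 10]],
    [[10, 40], [10, 10]],
    [[10, 10], [10, 10]],
]
timestamps = [0, 1, 2, 3]

events = frames_to_events(frames, timestamps)
print(events)
```

In this toy scene the three frames after the first contain twelve pixel samples, yet only two events are produced, one when the pixel brightens and one when it darkens. Static pixels generate no data at all, which is the redundancy reduction described above.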