1. Field of the Invention
Embodiments of the invention generally relate to techniques for analyzing digital images. More specifically, embodiments presented herein provide a variety of techniques for effectively and efficiently segmenting foreground and background elements in a stream of video frames captured by a camera trained on a scene.
2. Description of the Related Art
Video analytics generally refers to applications that evaluate digital image data, and a variety of approaches have been developed to programmatically evaluate a video stream. For example, some video analytics systems may be configured to detect a set of pre-defined patterns in a video stream. Many video analytics applications generate a background model to evaluate a video stream. A background model generally represents static elements of a scene within a field-of-view of a video camera. For example, consider a video camera trained on a stretch of roadway. In such a case, the background would include the roadway surface, the medians, any guard rails or other safety devices, and traffic control devices, etc., visible to the camera. The background model may include an expected (or predicted) pixel value (e.g., an RGB or grey scale value) for each pixel of the scene when the background is visible to the camera. The background model provides a predicted image of the scene in which no activity is occurring (e.g., an empty roadway). Conversely, vehicles traveling on the roadway (and any other person or thing engaging in some activity) occlude the background when visible to the camera and represent scene foreground objects.
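By way of illustration only, the relationship between a per-pixel background model and foreground segmentation may be sketched as follows. The sketch assumes grayscale frames stored as nested lists and a fixed difference threshold; the function name `segment_foreground` and the threshold value are hypothetical and are not taken from any particular video analytics system.

```python
# Illustrative sketch only: a grayscale frame is a list of rows of ints (0-255).

def segment_foreground(frame, background, threshold=25):
    """Return a mask that is True wherever a pixel differs from the
    background model's expected value by more than `threshold`
    (an assumed tolerance, chosen here for illustration)."""
    return [
        [abs(pixel - expected) > threshold
         for pixel, expected in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]

# Expected pixel values for an empty 3x4 scene (e.g., an empty roadway).
background = [
    [100, 100, 100, 100],
    [100, 100, 100, 100],
    [100, 100, 100, 100],
]

# The same scene with a bright object occluding two pixels in the middle row;
# the remaining pixels vary only slightly (e.g., camera noise).
frame = [
    [101,  99, 100, 100],
    [100, 200, 210, 100],
    [100, 100,  98, 100],
]

mask = segment_foreground(frame, background)
```

In this sketch, only the two occluded pixels are marked as foreground, while small fluctuations around the expected value remain classified as background.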
To process a live camera feed, a background model needs to support segmenting scene foreground from background at or near the frame rate of the video analytics system. That is, a video analytics system should be able to segment foreground from background for each frame (or every N frames) dynamically while processing a live video feed.
However, a variety of challenges arise in generating a background model. For example, the video channel may be noisy or include compression artifacts. In addition, the nature of the scene itself can make it difficult to generate and maintain an accurate background model. For example, ambient lighting levels can change suddenly, resulting in large groups of pixels being misclassified as depicting foreground. In these cases, it becomes difficult to classify any given pixel from frame to frame as depicting background or foreground (e.g., due to pixel color fluctuations caused by camera noise or lighting changes). A background model also needs to respond to gradual changes in scene lighting.
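One commonly used way for a background model to follow gradual lighting changes is an exponential running average of each pixel. The sketch below is offered only as an example of that general technique, not as the approach of any embodiment; the function name `update_background` and the `learning_rate` value are assumptions made for illustration.

```python
# Illustrative sketch only: exponential running average per pixel.

def update_background(background, frame, learning_rate=0.05):
    """Blend each observed pixel into the expected background value.
    `learning_rate` (an assumed parameter) controls how quickly the
    model follows changes in the scene."""
    return [
        [(1.0 - learning_rate) * expected + learning_rate * pixel
         for expected, pixel in zip(bg_row, frame_row)]
        for bg_row, frame_row in zip(background, frame)
    ]

# The scene slowly brightens from an expected value of 100 to 120.
background = [[100.0, 100.0], [100.0, 100.0]]
brighter_frame = [[120, 120], [120, 120]]

# After many frames the expected values drift toward the new lighting level.
for _ in range(100):
    background = update_background(background, brighter_frame)
```

A small learning rate tolerates camera noise but adapts slowly, so a sudden lighting change would still cause many pixels to be classified as foreground until the model catches up.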
Similarly, some elements of a scene that would preferably be categorized as background can be detected as foreground objects, e.g., a traffic light changing from green to yellow to red or an elevator door opening and closing. Such changes can result in elements of the traffic light (as captured in pixel data) being incorrectly classified as depicting scene foreground. Other examples of a dynamic background include periodic motion, such as a camera trained on a waterfall, ocean waves, or tree branches bending in a breeze. While these changes in the scene are visually apparent as changes in pixel color from frame to frame, they should not result in the pixels being classified as elements of scene foreground. Further, as objects enter the scene, they may, effectively, become part of the scene background (e.g., when a car parks in a parking spot). Because other components in a video analytics system may track each foreground object from frame to frame, such false or stale foreground objects waste processing resources and can disrupt other analytics components which rely on an accurate segmentation of scene foreground and background.
One approach to modeling such scenes is to create a complex background model which supports multiple background states per pixel. However, doing so results in a background model where processing requirements scale with the complexity of the scene. This limits the ability of a video analytics system to analyze a large number of camera feeds in parallel.
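The scaling concern may be illustrated with a simplified sketch of a multi-state model, in which each pixel stores a list of expected values and is treated as background if it matches any of them. The data layout, function names, and threshold are hypothetical and chosen only to make the cost visible.

```python
# Illustrative sketch only: each pixel keeps a list of background states.

def matches_background(pixel, states, threshold=25):
    """True if the pixel is within `threshold` of any stored state."""
    return any(abs(pixel - state) <= threshold for state in states)

def segment_multi_state(frame, state_model, threshold=25):
    """Return a mask that is True where a pixel matches none of its states."""
    return [
        [not matches_background(pixel, states, threshold)
         for pixel, states in zip(frame_row, state_row)]
        for frame_row, state_row in zip(frame, state_model)
    ]

# One row of three pixels. The middle pixel images a traffic light lamp
# that is either dark (30) or lit (220); both are background states.
state_model = [[[100], [30, 220], [100]]]

lamp_lit = [[100, 220, 100]]   # lamp on: still background
vehicle = [[100, 140, 100]]    # an object occluding the lamp: foreground

lit_mask = segment_multi_state(lamp_lit, state_model)
vehicle_mask = segment_multi_state(vehicle, state_model)

# Worst-case comparisons per frame grow with the number of stored states.
comparisons_per_frame = sum(
    len(states) for state_row in state_model for states in state_row
)
```

In this sketch the worst-case work per frame equals the total number of stored states, so a scene with many dynamic regions costs more to process than a static one, which is the scaling behavior described above.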