Digital images and video, as well as the devices that capture them, have become prevalent in modern society. Not only are digital cameras a commonly carried item, but digital imaging devices are now utilized in many new ways and are embedded within many new devices and machines. Such widespread use of digital imaging devices creates a large amount of data and ample opportunity to identify items of interest within individual images, whether still images or video frames, or across two or more images or video frames. For example, video captured by an imaging device of an autonomous driving vehicle can be utilized to track a road, obstacles, and other vehicles on the road to assist in automated operation of the vehicle. However, such image processing and flow tracking typically involves a great amount of data processing, at least because of the amount of data to be processed in each image, the high number of images to be processed (e.g., 30 or 60 frames per second), and the possibly large number of items to identify and track in and between images. To these ends, deep convolutional neural networks have achieved great success on image recognition tasks. Yet it is nontrivial to transfer state-of-the-art image recognition networks to video, because per-frame evaluation is too slow and computationally expensive.