In computer vision applications, annotation refers to the marking or labeling of images or video files captured from a scene, such as to denote the presence and location of one or more objects or other features within the scene. Annotating a video file typically involves placing a virtual marking, such as a box or other shape, on an image frame of the video file, thereby denoting that the image frame depicts an item, or includes pixels of significance, within the box or shape. Other methods for annotating a video file may involve applying markings or layers including alphanumeric characters, hyperlinks or other markings to specific frames of a video file, thereby enhancing the functionality or interactivity of the video file in general, or of the video frames in particular. Video files are commonly annotated for two reasons: to validate computer vision algorithms, e.g., by comparing an actual location of an item appearing in a video file to the location of the item as determined by one or more of such algorithms, and to train computer vision algorithms, e.g., by feeding an actual location of an item within an image frame to a computer vision algorithm so that the algorithm learns to recognize the item in that location in the image frame.
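As a minimal sketch of the kind of marking described above, a bounding-box annotation on a single image frame might be represented as follows. The class name, fields, and example values are illustrative assumptions, not part of the source:

```python
from dataclasses import dataclass

@dataclass
class BoxAnnotation:
    """A rectangular annotation on one frame of a video file (illustrative sketch)."""
    frame_index: int   # which image frame of the video the box belongs to
    label: str         # what the pixels within the box depict
    x: int             # left edge of the box, in pixels
    y: int             # top edge of the box, in pixels
    width: int         # box width, in pixels
    height: int        # box height, in pixels

    def contains(self, px: int, py: int) -> bool:
        """Whether a pixel coordinate falls inside the annotated region."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)

# e.g., a box denoting an item depicted on frame 120 of a video file
box = BoxAnnotation(frame_index=120, label="license plate",
                    x=40, y=60, width=180, height=90)
```

A validation workflow might then compare such a ground-truth box against the box produced by a computer vision algorithm for the same frame, while a training workflow would feed the box and its label to the algorithm as a positive example.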
Traditional manual and automatic methods for annotating video files have a number of limitations, however. First, annotating a video file manually is very time-consuming: a human must visually recognize the location of an item in an image frame and then draw an appropriately sized box or other shape around the item within that frame. Second, most automatic methods for annotating a video file are computationally expensive. Automatic video annotation typically requires feeding an algorithm a number of positive examples of an object of interest to be recognized within a video file (e.g., a specific body part, a specific commercial good, a specific license plate or other object that is to be recognized within the video file, along with the specific locations of the object within the video file), and also a number of negative examples of the object of interest (e.g., items that are visibly distinct from the specific body part, commercial good, license plate or other object, or a confirmation that a given image frame does not include the object). Such automatic methods accordingly require substantial amounts of data and processing power in order to optimize their chances of success. For example, a ten-minute video file captured at a rate of thirty frames per second includes 18,000 image frames, each of which must be specifically marked with the locations of objects of interest depicted therein, or designated as not depicting any such objects.
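The frame-count figure above follows directly from the duration and capture rate; the sketch below (function name is illustrative) simply restates the arithmetic in code:

```python
def frames_to_annotate(duration_seconds: float, frames_per_second: float) -> int:
    """Number of image frames a video file of the given duration contains."""
    return int(duration_seconds * frames_per_second)

# A ten-minute video file captured at thirty frames per second:
total = frames_to_annotate(duration_seconds=10 * 60, frames_per_second=30)
print(total)  # 18000 image frames, each requiring a marking or a negative designation
```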
Manual and automatic methods for video annotation are particularly limited in environments where a number of imaging devices (e.g., digital cameras) are aligned to capture imaging data from a single scene. In such environments, the manual labor and/or processing power required to properly annotate the video files captured by each of the devices is multiplied accordingly, as each device produces its own stream of image frames to be annotated.
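Assuming, for illustration, that each aligned device records the scene for the same duration at the same frame rate, the total annotation burden scales with the number of devices, as the sketch below shows (the function name and example values are assumptions, not from the source):

```python
def total_frames(num_cameras: int, duration_seconds: float, fps: float) -> int:
    """Total image frames produced by several aligned imaging devices
    capturing the same scene over the same period (illustrative sketch)."""
    return int(num_cameras * duration_seconds * fps)

# e.g., eight digital cameras recording a ten-minute scene at thirty frames per second
print(total_frames(num_cameras=8, duration_seconds=10 * 60, fps=30))  # 144000 frames
```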