In computer vision applications, annotation (or labeling) is commonly known as the marking or designation of images or video files captured from a scene, such as to denote the presence and location of one or more objects or other features within the scene in the images or video files. Annotating an image or a video file typically involves placing a virtual marking such as a box or other shape on an image or one or more frames of a video file, thereby denoting that the image or the frame depicts an item, or includes pixels of significance, within the box or shape. Other methods for annotating an image or a video file may involve applying markings or layers including alphanumeric characters, hyperlinks or other markings on specific images or frames of a video file, thereby enhancing the functionality or interactivity of the image or the video file in general, or of the images or video frames in particular. Locations of the pixels of interest may be stored in association with an image or a video file, e.g., in a record maintained separately from the image or the video file, or in metadata of the image or the video file.
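As an illustration of the storage options described above, the following minimal sketch records a box annotation in a record kept separately from the image or video file. The class name `BoxAnnotation`, the field names, the function `to_sidecar_record` and the file name `scene.mp4` are all hypothetical and chosen only for this example; they do not correspond to any particular annotation format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BoxAnnotation:
    """One rectangular marking on an image or on a single video frame (hypothetical schema)."""
    frame_index: int  # 0 for a still image
    label: str        # what the box is said to depict
    x: int            # left edge of the box, in pixels
    y: int            # top edge of the box, in pixels
    width: int        # box width, in pixels
    height: int       # box height, in pixels


def to_sidecar_record(media_name, annotations):
    """Serialize annotations to a JSON string stored apart from the media file."""
    return json.dumps(
        {"media": media_name, "annotations": [asdict(a) for a in annotations]},
        indent=2,
    )


# Hypothetical example: one box on the first frame of a video file.
record = to_sidecar_record("scene.mp4", [BoxAnnotation(0, "item", 120, 80, 64, 48)])
```

The same dictionary could instead be written into the metadata of the image or video file; the separate record is shown here only because it needs no container-specific tooling.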
Two common reasons for annotating images or video files are to train computer vision algorithms and to validate them. In training, an actual location of an item within an image or a video file is fed to a computer vision algorithm so that the algorithm learns to recognize that the item is in that location within the image or video file. In validation, an actual location of an item appearing in an image or a video file is compared to a location of the item as determined by one or more of such algorithms.
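The validation comparison described above can be sketched with intersection over union (IoU), a widely used overlap measure. The text does not name a specific metric, so the choice of IoU, the function name and the box coordinates below are assumptions made for illustration only.

```python
def intersection_over_union(box_a, box_b):
    """Return the overlap ratio of two boxes given as (x_min, y_min, x_max, y_max) in pixels."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlapping region (zero if the boxes do not overlap).
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    # Union is the sum of both areas minus the region counted twice.
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical values: an annotated ("actual") box and an algorithm's predicted box.
annotated = (100, 100, 200, 200)
predicted = (150, 150, 250, 250)
score = intersection_over_union(annotated, predicted)
```

A score of 1.0 means the predicted location matches the annotation exactly, and 0.0 means the two boxes do not overlap at all; where to set an acceptance threshold between those extremes is an application-specific choice.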
Traditional manual and automatic methods for annotating images or video files have a number of limitations, however. First, annotating an image or a video file is very time-consuming for a human, who must visually recognize the location of an item in an image or video file and also place an appropriately sized box or other shape around the item within the image or a frame of the video file. Next, most automatic methods for annotating an image or a video file are computationally expensive, and may require large amounts of data and processing power in order to optimize their chances of success. For example, a ten-minute video file that was captured at a rate of thirty frames per second includes 18,000 image frames, each of which must be specifically marked with locations of objects of interest depicted therein, or designated as not depicting any such objects.
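The frame count in the example above follows directly from the duration and the frame rate, as the short sketch below shows. The helper `manual_hours` and its `seconds_per_frame` argument are hypothetical additions: the per-frame effort is a placeholder supplied by the caller, not a measured figure from the text.

```python
def frame_count(duration_minutes, frames_per_second):
    """Number of frames in a video of the given length and frame rate."""
    return int(duration_minutes * 60 * frames_per_second)


def manual_hours(frames, seconds_per_frame):
    """Rough manual annotation effort in hours, for a caller-supplied per-frame time."""
    return frames * seconds_per_frame / 3600


frames = frame_count(10, 30)  # ten minutes at thirty frames per second

# Placeholder value only: 5 seconds per frame is an assumption, not data.
hours = manual_hours(frames, seconds_per_frame=5.0)
```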