To a human, video data often contains readily distinguishable types of information, depending on context. For example, in the context of videoconferencing, one's attention may naturally be drawn to the participants, whereas other elements of the scene (e.g., furniture or windows) may be of secondary importance.
A human camera operator can readily be directed to frame a scene in a manner that reflects the features of interest, for example by including a person who is currently speaking. Automating such a process presents considerable challenges, however. It is possible to develop feature detection models that analyze a video scene to determine whether expected features, such as human faces, are present, and then use that information to determine how to process the scene. Such feature detection models tend to be difficult to develop and computationally complex to deploy, potentially increasing system costs.
Embodiments of this disclosure may be used to address the complexities of identifying areas of interest within video data, as well as related issues such as calibrating a system under dynamically changing circumstances.