Multiple cameras may be deployed at an event to simultaneously capture video streams or images from different angles and transmit them to a device for annotation and/or editing. A human operator may act as an editor, deciding which video stream contains a region-of-interest (e.g., the most salient object or person) and selecting the best video feed among the multiple video streams at any given moment. Lower-cost systems, such as video conference systems, may attempt to perform video editing automatically, without a human editor. Currently, some automated systems use sound volume as the basis for determining the best video feed. For example, an automated system may select the video stream with the highest sound volume as the one that best captures the region-of-interest. However, sound volume may be a poor indicator when sound signals are amplified by a sound amplification system, and it provides no information as to which particular region of a video stream is the region-of-interest. Other systems use the amount of motion in the video streams as an indicator of the region-of-interest. However, the amount of motion may not be reliable in certain situations. For example, a speaker at a meeting may move too little to serve as a suitable basis for motion analysis, yet nevertheless be the center of attention for the other individuals present at the meeting.
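
By way of illustration only, the volume-based selection described above might be sketched as follows. This is a minimal sketch and not the implementation of any particular system: the function names `rms` and `select_loudest_stream`, the use of a root-mean-square level as the measure of sound volume, and the sample values are all assumptions made for this example.

```python
import math


def rms(samples):
    """Root-mean-square level of one audio frame (a sequence of floats).

    RMS is an assumed stand-in for "sound volume"; an empty frame counts as silent.
    """
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_loudest_stream(audio_frames):
    """Return the index of the stream whose current audio frame is loudest.

    audio_frames holds one frame of audio samples per camera stream
    (a hypothetical input layout chosen for this sketch).
    """
    if not audio_frames:
        raise ValueError("at least one stream is required")
    return max(range(len(audio_frames)), key=lambda i: rms(audio_frames[i]))


# Hypothetical audio frames accompanying three camera streams.
frames = [
    [0.01, -0.02, 0.01, 0.00],   # stream 0: near silence
    [0.40, -0.35, 0.38, -0.41],  # stream 1: active speech
    [0.05, -0.04, 0.06, -0.05],  # stream 2: background noise
]
selected = select_loudest_stream(frames)
```

The sketch returns only the index of a stream. It carries no information about which region within the selected stream is the region-of-interest, and if every microphone picks up the same amplified signal the levels become nearly equal and the selection is no longer meaningful.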