Multiple cameras may be deployed at an event to simultaneously capture video streams or images from different angles and transmit them to an editing device. For convenience, video streams and images are collectively referred to as media clips. A human operator may act as the editor, deciding which video stream contains the region of interest (e.g., the most salient object or person) and selecting the best video feed among the multiple video streams at any given moment. Lower-cost systems (such as video conference systems) may attempt to perform video editing automatically, without a human editor. Some current automated systems try to determine the best video feed based on sound volume. For example, an automated system may select the video stream with the highest sound volume as the one that captures the region of interest. However, sound volume may not be a good indicator when sound signals are amplified by sound amplification systems, and it provides no information as to which particular region of a video stream is the region of interest. Other systems use the amount of motion in the video streams as an indicator of the region of interest. However, the amount of motion may not be reliable in certain situations. For example, a speaker at a meeting may not move much, but is nevertheless the center of attention for the other individuals present at the meeting.