A video conference system includes an endpoint that captures audio and video of participants, for example in a room during a conference, and then transmits the audio and video to a conference server or to a “far-end” endpoint. The video conference system may frame closeup or zoomed-in camera views of talking participants (i.e., talkers). The video conference system may detect faces in the captured video to assist with framing the closeup camera views. Often, the video conference system frames a camera view of a talker that is significantly wider (i.e., more zoomed-out) than desired because the video conference system is unable to detect the face of the talker. This occurs, for example, when the talker is not facing the camera. As a result, the video conference system frames zoomed-out camera views instead of more appropriate closeup views, which degrades the user experience.