A video conference system includes an endpoint that, for example, captures audio and video of participants in a room during a conference and then transmits the audio and video to a conference server or to a “far-end” endpoint. The video conference system may frame close-up or zoomed-in camera views of talking participants (i.e., talkers), and may detect faces in the captured video to assist with framing those close-up camera views. This speaker tracking improves the meeting experience by showing close-up views of the active speakers.
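
By way of illustration only, the following minimal sketch shows one way a close-up view might be framed around a detected face. It assumes the face detector has already produced a bounding box in pixel coordinates. The function name `frame_closeup`, the `scale` parameter (crop height as a multiple of face height), and the 16:9 default aspect ratio are hypothetical choices made for this example and are not taken from any particular system.

```python
from typing import Tuple

# (x, y, width, height) in pixels, origin at the top-left of the frame
Box = Tuple[int, int, int, int]


def frame_closeup(face: Box, frame_w: int, frame_h: int,
                  scale: float = 3.0, aspect: float = 16 / 9) -> Box:
    """Return a crop rectangle centered on a detected face.

    Hypothetical helper: `scale` and `aspect` are illustrative assumptions.
    """
    fx, fy, fw, fh = face

    # Size the crop from the face height, never exceeding the frame.
    crop_h = min(frame_h, int(round(fh * scale)))
    crop_w = int(round(crop_h * aspect))
    if crop_w > frame_w:
        crop_w = frame_w
        crop_h = int(round(crop_w / aspect))

    # Center the crop on the center of the face box.
    cx = fx + fw / 2
    cy = fy + fh / 2
    x = int(round(cx - crop_w / 2))
    y = int(round(cy - crop_h / 2))

    # Shift the crop back inside the frame if it overhangs an edge.
    x = max(0, min(x, frame_w - crop_w))
    y = max(0, min(y, frame_h - crop_h))
    return (x, y, crop_w, crop_h)


if __name__ == "__main__":
    # A 180x180 face near the middle of a 1920x1080 frame.
    print(frame_closeup((870, 390, 180, 180), 1920, 1080))
```

In this sketch the crop is clamped to the frame, so a face near an edge yields a view that is shifted rather than truncated. A practical system would likely also smooth the rectangle over time to avoid jittery framing.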