A presentation generated on a processing system can be a complex mixture of audio media objects (e.g., music recordings) and visual media objects (e.g., video recordings or one or more digital photos). Manually creating and editing an audio-visual presentation (e.g., a slideshow with an accompanying soundtrack, or a music video) may require significant amounts of a user's time and resources. Many conventional techniques for generating an audio-visual presentation do not consider how visual media objects are temporally combined with audio media objects when the two are presented together, or what criteria are used to combine them. These conventional techniques may result in unpleasant combinations of visual media objects with audio media objects.
For example, one or more parts of a presentation (e.g., audio media objects and visual media objects) may not thematically fit with other parts. Visual media objects readily available and selected to accompany audio media objects may not match the mood, pace, or other characteristics of the audio media objects. In other cases, a visual media object set (e.g., a photo slideshow) may set the mood and pace for selecting and playing an audio media object (e.g., a music recording), yet the audio media objects readily available and selected to accompany the visual media objects may not match the mood, pace, or other characteristics of the visual media objects. Furthermore, manually timing the presentation of a visual media object set to the pace of the audio media objects may prove difficult for the end user. Therefore, it may be desirable to automatically generate a presentation that uses characteristics of the audio and visual media objects to present both synchronously.