The proliferation of low-cost, high-quality digital capture devices such as digital cameras and smartphones has resulted in vast collections of individual and shared digital imagery, both still and video. Viewing the media assets in these ever-growing collections has become increasingly difficult due to the sheer volume of content. Recently, however, mechanisms for automatically or semi-automatically selecting and presenting desired subsets of these collections have become available, enabling those subsets to be shared and relived. While printing hardcopy photo products is often the preferred way to create archival keepsakes, in many cases a softcopy rendering is best for spontaneous viewing or for sharing with friends and family. Such softcopy renderings may take many forms, from a simple digital slideshow to an animated presentation of imagery. However, while such presentations stimulate the visual senses, they leave the other human senses unengaged. Accompanying the visual presentation with at least an audio component can produce a more pleasant viewing or playback experience. Even when such softcopy renderings include video assets, and those video assets incorporate an audio track, the video snippets may form only a fraction of the overall rendering, and their associated audio may be of inferior quality. Fundamentally, viewing digital renderings or slideshows without an accompanying audio component is often boring.
Prior work published as “Matching Songs to Events in Image Collections” (M. D. Wood, 2009 IEEE International Conference on Semantic Computing) described a system for correlating songs from a personal library of music with event-based temporal groupings of image assets, by correlating semantic information extracted from the imagery with song lyrics. However, this approach required the presence of a music library annotated with lyrics, and only worked for songs with lyrics, not for instrumental music.
Prior approaches for creating audio tracks include “Generating Music From Literature” by Davis and Mohammad, wherein the authors describe an approach for automatically generating musical compositions from literary works. That work takes the text of a novel and synthesizes music based upon the distribution of emotive words. It leverages the NRC Word-Emotion Association Lexicon, a crowdsourced mapping of English-language words to emotions:
http://www.musicfromtext.com/uploads/2/5/9/9/25993305/_transprose_final.pdf
http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
The work by Davis and Mohammad includes an analysis of the text of the novel to identify emotion densities, considering the eight emotions of anticipation, anger, joy, fear, disgust, sadness, surprise, and trust. The novel is divided into a predetermined number of sections, and the ratio of emotion words to the total number of words in each section is computed to derive an overall emotional density for that section. Changes in emotional density drive changes in the music. The system described by Davis and Mohammad, TransProse, implements a mechanism for generating a sequence of notes based upon changes in emotion in a literary work. While the current invention builds in some respects upon this work in its use of emotive concepts, that is only one aspect of the current invention, and the application is significantly different. Rather than operating over arbitrary groupings of text, the system and method of the present invention operate over sequences of images, grouped logically by theme or temporal constructs. Emotion is only one of many factors considered in the synthesis.
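The emotion-density computation described above can be illustrated with a minimal sketch. This is not the TransProse implementation; the tiny lexicon below is an illustrative stand-in for the NRC Word-Emotion Association Lexicon, and the whitespace tokenization and section-splitting scheme are simplifying assumptions.

```python
from collections import Counter

# Toy stand-in for the NRC Word-Emotion Association Lexicon
# (illustrative only): word -> set of associated emotions.
LEXICON = {
    "happy": {"joy", "trust"},
    "fear": {"fear"},
    "angry": {"anger", "disgust"},
    "hope": {"anticipation", "joy"},
    "shock": {"surprise"},
    "grief": {"sadness"},
}

EMOTIONS = ("anticipation", "anger", "joy", "fear",
            "disgust", "sadness", "surprise", "trust")

def emotion_densities(text, num_sections=4):
    """Split the text into num_sections word-level chunks and return,
    for each chunk, the ratio of emotion-bearing words to total words
    for each of the eight emotions."""
    words = text.lower().split()
    size = max(1, len(words) // num_sections)
    sections = [words[i:i + size] for i in range(0, len(words), size)]
    densities = []
    for section in sections:
        counts = Counter()
        for word in section:
            for emotion in LEXICON.get(word, ()):
                counts[emotion] += 1
        total = len(section)
        densities.append({e: counts[e] / total for e in EMOTIONS})
    return densities
```

In TransProse, per-section densities like these are then mapped to musical parameters, so a rise in, say, sadness density between sections produces a corresponding change in the generated notes.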
In “Algorithmic Songwriting with ALYSIA” (Margareta Ackerman and David Loker, International Conference on Computational Intelligence in Music, Sound, Art and Design (EvoMUSART), 2017), the authors describe a machine-learning-based system for composing lyrical musical pieces. ALYSIA is primarily intended as a tool that assists the user in composing and scoring musical pieces, but it is another demonstration of the use of algorithmic tools to automatically compose music. Another example is “Song From PI: A Musically Plausible Network for Pop Music Generation,” which uses hierarchical recurrent neural networks, a form of machine learning, to generate music. The authors of that work describe an application for generating a song about a single image, where they use the literary story composer by Kiros et al. (http://papers.nips.cc/paper/5950-skip-thought-vectors.pdf) to generate text that is then fed into their system to generate music. A comprehensive survey of methods for algorithmically composing music is available as “AI Methods in Algorithmic Composition” by Fernandez and Vico (https://jair.org/media/3908/live-3908-7454-jair.pdf).
There remains a need in the art for a system that is capable of generating an audio component to accompany a softcopy rendering of a series of digital images, particularly a system in which the audio is generated in a manner sensitive to the visual, semantic, and emotive nature of the image content, each of which may take a different form. In addition, a system is needed that is capable of generating representations that include thematic groupings in addition to traditional, purely sequential groupings.