The current era of digital media is characterized by an abundance of still images captured through image capturing devices such as cameras, mobile phones and the like. Still images capture only the actions and objects of a particular moment and do not capture the sounds associated with those actions and objects, and therefore fail to provide the experience of a video. As an example, when people go through still images of a vacation they took, the audio that was present at the time and location where each image was captured is absent. Capturing a video would provide the audio as well, but videos consume a lot of storage space.
A few existing techniques use digital still images to generate short Graphics Interchange Format (GIF) videos. One such technique uses a generative adversarial network for video, with a convolutional architecture that untangles the scene's foreground from its background, and generates tiny videos of up to a second at full frame rate that are better than those of simple baselines. However, this technique does not synthesize the audio or sounds that could have been associated with the scene present in the digital still images.
Further, some existing techniques disclose displaying an image combined with playing audio in an electronic device. In such a technique, the audio of each object in the image is extracted and played individually. The holistic audio of the image is therefore not achieved, because multiple audio clips corresponding to different objects in the image are played separately. The audio of the objects obtained using this technique is also static: it does not retain the dynamics of the image needed to produce the overall audio of the image. Another existing technique discloses identifying and filtering out uncorrelated audio data for various images, which in turn provides a filtered collection of correlated audio-visual examples. Suitable audio is then selected from a video having similar image frames, with more focus given to the activity of a particular object in the image. However, this technique also fails to provide the holistic audio that would have existed while the image was being captured.