Multimedia content is now available in abundance. Such content may include video footage, images, audio recordings, and the like, and much of it is used for purposes such as education, surveillance, security systems, medical investigations, and the like. Generally, video footage carries more information than an image or an audio recording. Therefore, retrieving sensible information from video footage, and using that information effectively, is of utmost importance. Some video footage is extremely long and may include a large amount of redundant information. Analyzing such footage may be a tiring process that consumes considerable time and effort. Further, since video footage consists of both audio data and visual data, analyzing both is important. However, most existing systems utilize only the visual descriptors of the video footage to obtain the sensible information, and the audio descriptors of the video footage are ignored.
A few existing techniques consider both video descriptors and audio descriptors when extracting the sensible information from the video footage. However, even when the audio descriptors are considered, there is a high chance that the extracted sensible information or summary does not capture a correct overview of the video footage, since a direct translation of the audio from the video footage may not identify the context of the video footage. Further, some other existing techniques disclose summary generation based on motion detected in the video footage. In these techniques, events of interest within the video footage are identified based on the corresponding metadata, and best scenes are identified based on the identified events of interest. Motion values may be determined for each frame, and the portions of the video footage that include the frames with the most motion are identified as the best scenes. A video summary can then be generated that includes one or more of the identified best scenes. However, a summary generated using this technique may lead to gaps in understanding, due to missing continuity between the scenes of the video footage and missing context of the events. The continuity may be missing because only the best scenes are considered for generating the video summary. The context may be missing because the mood of the video is not considered. Further, the best scenes may include redundant information from the video footage, which is not eliminated.
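For illustration only, the motion-based selection of best scenes described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any particular existing technique: motion is assumed to be the mean absolute pixel difference between consecutive grayscale frames, a scene is assumed to be a fixed-length window of frames, and the names `motion_values`, `best_scenes`, `window`, and `top_k` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: a frame is a grayscale image given as rows of pixel intensities.
Frame = Sequence[Sequence[int]]


def motion_values(frames: Sequence[Frame]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor.

    The first frame has no predecessor, so its motion value is 0.0.
    """
    values: List[float] = [0.0] if frames else []
    for prev, curr in zip(frames, frames[1:]):
        total = 0
        count = 0
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                total += abs(a - b)
                count += 1
        values.append(total / count if count else 0.0)
    return values


def best_scenes(
    motion: Sequence[float], window: int, top_k: int
) -> List[Tuple[int, int]]:
    """Pick up to top_k non-overlapping windows [start, end) with the most motion."""
    if top_k <= 0 or window <= 0 or window > len(motion):
        return []
    # Score every candidate window by the sum of its per-frame motion values.
    scored = [
        (sum(motion[s:s + window]), s)
        for s in range(len(motion) - window + 1)
    ]
    # Highest score first; ties are broken in favour of the earlier window.
    scored.sort(key=lambda item: (-item[0], item[1]))
    chosen: List[Tuple[int, int]] = []
    for _, start in scored:
        end = start + window
        # Keep a window only if it does not overlap one already chosen.
        if all(end <= s or start >= e for s, e in chosen):
            chosen.append((start, end))
        if len(chosen) == top_k:
            break
    return sorted(chosen)
```

Because such a selection keeps only the highest-motion windows, the frames between the chosen windows are dropped, which is consistent with the continuity gaps noted above. The sketch also uses no audio data and performs no redundancy check on the chosen windows.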