The present disclosure relates generally to the field of automatic synchronization of subtitles based on audio fingerprinting. In various embodiments, systems, methods and computer program products are provided.
Conventionally, subtitles provide users with audio description and captions representative of events related to the audio and video contents of multimedia streams. Subtitles are frequently used in noisy environments (when the audio contents cannot be heard clearly, as is often the case with movies watched on airplanes), by hearing-impaired persons, by non-speakers of the languages available in the audio streams, and by many others.
The gold standard for subtitle generation is the creation of a file containing, essentially, the texts to be displayed and the times at which each text is to be displayed. In professional production, the setting is very well controlled. That is, groups of people are designated to create subtitles for the produced contents and to certify that these subtitles are properly synchronized with the elementary audio and video streams of those contents. The final contents are often packed together in a “transport stream” file that can later be pressed onto DVDs or Blu-ray discs, or broadcast on television. In such professional production, major synchronization problems between a multimedia stream and the subtitles do not typically occur.
On the other hand, non-professional production of subtitles is often based on desktop software that lets users specify a sequence of starting times at which a certain text must be displayed on the screen (and for how long the text should be shown). The resulting subtitles are then saved to popular file formats that most multimedia playback software is able to interpret. Many communities on the Internet (such as online caption databases) are dedicated to sharing these subtitles in a wide range of languages.
With the widespread use of multimedia files and the wide range of hardware platforms on which these files can be played, it is not uncommon for the same multimedia content to be available in different formats and resolutions. A consequence is that, depending for example on the encoding settings used (e.g., a different frame rate) and on the presence of modifications to the original contents (e.g., to insert or delete advertisements), there may be multiple versions available for the same content. When such multiple versions are available for the same content, the synchronization of the resulting media with existing subtitles can be compromised, resulting in text messages (that is, subtitles or captions) that are displayed on the screen earlier or later than the corresponding audio or visual events.
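As an illustration of how such modifications displace caption timing, the following minimal sketch models two effects under stated assumptions: a uniform speed change (here assumed to arise from playing 24 frames-per-second material at 25 frames per second) and the insertion of extra segments such as advertisements. The function name `shifted_time`, its parameters, and all numeric values are hypothetical and chosen only for the example; they are not taken from the disclosure.

```python
from fractions import Fraction


def shifted_time(original_s, speed_factor=Fraction(1), inserted_segments=()):
    """Return the time (in seconds) at which an event located at
    original_s in the original content occurs in a modified version.

    Assumptions of this sketch: segments are inserted at positions given
    on the original timeline, and the speed change is then applied
    uniformly to the whole result.
    """
    # Total duration of all segments inserted at or before the event.
    added = sum(
        duration
        for position, duration in inserted_segments
        if position <= original_s
    )
    return (original_s + added) / speed_factor


# Hypothetical event located 3000 seconds into the original content.
sped_up = shifted_time(3000, speed_factor=Fraction(25, 24))
with_ad = shifted_time(3000, inserted_segments=[(600, 90)])
before_ad = shifted_time(300, inserted_segments=[(600, 90)])
```

Under these assumptions, the event at 3000 seconds moves to 2880 seconds after the speed change, or to 3090 seconds after a 90-second insertion at the 600-second mark, whereas a caption file authored against the original content would still schedule the text at 3000 seconds.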
FIG. 1 shows an example (using conventional techniques) of such compromised synchronization. More particularly, each of captioning file 101 and captioning file 103 was intended to be used to caption the same multimedia file (not shown). However, captioning file 101 and captioning file 103 do not provide the same text at the same time. For example, in caption 615 (see line 2) of captioning file 101, the text is “SPEAKER 1: XXXXXXX XXXXXXXXXXXXXXXXXXXX” (see lines 4 and 5) and the display start and end times are 00:50:02,280 to 00:50:06,046 (see line 3). On the other hand, in the corresponding caption 729 (see line 2) of captioning file 103, where the text is “SPEAKER 1: XXXXXXX XXXXXXXXXXXXXXXXXXXX” (see lines 4 and 5), the display start and end times are 00:48:25,351 to 00:48:28,937 (see line 3).
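The size of the discrepancy in this example can be computed directly from the quoted times. The following minimal sketch assumes that the timestamps use the `HH:MM:SS,mmm` layout in which they appear above; the helper `parse_timestamp` and the variable names are hypothetical and introduced only for illustration.

```python
from datetime import timedelta


def parse_timestamp(text: str) -> timedelta:
    """Parse a timestamp written as 'HH:MM:SS,mmm' into a timedelta."""
    clock, millis = text.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(
        hours=hours, minutes=minutes, seconds=seconds, milliseconds=int(millis)
    )


# Display times quoted above for caption 615 of captioning file 101.
start_101 = parse_timestamp("00:50:02,280")
end_101 = parse_timestamp("00:50:06,046")

# Display times quoted above for caption 729 of captioning file 103.
start_103 = parse_timestamp("00:48:25,351")
end_103 = parse_timestamp("00:48:28,937")

# How much later file 101 displays the same text than file 103.
start_offset = start_101 - start_103
end_offset = end_101 - end_103

# How long each file keeps the text on screen.
duration_101 = end_101 - start_101
duration_103 = end_103 - start_103
```

With these values, the start times differ by 96.929 seconds and the end times by 97.109 seconds, and the text remains on screen for 3.766 seconds in captioning file 101 versus 3.586 seconds in captioning file 103. Because the two offsets are not equal, applying a single constant shift to one file would not reproduce both display times of the other.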
In another example of a conventional technique, some well-known media players (such as VLC) offer the ability to download a subtitle file corresponding to a media file to be played. The downloading of the subtitle file is based on the name of the media file to be played. If the timing of the content in the media file differs in any way from that of the file used as a template for the subtitle generation, there will be a mismatch between the sound and the subtitles. This mismatch occurs because, in the subtitle file, the display time of each subtitle is hard-coded rather than based on the content being displayed.