The advance of high-speed mobile Internet and the growing capacity of user devices, such as mobile phones, smartphones and tablets, have given rise to a new way of consuming mobile live video streaming services. There is also a high demand from users to film a social event, e.g. a football game or a music festival, in order to present the users' own version of storytelling. The emerging applications allow users to produce videos collaboratively using multiple mobile cameras in a manner similar to how professional live TV is produced. As shown in FIG. 1, the scenario includes three user roles, namely producers, directors and consumers. The producers are users with user devices 1, 2, 3 who collaboratively record and stream video feeds, for example in a stadium, to application servers or system 10. A mix of video feeds enables the directors to conduct video direction and rich-content assertion. The consumers are able to watch a live broadcast of the event from different viewpoints based on the directors' selection rather than only the few options provided by traditional TV broadcasting.
In a social multimedia environment, it is desirable for directors to monitor synchronized bitstreams from the producers. Simply sending each bitstream to its physical output hardware at the same time will not necessarily ensure synchronization. In professional live video production, the synchronization among multiple camera feeds is done by specialized hardware. However, this approach is not practical when streaming video from user devices 1, 2, 3 via wireless connections. The reason is that delay is an inherent feature of wireless networks, and network congestion often happens when the volume of data traffic goes up. This implies that each user device 1, 2, 3 experiences a different network delay, which may further vary for a given user device 1, 2, 3 over time. As a consequence, the differences and variations in network delay cause the arrival time of each video stream at the system 10 to be different. The divergence in arrival time has a great impact on the perceived video frames, resulting in asynchrony in the live feeds presented to the directors. This means that the directors will not be able to edit the multiple bitstreams in a synchronized manner. FIG. 2 illustrates bitstreams or video streams 81, 82, 83 from user devices 1, 2, 3, in which the marked video frames 91, 92, 93 are taken by the cameras of the user devices 1, 2, 3 at the same time. Due to network delay, the marked video frames 91, 92, 93 arrive at the system 10 at different times. Thus, one of the most important requirements of social video streaming is adequate synchronization so that the video streams are aligned with one another. Multi-producer video filming thus gives rise to a problem of asynchrony, which has to be solved.
Various techniques for achieving synchronization among video streams have been proposed in the art.
In one solution, clock synchronization is used. Synchronization offsets are calculated using timestamps generated by the cameras' internal clocks on the user devices. This solution is one of the most processing-efficient methods. However, some user devices do not have an internal high-resolution clock. Thus, clock drift and skew may cause the user devices to fall out of synchronization. In addition, the solution requires all the user devices to synchronize with a centralized Network Time Protocol (NTP) server. The transmission delay between each user device and the system would also differ from device to device, especially when the wireless network is highly congested.
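By way of a non-limiting illustration, the timestamp-based offset calculation described above may be sketched as follows. The function names, device labels and millisecond values are hypothetical and chosen only for this example; the sketch assumes that each stream exposes the capture timestamp of the frame most recently received at the system. It also shows how an error in one device's clock carries directly into the computed offset.

```python
def buffering_delays_ms(latest_capture_ts_ms):
    """Per device, the delay (ms) to add so that all streams line up
    with the stream whose newest frame has the oldest capture time."""
    oldest = min(latest_capture_ts_ms.values())
    return {dev: ts - oldest for dev, ts in latest_capture_ts_ms.items()}


def apply_clock_error(ts_ms, clock_error_ms):
    """Simulate devices whose internal clocks are off by a fixed amount."""
    return {dev: ts + clock_error_ms.get(dev, 0) for dev, ts in ts_ms.items()}


# Hypothetical capture times (ms) of the frames arriving right now.
true_ts = {"device1": 10_000, "device2": 9_800, "device3": 9_950}

# Offsets when every device clock is accurate.
ideal = buffering_delays_ms(true_ts)

# Assumption for illustration: the clock of device3 runs 40 ms fast.
skewed = buffering_delays_ms(apply_clock_error(true_ts, {"device3": 40}))

print(ideal)   # offsets with accurate clocks
print(skewed)  # the offset of device3 is wrong by the clock error
```

In this sketch the 40 ms clock error of one device appears unchanged as a 40 ms misalignment of its stream, which is why the approach depends on accurate and mutually synchronized device clocks.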
In another solution, audio fingerprints are extracted from audio streams and compared to find a match among all the audio streams when multiple cameras are recording the same event. By comparing the occurrence of similar sound matches, it may be possible to calculate the synchronization offset. However, this solution requires all the user devices to be close enough to the event, since the speed of sound is much slower than the speed of light. When watching a sports game in a large stadium, the sound recorded by a user device that is closer to the sound source could be up to one second ahead of the sound recorded by another user device. Furthermore, the noise generated by the crowd would also decrease the accuracy of finding suitable audio fingerprints. This means that audio fingerprinting will generally not be very reliable for achieving video frame synchronization involving multiple user devices.
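The magnitude of this acoustic delay can be estimated with simple arithmetic. The following minimal sketch assumes a speed of sound of approximately 343 m/s (dry air at about 20 °C) and a frame rate of 30 frames per second; the distances and the function names are hypothetical and serve only as an example.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate value, dry air at about 20 °C


def audio_lead_s(near_m, far_m):
    """Seconds by which the nearer device hears a sound before the farther one."""
    return (far_m - near_m) / SPEED_OF_SOUND_M_S


def audio_lead_frames(near_m, far_m, fps=30.0):
    """The same lead expressed in video frames at the given frame rate."""
    return audio_lead_s(near_m, far_m) * fps


# Hypothetical positions: one device 10 m from the source, another 353 m away.
lead_s = audio_lead_s(10.0, 353.0)
lead_frames = audio_lead_frames(10.0, 353.0)
print(lead_s, lead_frames)
```

Under these assumptions, a path difference of roughly 343 m corresponds to about one second of audio lead, i.e. on the order of thirty video frames, which is far coarser than frame-level synchronization requires.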
In a further solution, external hardware-synchronized cameras, or so-called inter-camera synchronization, are assumed. Such a solution requires physically connecting the cameras of the user devices to external synchronization hardware. It is often used in professional live video production. However, in the social video streaming scenario, synchronizing all users' user devices at a social event in this way is impractical, if not impossible.
In yet another solution, timestamps are added to the video streams by having new features implemented in base stations in cellular or mobile communication networks. However, a problem is that not all user devices are connected to the Internet through the same network provider, and some of them may be connected via a Wireless Local Area Network (WLAN) provided by the event organizer. In order to overcome such a problem, this solution has to access each base station and WLAN access provider, which introduces complicated management issues in heterogeneous networks and increases the corresponding cost.
A further solution involves analyzing the incoming video streams and monitoring the sequence of video frames for the occurrence of at least one of a plurality of different types of visual events. The occurrence of a selected visual event should be detected among all the video streams and taken as a marker to synchronize all video streams. However, this solution requires all user devices to record at least one common visual event in order to find the marker among all the video streams from the user devices. If the user devices are focusing on different parts of the event, there is no way for this solution to identify the marker.
U.S. Pat. No. 6,317,166 discloses a synchronization frame generator that is used for creating simultaneous, easily visible synchronization markers as part of a multi-channel image generating system. A simple detection circuit can be used to detect the unique synchronization frames during playback of any recording made from a multi-camera system.
There is therefore a need for an efficient solution to achieve synchronization of bitstreams originating from different user devices.