Wearable technology is on the rise for both business and personal use. Wearable devices such as intelligent eyewear, smart watches, and hi-tech clothing have found applications in various domains, including medical, gaming, industrial, fitness, and lifestyle. This has allowed for ubiquitous computing through wearable sensing over the last decade. One of the most common applications of wearable sensing is to capture an egocentric video using an egocentric camera for analysis. The egocentric video provides a first-person view of events depicted by video sequences.
The egocentric video is typically a combination of relevant and non-relevant video segments, where relevance depends on the intended application. For example, police officers can wear egocentric cameras to record interactions with defaulters (e.g., drunk drivers) as well as the ambient surroundings for additional cues. It is therefore critical to automatically analyze the relevant video segments (e.g., a misdemeanour by a defaulter) while ignoring the non-relevant video segments (e.g., walking towards the car, drinking coffee, etc.) so that insights can be obtained efficiently during legal proceedings. Similarly, egocentric video analysis may be performed to improve education and cognitive healthcare.
One conventional approach for the automatic analysis of egocentric videos identifies patterns of attention and social interactions as relevant video segments based on a combination of audio and visual (AV) cues in the egocentric videos. However, the use of such an AV combination cannot be generalized to identify a relevant video segment when multiple activities are performed simultaneously. For example, the AV approach cannot correctly identify different food preparations as the relevant video segments when a user is cooking while speaking on a phone. Another approach identifies the relevant video segments through classification of pre-segmented activities in the egocentric video. However, this approach requires all video segments to be processed individually, thereby increasing the computational complexity.
Further, traditional approaches typically analyze the behavior of objects or persons appearing in an egocentric video. However, they do not focus on the activities performed by the user who is wearing the camera to capture the egocentric video. As a result, the user's behavior is not analyzed effectively.
Therefore, there exists a need for a computationally efficient method that reliably performs activity analysis of users in an egocentric video.