Multi-media such as images, video and audio is captured by one or more multi-media capturing devices. The multi-media capturing devices may include a camera, a video recorder, a microphone and other such devices. For example, if a video of a large area is to be captured, a plurality of multi-media capturing devices, for example cameras and microphones, is spread across the large area such that video of each region of the large area is captured. Sometimes, a few areas within the large area are given higher priority, and accordingly the video of the prioritized areas is captured. In such systems, a human may be needed to operate the plurality of multi-media capturing devices in order to obtain the required multi-media of the large area. One or more existing techniques provide an automated method for capturing multi-media such as video or audio in an area using a plurality of multi-media capturing devices. One existing technique may track moving objects and capture multi-media of the objects using one or more multi-media capturing devices. Another existing technique detects the voice, gestures and gaze of persons in an area and may capture multi-media of those persons. However, systems using the existing techniques may still require human intervention to operate the capturing devices. Some systems using the existing techniques may not be configured to obtain activity data for the entire area in order to capture multi-media of an area of interest. Consequently, such systems may not be able to efficiently capture an important area of interest at the right time.