Content-based multimedia search has gained considerable attention with the rapid increase in the quantity and quality of multimedia. As the ability to broadcast video content (including games) has extended beyond television to the Internet and mobile phones, video advertising is becoming an attractive and plausible source of revenue. While video advertising today accounts for only a minuscule proportion of media budgets, it presents a significant opportunity for advertisers to extend the reach of their campaigns with compelling content. This calls for selecting advertisements that are relevant to the viewers being targeted, based on the video content. There is a definite need to determine the deeper semantics of the video and to select relevant advertisements based on those semantics.

In order to provide deeper semantics for a multimedia content, it is necessary to make effective use of the prevailing structure of the multimedia content. For example, in computer vision, the processing is organized at various levels: low-level syntactic analysis, intermediate-level structural analysis, and high-level semantic analysis. A typical medium- to long-duration multimedia content is structured at various levels: shot level (small duration), scene level (not-so-small duration, and also representing a semantic unit), segment level (medium duration), multi-segment level (not-so-long duration), and full-length level (long duration).

The challenge is to provide annotations of the multimedia content at several of these levels. This is addressed by building upon the annotations at lower levels, so that the system makes use of all of the information available from the analysis up to that point, while at the same time ensuring that the annotations at the various levels are consistent with each other. The present invention addresses the issue of providing annotations of a multimedia content at various levels, leading to a better characterization of the multimedia content.
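
By way of illustration only, the idea of building higher-level annotations upon lower-level ones and checking that the levels agree can be sketched as follows. The sketch is not the method of the invention: the `Unit` class, the duration-weighted averaging rule, the `min_score` threshold, and the consistency rule (every label of a unit must also appear on at least one of its children) are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """A hypothetical node in the content hierarchy."""
    level: str    # e.g. "shot", "scene", "segment", "multi-segment", "full-length"
    start: float  # start time in seconds
    end: float    # end time in seconds
    labels: dict = field(default_factory=dict)    # annotation label -> score in [0, 1]
    children: list = field(default_factory=list)  # units one level below

    @property
    def duration(self):
        return self.end - self.start


def annotate_from_children(unit, min_score=0.3):
    """Assumed rule: a unit's labels are the duration-weighted mean of its
    children's label scores, keeping only labels scoring at least min_score."""
    total = sum(child.duration for child in unit.children)
    if total <= 0:
        return unit
    scores = {}
    for child in unit.children:
        weight = child.duration / total
        for label, score in child.labels.items():
            scores[label] = scores.get(label, 0.0) + weight * score
    unit.labels = {label: s for label, s in scores.items() if s >= min_score}
    return unit


def is_consistent(unit):
    """Assumed rule: every label on a unit must appear on at least one child,
    and the same must hold recursively for each child."""
    if not unit.children:
        return True
    child_labels = set()
    for child in unit.children:
        child_labels.update(child.labels)
    return set(unit.labels) <= child_labels and all(
        is_consistent(child) for child in unit.children
    )


# Usage: two shot-level annotations are combined into a scene-level annotation.
shots = [
    Unit("shot", 0.0, 4.0, {"car": 0.9, "road": 0.6}),
    Unit("shot", 4.0, 10.0, {"car": 0.8}),
]
scene = annotate_from_children(Unit("scene", 0.0, 10.0, children=shots))
print(sorted(scene.labels), is_consistent(scene))
```

In this example the label "road" is dropped at the scene level because its weighted score falls below the assumed threshold, while "car" is retained. The same step could be repeated from scenes to segments and upward, under the same assumptions.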