1. Field of Art
The invention generally relates to an indexing system and more specifically to improving the efficiency of lookup and retrieval of videos from a stored index.
2. Description of the Related Art
Online video hosting services may contain thousands or millions of video files, making management of these libraries an extremely challenging task. The challenges become particularly significant in the case of online video hosting services where many users can upload video content for viewing by others. In order to provide efficient management of video hosting services, search engines have been developed to enable a user to determine whether an input item of video content matches reference video content in a large video database.
To facilitate searching of the large video database, the reference videos may be indexed into a searchable reference index of lookup keys generated based on the reference videos. Each lookup key in the index is associated with a set of reference videos which contain data corresponding to the lookup key. When an input video is received by a search engine, a set of lookup keys is generated for the input video and used to search the index in order to identify reference videos (or portions of reference videos) that have characteristics in common with the input video. Based on the retrieved information, one or more reference videos (or portions of reference videos) can be matched to the input video.
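By way of illustration only, the index lookup described above can be sketched as an inverted index that maps each lookup key to the reference videos containing it. The sketch is not the method of any particular search engine. The function names `build_index` and `match_candidates`, the string keys, and the sample data are all hypothetical, and key generation from video content is assumed to have happened elsewhere.

```python
from collections import Counter, defaultdict


def build_index(reference_keys):
    """Map each lookup key to the set of reference video IDs that contain it."""
    index = defaultdict(set)
    for video_id, keys in reference_keys.items():
        for key in keys:
            index[key].add(video_id)
    return index


def match_candidates(index, input_keys):
    """Count, per reference video, how many distinct input keys it shares."""
    votes = Counter()
    for key in set(input_keys):
        # .get() avoids inserting unknown keys into the defaultdict.
        for video_id in index.get(key, ()):
            votes[video_id] += 1
    return votes


# Hypothetical reference library: video ID -> lookup keys already generated.
reference_keys = {
    "ref_a": ["k1", "k2", "k3"],
    "ref_b": ["k2", "k4"],
    "ref_c": ["k5"],
}

index = build_index(reference_keys)
votes = match_candidates(index, ["k2", "k3", "k9"])
print(votes.most_common())
```

In this sketch, a reference video that shares more lookup keys with the input video ranks higher as a candidate match. Each index lookup returns the full set of reference videos for a key, so the cost of a lookup grows with the size of that set.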
Problems in the efficiency of this matching occur when an index lookup performed by the search engine returns a very large list of reference videos associated with a particular lookup key. “Clumping” is a statistical term used to describe an instance in which a lookup key is associated with a number of reference videos that is significantly larger than the number of reference videos associated with the other keys in the reference index or a threshold value defined, for example, by an administrator of a system. Clumping is caused by a non-uniform distribution of the number of reference videos associated with the set of lookup keys. Clumping occurs when a key is generated based on data that is associated with a large number of the indexed items, for example data which contains a feature that is commonly found in the population of indexed items. For example, a lookup key generated based on a portion of a reference video which contains an image or soundtrack will create clumping if the image or soundtrack is prevalent in the population of reference videos being searched. Accordingly, instances of clumping associated with lookup keys are specific to the population of indexed items. An “amount of clumping”, as used herein, refers to the relative number of reference identifiers associated with the lookup key or a set of lookup keys. A lookup key that is associated with a large number of reference identifiers is associated with a large amount of clumping and a lookup key that is associated with a small number of reference identifiers is associated with a small amount of clumping.
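As a minimal sketch of how clumping might be measured, the following counts the reference identifiers associated with each lookup key and flags keys whose count exceeds a threshold. The threshold may be supplied directly, as by an administrator, or derived from the other keys. Deriving it as a multiple of the median count is an assumption made here for illustration; the text above does not prescribe a particular statistic. The names `posting_sizes`, `clumped_keys`, and `factor`, and the sample index, are hypothetical.

```python
from statistics import median


def posting_sizes(index):
    """Number of reference identifiers associated with each lookup key."""
    return {key: len(ids) for key, ids in index.items()}


def clumped_keys(index, threshold=None, factor=10.0):
    """Return keys whose reference count exceeds the threshold.

    If no threshold is given, one is derived as factor * median count.
    This derivation is an illustrative assumption, not a prescribed rule.
    """
    sizes = posting_sizes(index)
    if not sizes:
        return set()
    if threshold is None:
        threshold = factor * median(sizes.values())
    return {key for key, n in sizes.items() if n > threshold}


# Hypothetical index: one key derived from a feature shared by many videos.
index = {
    "common_intro": {f"ref_{i}" for i in range(500)},
    "k1": {"ref_1", "ref_2"},
    "k2": {"ref_3"},
    "k3": {"ref_4", "ref_5", "ref_6"},
}

sizes = posting_sizes(index)
flagged_relative = clumped_keys(index)               # relative to other keys
flagged_absolute = clumped_keys(index, threshold=2)  # administrator-defined
print(sizes["common_intro"], flagged_relative, flagged_absolute)
```

Because the counts depend on which items are indexed, the same key may be flagged in one population of reference videos and not in another.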
Because multiple index lookups are necessary to retrieve a matching item, a very large list of matching results may be created if several lookup keys associated with an input video each produce a large list of matching results. In such a situation, the system may be unable to handle the large data flow of matching results due to constraints such as processing power, required retrieval times, memory, or network bandwidth. Due to these limitations, an efficient lookup and retrieval system that minimizes clumping while maintaining the accuracy of the matching process is needed.