A content system may receive a very large number of content items, such as videos, uploaded by users and other entities. These videos may comprise a variety of different types of content, such as movies and home videos, and may differ significantly in composition, lighting, and other aspects. Many of the uploaded videos may in fact be the same video, or similar versions of the same video. This may present a problem for the content system as it attempts to organize and categorize the videos, or to present them for recommendation to users of the content system. If, for example, the content system were to recommend two videos to a user that were actually the same video, the user experience would be diminished, and user engagement with the content system may decrease. Thus, it is desirable for the content system to be able to detect the similarity between videos in the content system such that it is able to determine which videos may be duplicates or similar versions of other videos.
However, given the very large number of videos received by the content system, the content system needs an efficient method of determining the similarity between videos. Methods such as analyzing the content of each video in order to determine similarity may require processing power in excess of that which is available to perform such similarity comparisons. Furthermore, the varied nature of the composition and content of the videos may cause inaccuracies when attempting to determine video similarity using video content. Thus, a better and more efficient method of determining similarity between videos is lacking and is desired.