Machine learning is used by service provider systems to support a variety of video functionality (e.g., video retrieval, video searching, video recommendations, and so forth) based on a determination of similarity of videos, one to another. A user, for example, may be presented with a list of recommended videos by a service provider system based on videos that were previously viewed by the user. In order to determine this similarity, the service provider system uses machine learning to generate representations of the videos that are comparable, one to another. Consequently, accuracy of the representations in describing respective videos also drives accuracy of the video comparison and resulting recommendations in this example.
Conventional machine learning techniques that are used to generate representations of videos, however, are designed for individual digital images and not videos. For example, conventional machine learning techniques generate a representation of each frame of the video, individually, and these per-frame representations are then aggregated to form a representation of the video as a whole. As a result, conventional representations describe content included in individual frames but not a relationship of these frames to each other, e.g., changes to the video that occur in the frames over time. Consequently, conventional representations are limited to what is described in individual frames.
Further, conventional machine learning techniques result in representations having lengths that vary based on a number of frames in the video. This is because conventional techniques, as described above, rely on descriptions of content included in each of the individual frames. Thus, videos having greater numbers of frames also have correspondingly greater representation lengths to describe these videos. This results in increased computational resource usage to generate these representations. Additionally, differences in the length of the representations also introduce complexities in a determination of similarity of the representations to each other by the service provider system and thus also increase computational resource usage.
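By way of illustration only, the following minimal sketch shows how a representation built from per-frame descriptions grows with the number of frames. The descriptor (a mean of each row of pixel values), the use of concatenation as the aggregation, and the names `frame_descriptor` and `conventional_video_representation` are hypothetical and chosen for brevity; they are not taken from any particular conventional system.

```python
from typing import List

Frame = List[List[float]]  # a toy frame: rows of pixel intensities


def frame_descriptor(frame: Frame) -> List[float]:
    """Hypothetical per-frame descriptor: the mean of each pixel row."""
    return [sum(row) / len(row) for row in frame]


def conventional_video_representation(frames: List[Frame]) -> List[float]:
    """Assumed aggregation: concatenate the per-frame descriptors.

    Each frame contributes its own block of values, so the length of the
    result is proportional to the number of frames in the video.
    """
    representation: List[float] = []
    for frame in frames:
        representation.extend(frame_descriptor(frame))
    return representation


# Two toy videos made of identical 2x2 frames, differing only in frame count.
short_video = [[[0.0, 1.0], [1.0, 0.0]] for _ in range(3)]
long_video = [[[0.0, 1.0], [1.0, 0.0]] for _ in range(5)]

short_rep = conventional_video_representation(short_video)
long_rep = conventional_video_representation(long_video)

print(len(short_rep), len(long_rep))  # prints "6 10"
```

Because the two representations differ in length, an element-wise comparison is not directly possible without an additional step such as padding, truncation, or alignment, which is one source of the added complexity noted above.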
Additionally, conventional techniques typically encode redundant information between frames of a video due to individual representation of each frame in the video using digital image techniques. For example, consecutive frames of the video may have similar content and thus result in similar, redundant representations of that content in conventional representations. This redundant information is of little use in distinguishing one video from another and thus also increases computational resource usage by the service provider system due to comparisons that are forced to address this redundant information. Thus, these limitations in conventional representations of video generated by service provider systems limit an ability of these systems to accurately support functionality that is dependent on these representations, such as video retrieval, video searching, and video recommendations, among others.
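As a hedged illustration of this redundancy, the sketch below measures how similar the descriptors of consecutive frames are. The descriptor values are made-up numbers standing in for a slowly changing scene followed by a scene cut, and the 0.99 threshold and the name `redundant_pairs` are assumptions made for this example only.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical per-frame descriptors (illustrative values, not measured data).
frame_descriptors = [
    [1.00, 0.50, 0.20],
    [1.00, 0.51, 0.20],  # nearly identical to the previous frame
    [0.99, 0.51, 0.21],  # nearly identical again
    [0.10, 0.90, 0.80],  # scene cut: content changes substantially
]

# Similarity of each frame's descriptor to that of the next frame.
similarities = [
    cosine_similarity(current, following)
    for current, following in zip(frame_descriptors, frame_descriptors[1:])
]

# Count consecutive pairs that carry almost the same information.
redundant_pairs = sum(1 for s in similarities if s > 0.99)
print(redundant_pairs)  # prints "2"
```

In this toy case two of the three consecutive pairs are nearly identical, so a representation that stores every per-frame descriptor repeats largely the same information while adding to the cost of each comparison.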