In various applications it may be desirable to find similarities between a new or sample media file and one or more other media files. For example, a website that allows users to upload audio or video files may use such a comparison to identify copyrighted content and thus potential infringement. To do so, some techniques use hashes that characterize the media files or portions of the media files, for example at or near a given time point within a media file. Such techniques may scale poorly as the number of known media files grows, or as the media files become larger. For example, as the number of media files grows, a large number of matching hashes may be retrieved for a given portion of a media file, yet many of the retrieved hashes may not represent true matches to that portion. It may therefore be difficult to identify which hashes are desirable to use for the comparison. For example, some media files may include segments whose hashes are often matched to new media files but do not necessarily indicate a good match between the content of those files.
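The difficulty described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the integer hash values, the file names, the `build_index` and `lookup` helpers, and the idea that hash `7` stands for a segment common to many files (for example, near-silence). It is not a description of any particular fingerprinting technique.

```python
from collections import defaultdict


def build_index(reference_hashes):
    """Map each hash value to every (file_id, time_offset) where it occurs.

    `reference_hashes` is a hypothetical structure:
    file_id -> list of (time_offset, hash_value) pairs.
    """
    index = defaultdict(list)
    for file_id, hashes in reference_hashes.items():
        for offset, h in hashes:
            index[h].append((file_id, offset))
    return index


def lookup(index, sample_hashes):
    """Count, per reference file, how many index entries the sample's hashes hit."""
    hits = defaultdict(int)
    for _, h in sample_hashes:
        # .get avoids creating empty entries for hashes that are not indexed
        for file_id, _ in index.get(h, []):
            hits[file_id] += 1
    return dict(hits)


# Made-up reference set: hash 7 appears in every file, the others are distinctive.
reference_hashes = {
    "song_a": [(0, 111), (1, 7), (2, 222), (3, 333)],
    "song_b": [(0, 7), (1, 7), (2, 444)],
    "song_c": [(0, 7), (1, 555), (2, 7)],
}

# Made-up sample that actually overlaps only with song_a (hashes 222 and 333).
sample = [(0, 7), (1, 222), (2, 333)]

index = build_index(reference_hashes)
hits = lookup(index, sample)
print(hits)
print(len(index[7]), "index entries share hash 7")
```

In this toy data, `song_b` and `song_c` each collect two hits purely from the common hash, even though they share no distinctive content with the sample. As the reference set grows, such frequently matched hashes return more and more entries per lookup, which is the scaling concern raised above.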