Document-receiving organizations may receive vast quantities of printed forms from users, such as magazine subscription forms, change-of-address forms, or generally any forms that may be used to provide information. A received form may include the underlying typographic information of the form and information added to the form by a user. The document-receiving organization may generate electronic images of the received forms to facilitate processing the forms. Since the document-receiving organization may receive many different types of forms, each of which may be processed differently, the processing of the forms may be expedited if the type of each form can be automatically identified, such as by comparing the electronic image of each form to electronic images of blank forms, or form templates. However, while the received forms may include all or part of a blank form, the received forms may also include one or more variations from the blank form, such as information added to the forms by users, facsimile markings, coffee stains, ink smudges, etc. The variations may result in noise which renders image comparison techniques based on pixel and location checking ineffective, thereby requiring the receiving organization to manually identify and/or classify each received form.
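As an illustration only, the brittleness of comparison by pixel and location may be sketched as follows. The sketch assumes tiny binary images represented as lists of rows, and the names `parse`, `shift_right`, and `pixel_match_ratio` are hypothetical helpers introduced here, not part of any described system. It scores a received image by the fraction of positions at which it agrees with a blank template, and shows that user-added marks or a one-pixel scanning offset may lower the score even though the underlying form is unchanged.

```python
def parse(rows):
    """Convert rows of '#' (ink) and '.' (blank) into a binary image (hypothetical helper)."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def shift_right(image, n):
    """Simulate a scanning offset by shifting every row n pixels to the right."""
    return [[0] * n + row[:len(row) - n] for row in image]


def pixel_match_ratio(template, received):
    """Fraction of pixel positions at which two equally sized images agree."""
    if len(template) != len(received) or any(
        len(t_row) != len(r_row) for t_row, r_row in zip(template, received)
    ):
        raise ValueError("images must have identical dimensions")
    total = sum(len(row) for row in template)
    same = sum(
        t_px == r_px
        for t_row, r_row in zip(template, received)
        for t_px, r_px in zip(t_row, r_row)
    )
    return same / total


# Assumed toy data: a blank form with two vertical rules.
TEMPLATE = parse([
    ".#..#...",
    ".#..#...",
    ".#..#...",
    ".#..#...",
])

# The same form with marks added by a user.
FILLED = parse([
    ".#..#...",
    ".##.#.#.",
    ".##.#.#.",
    ".#..#...",
])

# The same blank form scanned one pixel off.
SHIFTED = shift_right(TEMPLATE, 1)

if __name__ == "__main__":
    print("clean  :", pixel_match_ratio(TEMPLATE, TEMPLATE))
    print("filled :", pixel_match_ratio(TEMPLATE, FILLED))
    print("shifted:", pixel_match_ratio(TEMPLATE, SHIFTED))
```

In this toy case the shifted scan scores lower than the filled-in form, which suggests why a fixed match threshold may fail to identify the form type reliably when such variations are present.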
Likewise, the growth of user-generated content on the Internet may be increasing the occurrence of unauthorized posting of copyrighted content. For example, user-generated content, such as “mash-ups”, may often include all or part of copyrighted content, such as songs, images, or video. The unauthorized posting of copyrighted content may be creating a challenging situation for web sites hosting vast quantities of user-generated content, such as YouTube™. For example, certain jurisdictions may require that the hosting web sites remove any unauthorized copyrighted content once notified that the copyrighted content has been posted on their site. Thus, the hosting sites may be required to monitor user-generated content to determine whether the user-generated content includes copyrighted content. For example, copyright owners may provide exemplar images, video, or audio files, and the hosting sites may be required to search user-generated content for images, video, or audio files which may be similar to the files provided by the copyright owners. However, while the user-generated content may include all or part of copyrighted content, the user-generated content may also include one or more variations from the copyrighted content. The one or more variations in the user-generated content may result in noise which renders direct content comparison techniques, or sampling techniques, ineffective in identifying the copyrighted content. Further variations in the user-generated content, such as variations in resolution, sampling rate, or noise, may also prevent the hosting sites from correctly identifying copyrighted content.