This application is directed to finding document content in a large mixed-type document collection containing rich document content using two-dimensional (2D) visual fingerprints.
Modern rich document content applications such as Microsoft Word and PowerPoint incorporate different object types that can be individually manipulated. A rich content page may be composed of a plurality of different objects such as text, line-art, and photograph content, for example. These rich document content applications usually provide a convenient set of tools for accessing individual objects and for editing and re-positioning content relative to other objects on the page.
The reuse of previous document material through cut and paste and repositioning of objects on the page is a widespread practice in creating rich presentation content with applications such as Microsoft PowerPoint and page layout programs.
This object identity, however, is lost when objects are rendered to produce a series of page images for visual two-dimensional fingerprinting. When rendering page objects for such two-dimensional fingerprinting, the entire page content is “flattened” first to form an image, which is then fingerprinted by computing visual fingerprints from local geometry or image appearance properties. In consequence, some of the resulting fingerprints may “bind together” local properties of different types of objects that happen to reside in close local proximity.
For example, consider a page composed of a photo image and a closely placed text caption directly below. A visual fingerprint near the bottom boundary of the image may involve a local neighborhood whose upper portion contains part of the photo and whose lower portion contains part of the caption text. Thus the resulting fingerprint is likely to mix together (a) photo properties with (b) text properties in its appearance.
The problem with hybrid mixed-content visual fingerprints is that they are rather unforgiving to minor local changes such as a user selecting and moving a text caption object closer to or further away from an associated photo object. The resulting mixed-content fingerprints of a modified page are likely to be different from the original fingerprints due to the resulting change in local appearance. The mixed-content fingerprints require precise visual alignment between unrelated object types.
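The fragility described above can be illustrated with a toy sketch. It is not the visual fingerprinting method of this application: the page is reduced to a single column of labeled rows, a "fingerprint" is simply a hash of a three-row window, and the names render, fingerprints, and surviving are hypothetical helpers introduced only for this illustration. Under those assumptions, moving the caption one row further from the photo leaves every single-object fingerprint intact while the mixed photo-and-caption fingerprint no longer matches.

```python
import hashlib

def render(gap):
    """Flatten a photo object and a caption object into one column of rows.
    'P' rows belong to the photo, 'T' rows to the caption, '.' rows are blank."""
    photo = [f"P{r}" for r in range(6)]
    caption = [f"T{r}" for r in range(4)]
    return photo + ["."] * gap + caption

def fingerprints(rows, height=3):
    """Hash every sliding window of rows (a stand-in for a local fingerprint)
    and record whether the window mixes photo and caption content."""
    result = {}
    for top in range(len(rows) - height + 1):
        window = rows[top:top + height]
        kinds = {value[0] for value in window if value != "."}
        digest = hashlib.sha1("|".join(window).encode()).hexdigest()[:8]
        result[digest] = "mixed" if len(kinds) > 1 else "single"
    return result

def surviving(original, modified):
    """Count how many original fingerprints of each kind reappear unchanged."""
    kept = {"single": 0, "mixed": 0}
    total = {"single": 0, "mixed": 0}
    for digest, kind in original.items():
        total[kind] += 1
        if digest in modified:
            kept[kind] += 1
    return kept, total

if __name__ == "__main__":
    original = fingerprints(render(gap=1))  # caption one blank row below photo
    modified = fingerprints(render(gap=2))  # caption moved one row further away
    kept, total = surviving(original, modified)
    print("single-object fingerprints kept:", kept["single"], "of", total["single"])
    print("mixed-content fingerprints kept:", kept["mixed"], "of", total["mixed"])
```

In this simplified setting the content of each object is unchanged by the move, so only the windows that straddle both objects lose their match.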
The hybrid mixed-content fingerprint behavior is at odds with the user expectation, which is that unless there is a substantial change, a user is still likely to think of a modified page with a slightly different distance between a photo object and a text caption object as being composed of essentially the same content.
Therefore, unless the content being sought is exactly identical, hybrid mixed-content in rich documents should be avoided when employing existing two-dimensional visual fingerprinting. The disclosure of the present application addresses this issue, disclosing a method and system which provide for the detection of hybrid mixed rich content.