Measuring a level or degree of similarity between objects can be very useful in many applications. Image similarity is typically assessed based upon the images' inherent color and texture properties. 3D objects can be compared using shape matching algorithms that consider topology and feature matching. Textual content can be matched by metrics ranging from a simple “diff” program to more advanced pattern matching and semantic grouping algorithms.
Although many systems and mechanisms are known for determining and measuring levels of similarity among XML documents, computing similarity among SVG documents is not nearly so simple. SVG content is based upon an underlying XML format. The fundamental difficulty with SVG documents is that two given SVG documents with a very similar underlying XML representation may have completely different visual representations when rendered, and vice versa. FIGS. 1(a) and 1(b) exemplify this issue. Although the two SVG documents in FIGS. 1(a) and 1(b) look identical, their underlying textual representations are quite different from one another. In particular, FIG. 1(a) makes use of <defs> and <use> elements for predefining the shapes and reusing them with different colors and positions. FIG. 1(b), on the other hand, renders each shape separately without any reuse. If a system relies on traditional document comparison methods to determine the similarity between these documents, the documents might be classified as being vastly different. In addition, traditional pixel-based methods for determining levels of similarity are not optimal, as one would have to convert the SVG-rendered content into raster graphics, and the process becomes even more complicated when animations are involved.
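The mismatch between textual and visual similarity described above can be illustrated with a minimal Python sketch. The two SVG strings below are hypothetical examples written for this illustration (they are not the documents of FIGS. 1(a) and 1(b)): both are intended to draw the same red and blue circles, one through <defs> and <use> and the other with explicit shapes. The helper names text_similarity and tag_names are likewise assumptions, and the character-level ratio from Python's difflib merely stands in for a "traditional document comparison method".

```python
import difflib
import xml.etree.ElementTree as ET

# Hypothetical document A: defines one circle once and reuses it via <use>.
SVG_WITH_REUSE = """<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="200" height="100">
  <defs>
    <circle id="dot" cx="0" cy="0" r="20"/>
  </defs>
  <use xlink:href="#dot" x="50" y="50" fill="red"/>
  <use xlink:href="#dot" x="150" y="50" fill="blue"/>
</svg>"""

# Hypothetical document B: draws the same two circles explicitly.
SVG_EXPLICIT = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <circle cx="50" cy="50" r="20" fill="red"/>
  <circle cx="150" cy="50" r="20" fill="blue"/>
</svg>"""


def text_similarity(a, b):
    """Character-level similarity in [0, 1] between two strings."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def tag_names(svg_text):
    """Sorted element names with the XML namespace prefix stripped."""
    root = ET.fromstring(svg_text)
    return sorted(el.tag.split("}")[-1] for el in root.iter())


if __name__ == "__main__":
    print("text similarity:", round(text_similarity(SVG_WITH_REUSE, SVG_EXPLICIT), 3))
    print("elements in A:", tag_names(SVG_WITH_REUSE))
    print("elements in B:", tag_names(SVG_EXPLICIT))
```

A purely textual or structural comparison sees different element sets and a similarity below 1.0, even though the two documents are meant to render the same picture.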
FIGS. 2(a) and 2(b), on the other hand, show a situation where the SVG textual contents are similar to one another, but the documents themselves possess very different visual appearances. The only difference between the two documents in terms of the underlying SVG text is the attribute style="visibility:hidden". However, this small difference makes the ultimate images look quite different visually.
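The opposite case can be sketched in the same way. In the hypothetical pair below (again not the documents of FIGS. 2(a) and 2(b)), the second document differs from the first only by a style="visibility:hidden" attribute on a group, which would hide most of the drawing when rendered. The names VISIBLE, HIDDEN, text_similarity, and changed_lines are assumptions made for this illustration.

```python
import difflib

# Hypothetical document: a group with two large shapes plus one small dot.
VISIBLE = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <g>
    <rect x="10" y="10" width="180" height="80" fill="green"/>
    <circle cx="100" cy="50" r="30" fill="yellow"/>
  </g>
  <circle cx="20" cy="20" r="5" fill="black"/>
</svg>"""

# Same text, except that the group is hidden; only the small dot stays visible.
HIDDEN = VISIBLE.replace("<g>", '<g style="visibility:hidden">', 1)


def text_similarity(a, b):
    """Character-level similarity in [0, 1] between two strings."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def changed_lines(a, b):
    """Lines removed from a or added in b, as reported by a line-based diff."""
    diff = difflib.ndiff(a.splitlines(), b.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    print("text similarity:", round(text_similarity(VISIBLE, HIDDEN), 3))
    for line in changed_lines(VISIBLE, HIDDEN):
        print(line)
```

A text-based comparison reports a single changed line and a high similarity score, which says nothing about how different the rendered images would be.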
Although SVG is considered a promising XML-based language for 2D graphics, potentially opening up a whole host of possibilities for new consumer and enterprise services, there has thus far been relatively little progress in optimizing SVG in these different applications.
Several methods and tools have been previously developed for computing similarity among XML documents. For example, one tool called “XML Diff” detects structural changes in the XML sub-trees and produces a Diffgram to describe the differences between the two sub-trees. A second method involves the use of a matching algorithm for measuring the structural similarity between an XML document and a DTD. A third approach involves linearizing the structure of each XML document by representing it as a numerical sequence, and then comparing the sequences through the analysis of their frequencies. A fourth approach involves a structural similarity metric for XML documents based upon an “XML aware” edit distance between ordered labeled trees to cluster documents by DTD. A fifth method represents documents in vector form based upon their structure and then measures the similarity between the vectors. This method is used to obtain the measure of structural similarity between two given documents and is discussed in United States Application Publication No. 2005/0038785. However, none of these systems focuses on the problem of differences between the underlying content and its visual representation, as XML by itself is not visual, while SVG is a special form of XML content.
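As a rough illustration of the general idea behind vector-based structural comparison, the following Python sketch counts element names in each document and compares the resulting count vectors with cosine similarity. This is an assumption-laden stand-in chosen for brevity, not the method of the cited publication or of any of the tools mentioned above; the function names and the sample documents are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_vector(xml_text):
    """Represent a document's structure as a count of each element name."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def structural_similarity(xml_a, xml_b):
    """Structural similarity of two XML documents based on element counts."""
    return cosine_similarity(tag_vector(xml_a), tag_vector(xml_b))


# Hypothetical sample documents.
DOC_A = "<order><item/><item/><note/></order>"
DOC_B = "<order><item/><note/><note/></order>"
DOC_C = "<log><entry/><entry/></log>"

if __name__ == "__main__":
    print(structural_similarity(DOC_A, DOC_B))  # shared vocabulary, different counts
    print(structural_similarity(DOC_A, DOC_C))  # no element names in common
```

Such a measure looks only at markup structure, so for SVG it would inherit the limitation noted above: it cannot tell whether two documents look alike when rendered.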
In addition to the above, there are also several methods for compressing XML content based upon certain optimizations for removing redundant patterns. One such system involves a new XML compression scheme that is based upon the Sequitur compression algorithm to remove excessive information redundancy in its representation. By organizing the compression result as a set of context free grammar rules, the scheme supports processing of XPath queries without decompression. Another approach involves a tool for compressing XML data, with applications in data exchange and archiving, which usually achieves about twice the compression ratio of gzip at roughly the same speed. The compressor, referred to as XMill, incorporates and combines existing compressors in order to apply them to heterogeneous XML data. XMill uses zlib, the library function for gzip, a collection of datatype specific compressors for simple data types, as well as user-defined compressors for application specific data.
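To give a sense of the redundancy that such schemes exploit, the sketch below compresses a highly repetitive, hypothetical XML string with Python's zlib module, the same general-purpose library family that XMill is described as building on. It shows only a plain zlib baseline; it does not implement XMill or the Sequitur-based scheme, and the sample data and the compression_ratio helper are assumptions made for illustration.

```python
import zlib

# Hypothetical, highly repetitive XML payload.
record = '<point><x unit="px">10</x><y unit="px">20</y></point>'
xml_text = "<points>" + record * 200 + "</points>"

raw = xml_text.encode("utf-8")
compressed = zlib.compress(raw, 9)  # 9 = highest standard compression level


def compression_ratio(original, packed):
    """Size of the original data divided by the size of the compressed data."""
    return len(original) / len(packed)


if __name__ == "__main__":
    print("raw bytes:", len(raw))
    print("compressed bytes:", len(compressed))
    print("ratio:", round(compression_ratio(raw, compressed), 1))
```

The repeated tags and attribute names account for most of the size reduction, which is the kind of redundancy that XML-specific compressors aim to remove more aggressively.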