1. Field of Invention
This invention relates to the identification and retrieval of sequences of digital data (SDDs) by a computing device.
2. Prior Art
A method for the discovery of SDDs that are similar to a target SDD is invented here. Formulae from algebraic topology [Spanier] are used to compute signatures that characterize equivalence classes of SDDs. The method leverages these “equivalence signatures” to find SDDs that are similar to target SDDs and, separately and alternatively, find SDDs that are dissimilar from the target SDDs.
The definition of “similarity”, and thus the features and method used to compute it, is idiosyncratic to the retrieval application [O'Connor]. As examples of state-of-the-art methods for detecting similar music, one recent invention (U.S. Pat. No. 7,022,905) uses subjective meta-data as its features in the retrieval of music, while another (U.S. Pat. No. 7,031,980) uses k-means clustering and beat signatures as its features. A third invention (U.S. Pat. No. 7,246,314) uses closeness to a Gaussian model as a similarity measure for identifying similar videos. Yet another example (U.S. Pat. No. 7,010,515) compares histograms of text elements to determine the similarity of bodies of text. In the case of image retrieval [Gonzalez], methods using entropy, moments, etc. as signatures have been invented (U.S. Pat. Nos. 5,933,823; 5,442,716). Work in computer graphics has advanced these analytical methods by using an elementary result from topology, the Euler number of polyhedra, as a descriptor of boundary polygons of graphics objects [Foley]. Recently, a method for computing the Euler numbers of binary images using a chip design has been invented (U.S. Pat. No. 7,027,649).
The cost of implementing these methods is typically proportional to the product of the number of SDDs in the database and the cost of computing the distance between the target SDD and another SDD. The latter often involves the computation of the projection angle between two vectors that represent the features (e.g., histogram of the text elements) of the SDDs. For large databases, this process can be expensive in both resources and time. A two-step method is required wherein the number of candidates for similarity is significantly reduced in a computationally inexpensive first step, after which the traditional features are applied to the reduced set of candidates.
Intuitively, if two SDDs are similar, then they should be deformable into each other without having to remove or glue together portions of SDDs. For example, in audio applications, if the amplitudes of two subsequences are rescalings of each other or if the phases of the subsequences are shifts of each other, then the subsequences are similar. The field of topology provides a foundation for solving this problem. In particular, we appeal to homotopy invariants that characterize equivalence classes of maps between topological spaces [Bott].
We interpret each SDD as a sampling of maps from an interval of the real line (the world space) to an n-dimensional topological space and seek homotopy equivalence classes of such maps. Following standard techniques, such as adding an extra point to the end of the interval and identifying the value of the map at that point with its value at the first point of the interval, we turn the interval into a circle. As SDDs typically contain defined subsequences (e.g., natural language words or phrases, file section markers, etc.), we take the normalized form of the digital data for each subsequence to be the values of the exponent, $\varphi^{(i)}$, in the exponential map $e^{i\varphi^{(i)}}: S^1 \to S^1$ for the $i$th subsequence. We then compute the value in the Fundamental Group, $\pi_1(S^1)$, for each map. If two subsequences of digital data do not have the same value of $\pi_1(S^1)$, then they cannot be continuously deformed into each other and are thus not similar. If none of the subsequences of two SDDs are similar to each other, then those SDDs are not similar to each other.
The calculation of the equivalence signature consists of two steps. In the first step, the value of $\pi_1(S^1)$ for each subsequence of digital data is computed as [Schwarz]

$$S_{\pi_1}[\varphi^{(i)}] = \frac{1}{2\pi}\int_0^{2\pi} d\theta\, \frac{d\varphi^{(i)}}{d\theta}, \qquad \text{Eqn. 1}$$

where the world space coordinate, $\sigma$, of each data element in the subsequence of digital data is used to define the angle on the circle by

$$\theta = \frac{2\pi\sigma}{L},$$

where $L$ is the number of elements in the SDD.
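A discrete analogue of Eqn. 1 can be sketched as follows. This is only an illustration under stated assumptions, not the method as claimed: the function name `winding_number` is hypothetical, the subsequence is assumed to be already normalized to phase values in radians, and the step between neighbouring samples is assumed to be smaller than half a turn so that each step can be wrapped unambiguously.

```python
import math


def winding_number(phases):
    """Approximate Eqn. 1 for one subsequence of sampled phases (radians).

    The sequence is closed into a circle by identifying the point after the
    last sample with the first sample. Assumption: neighbouring samples
    differ by less than pi, so each step can be wrapped unambiguously.
    """
    n = len(phases)
    total = 0.0
    for k in range(n):
        # Difference to the next sample, wrapping around at the end.
        step = phases[(k + 1) % n] - phases[k]
        # Wrap the step into [-pi, pi) so that 2*pi jumps do not count.
        step = (step + math.pi) % (2 * math.pi) - math.pi
        total += step
    # The accumulated phase change is a whole number of turns.
    return round(total / (2 * math.pi))


# Usage: eight samples of a phase that advances by one full turn.
one_turn = [2 * math.pi * k / 8 for k in range(8)]
print(winding_number(one_turn))
```

Summing wrapped differences is one common way to discretize the integral of a derivative; other normalizations of the digital data would lead to a different sampling of the phase.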
Next, we use the value of $\pi_1(S^1)$ for each of the $N_s$ subsequences of the digital data to compute the equivalence signature, $\xi[\varphi]$, for the entire SDD as:

$$\xi[\varphi] \equiv \sum_{i=1}^{N_s} \xi_i[\varphi^{(i)}], \qquad \xi_i[\varphi^{(i)}] \equiv e^{-i\pi S_{\pi_1}[\varphi^{(i)}]}. \qquad \text{Eqn. 2}$$
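Eqn. 2 can be illustrated with a short sketch. The function name `equivalence_signature` is hypothetical, and the per-subsequence values of Eqn. 1 are assumed to have been computed already and passed in as a list.

```python
import cmath
import math


def equivalence_signature(winding_numbers):
    """Sum exp(-i * pi * S) over the per-subsequence values S (Eqn. 2)."""
    return sum(cmath.exp(-1j * math.pi * s) for s in winding_numbers)


# Usage: three subsequences with values 0, 1 and 2.
signature = equivalence_signature([0, 1, 2])
print(signature)
```

When every value of Eqn. 1 is an integer, each term equals +1 or -1, so the signature is real up to floating-point rounding.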
Consider two SDDs, $\varphi$ and $\varphi'$, partitioned as $\{\varphi^{(i)}\}$ and $\{\varphi'^{(i)}\}$, respectively. By construction, if a subsequence of digital data, $\varphi^{(i)}$, is related to another subsequence of digital data, $\varphi'^{(i)}$, by the addition of a third SDD, $\alpha^{(i)}$, so that

$$\varphi^{(i)} \to \varphi'^{(i)} = \varphi^{(i)} + \alpha^{(i)}, \qquad \text{Eqn. 3}$$

then, as long as the values of the $\alpha^{(i)}$ at the endpoints are the same for each partition, the values of the equivalence signatures will be the same: $\xi[\varphi'] = \xi[\varphi]$. If, on the other hand, the values of the $\alpha^{(i)}$ at the endpoints are not the same for each partition, then the difference in the values of the equivalence signatures is bounded by the number, $N_\delta$, of subsequences that are different:

$$-N_\delta \le \xi[\varphi'] - \xi[\varphi] \le N_\delta. \qquad \text{Eqn. 4}$$
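The invariance under Eqn. 3 can be checked directly from Eqn. 1. The following short derivation is a sketch that assumes each added sequence is differentiable in the angle on the circle, an assumption not stated in the text:

```latex
\begin{aligned}
S_{\pi_1}[\varphi'^{(i)}]
  &= \frac{1}{2\pi}\int_0^{2\pi} d\theta\,
     \frac{d\left(\varphi^{(i)} + \alpha^{(i)}\right)}{d\theta} \\
  &= S_{\pi_1}[\varphi^{(i)}]
     + \frac{1}{2\pi}\left[\alpha^{(i)}(2\pi) - \alpha^{(i)}(0)\right].
\end{aligned}
```

When the added sequence takes the same value at both endpoints, the bracketed term vanishes, so each term of Eqn. 2 is unchanged and the equivalence signatures of the two SDDs agree.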
As an example of the reduction factor for the number of CPU cycles and other resources required to find similar SDDs in a corpus, assume for simplicity that $N_\delta = 0$ and that the equivalence signatures of the SDDs in the corpus are uniformly distributed over their possible values. Then the reduction in the number of secondary features to be compared is

$$\left(\frac{1}{N_s + 1}\right).$$

Thus, for a corpus of text documents with ten words per sentence on average, wherein we are interested in finding text documents that contain the words that are in a target sentence, irrespective of the ordering of the words, we will have roughly a factor of ten reduction in the number of secondary feature comparisons as compared to the state of the art. In particular, without the use of the method invented here:
    in the case where a term vector is used as the characteristic feature of each SDD, the term vector of the target would have to be compared to all term vectors computed for the SDDs in the corpus, or
    in the case where a cryptographic hash is used as the characteristic feature of each SDD, there would be a hash for each of the possible $N_s!$ orderings of the words in the target SDD, and a check for the equality of each of these hashes with the hash of each SDD in the corpus would have to be done.
In this case, the method invented here reduces the number of executions of these computations by the aforementioned factor.
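The two-step retrieval described above can be sketched as follows, for the simplified case in which no subsequences differ, so that candidate SDDs have exactly the same signature as the target. All names here (`build_signature_index`, `find_similar`, `cosine_similarity`, the `threshold` parameter) are hypothetical, signatures are assumed to be stored as integers, and cosine similarity of term vectors stands in for whichever secondary feature an application actually uses.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_signature_index(signatures):
    """Map each signature value to the ids of the SDDs that carry it."""
    index = defaultdict(list)
    for sdd_id, signature in signatures.items():
        index[signature].append(sdd_id)
    return index


def find_similar(target_signature, target_vector, index, vectors, threshold=0.9):
    """Return ids of SDDs that pass both the cheap and the expensive step."""
    # Step 1 (inexpensive): keep only SDDs whose signature equals the target's.
    candidates = index.get(target_signature, [])
    # Step 2 (expensive): apply the secondary feature to the survivors only.
    return [
        sdd_id
        for sdd_id in candidates
        if cosine_similarity(target_vector, vectors[sdd_id]) >= threshold
    ]
```

With ten subsequences per SDD and uniformly distributed signatures, the first step would be expected to pass about 1/(10 + 1) of the corpus to the second step, which is the factor of roughly ten stated above.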