1. Field of the Invention
This invention relates to the identification and retrieval of digital data by a computing device.
2. Prior Art
A method for finding similar sets of digital data (SDD), such as text, binaries, audio, images, graphics, video, and the like, that have been assigned equivalence signatures is invented here. The determination of similarity leverages a metric on the space of equivalence signatures that manifests the isometries of that space. By design, similar sets of data will have the same lengths on the space of equivalence signatures.
The definition of “similarity”, and thus the features and method used to compute it, is idiosyncratic to the retrieval application [O'Connor]. In the case of image retrieval [Gonzalez], methods using entropy, moments, etc. as signatures have been invented [U.S. Pat. Nos. 5,933,823; 5,442,716]. Another invention [U.S. Pat. No. 7,246,314] uses closeness to a Gaussian model as a similarity measure for identifying similar videos. Methods for the assignment of signatures, for all media types, which change in a bounded manner when small alterations are made to an SDD have also been invented [Brooks]. The latter are also invariant to certain classes of transformations of the data, which are typical of the changes between similar sets of data, leading to equivalence classes of sets of data. Thus those signatures are referred to as “equivalence signatures”.
The cost of implementing these methods is typically proportional to the product of the number of SDDs in the database and the cost of computing the distance between the target SDD and another SDD. The latter often [Raghavan] involves the computation of the projection angle between two vectors that represent the features (e.g., histogram of the text elements) of the SDDs. For large databases, this process can be expensive in both resources and time. A two-step method is required wherein, during the retrieval phase, definitely dissimilar SDDs are first weeded out, thereby significantly reducing the number of candidates for similarity. This first step should be computationally inexpensive, thus significantly reducing the resource requirements and latency in computing the results of the second step, the application of traditional features.
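By way of illustration only, the two-step retrieval described above can be sketched in Python as follows. The scalar `key` stored with each SDD, the tolerance `key_tolerance`, the threshold `min_similarity`, and the function names are hypothetical and are introduced only for this sketch. The second step uses the projection-angle (cosine) comparison of feature vectors mentioned above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the projection angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def two_step_retrieve(target_key, target_features, corpus,
                      key_tolerance, min_similarity):
    """corpus is a list of (sdd_id, key, features) tuples.

    Step 1 (cheap): keep only SDDs whose scalar key is close to the target's.
    Step 2 (costly): apply the feature comparison to the survivors only.
    """
    # Step 1: one subtraction and comparison per SDD (hypothetical scalar key).
    survivors = [(sdd_id, features)
                 for sdd_id, key, features in corpus
                 if abs(key - target_key) <= key_tolerance]

    # Step 2: traditional feature comparison on the reduced candidate set.
    scored = [(sdd_id, cosine_similarity(target_features, features))
              for sdd_id, features in survivors]
    matches = [(sdd_id, s) for sdd_id, s in scored if s >= min_similarity]
    return sorted(matches, key=lambda pair: -pair[1])


corpus = [
    ("a", 1.00, [1.0, 0.0]),
    ("b", 1.05, [1.0, 0.1]),
    ("c", 5.00, [1.0, 0.0]),  # identical features, but rejected in step 1
]
result = two_step_retrieve(1.0, [1.0, 0.0], corpus,
                           key_tolerance=0.1, min_similarity=0.9)
```

In this sketch the cost of the second step scales with the number of survivors rather than with the size of the corpus, which is the saving the two-step method is intended to provide.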
Intuitively, if two SDDs are similar, then they should be locally deformable into each other. For example, if two images are rescalings of each other, then they are similar. This invention leverages elementary results from the differential geometry of symmetric spaces [Helgason] to address this problem. In particular, we use the fact that the equivalence signatures change by diffeomorphisms when infinitesimal changes, that do not leave the equivalence signatures invariant, are made to the SDDs. A class of metrics for which these diffeomorphisms are isometries is used to compute the distance on the space of signatures.
We interpret each SDD as a sampling of maps from an n-dimensional space, N, with coordinates $(\theta^1, \theta^2, \ldots, \theta^n)$, or collectively $\theta$, to an m-dimensional space, M, with coordinates $\sigma^A(\theta)$, for $A = 1, \ldots, m$. Each $\sigma^A(\theta)$ is referred to as a plane; for example, images typically have three color planes of data, red, green and blue. If objects have been segmented from the SDD then the data for these objects are themselves SDDs.
For each SDD, a point on the space of equivalence signatures is computed. Each equivalence signature, $\xi[\sigma]$, is a functional of the SDD. Consider an infinitesimal variation of an SDD,

$$\sigma^A(\theta) \rightarrow \sigma_1^A(\theta) = \sigma^A(\theta) + \varepsilon\rho^A(\theta), \qquad \text{Eqn. 1}$$

where the $\sigma_1^A$ are the planes of the SDD formed as a result of the changes to the original SDD's planes $\sigma^A$, $\varepsilon$ is an infinitesimal constant such that contributions of order $\varepsilon^2$ can be neglected, and the $\rho^A(\theta)$ are functions which represent the changes to the original SDD at each point in the presentation space. Under these changes, the equivalence signatures transform into functions on the space of equivalence signatures,

$$\xi^\alpha[\sigma + \varepsilon\rho] = \xi^\alpha[\sigma] + \varepsilon f^\alpha(\xi[\sigma]). \qquad \text{Eqn. 2}$$
Here $\alpha$ is a label for the coordinates in the r-dimensional space of equivalence signatures, $\alpha = 1, \ldots, r$. The $\xi^\alpha[\sigma]$ are points on the space of equivalence signatures. The vectors to those points from the origin of the latter space are equivalence signature vectors.
As an example, take the equivalence signature for the dynamics of data in two-dimensional presentation spaces [Brooks] with (summation over repeated indices is implied)
        width and height $L_1$ and $L_2$, respectively,
        a configurable metric, $g_{ij}(\theta)$, on the presentation space,
        $g$ being the determinant of the metric $g_{ij}$,
        a configurable metric, $G_{AB}(\sigma)$, on the data space, and
        $\partial_i = \partial/\partial\theta^i$ for $i = 1, 2$:

$$\xi[\sigma] \equiv \int_0^{L_1} d\theta^1 \int_0^{L_2} d\theta^2\, \sqrt{g}\, g^{ij}\, G_{AB}(\sigma)\, \partial_i\sigma^A\, \partial_j\sigma^B. \qquad \text{Eqn. 3}$$
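As a hedged numerical illustration of Eqn. 3, the sketch below assumes the simplest choice of both metrics, $g_{ij} = \delta_{ij}$ and $G_{AB} = \delta_{AB}$, and approximates the derivatives by forward differences on a regular sampling grid. The function name, the grid layout, and the discretization are assumptions made for this example and are not taken from [Brooks].

```python
def signature_flat_metrics(planes, L1, L2):
    """Approximate Eqn. 3 with g_ij = delta_ij and G_AB = delta_AB (assumed).

    planes[A][i][j] samples sigma^A at theta^1 = i*h1, theta^2 = j*h2,
    where h1 and h2 are the grid spacings along the width and height.
    """
    n1 = len(planes[0])
    n2 = len(planes[0][0])
    h1 = L1 / (n1 - 1)
    h2 = L2 / (n2 - 1)

    total = 0.0
    for plane in planes:                      # sum over the plane index A
        for i in range(n1 - 1):
            for j in range(n2 - 1):
                # Forward-difference estimates of d(sigma)/d(theta^1), d(theta^2).
                d1 = (plane[i + 1][j] - plane[i][j]) / h1
                d2 = (plane[i][j + 1] - plane[i][j]) / h2
                # Integrand times the area element of one grid cell.
                total += (d1 * d1 + d2 * d2) * h1 * h2
    return total


# A single plane that ramps linearly along theta^1 on a 5 x 5 grid.
ramp = [[0.25 * i for _ in range(5)] for i in range(5)]

xi_original = signature_flat_metrics([ramp], L1=1.0, L2=1.0)
xi_rescaled = signature_flat_metrics([ramp], L1=2.0, L2=2.0)
```

Under these assumed flat metrics, presenting the same samples at twice the width and height leaves the computed value unchanged, because the factor gained by the area element cancels the factor lost by the squared derivatives. This is an instance of the rescaling example mentioned earlier.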
Note that different choices for the metric, $G_{AB}$, lead to different equivalence signatures for the same SDD. Applying the change in Eqn. 1, this equivalence signature changes to an expression of the form given in Eqn. 2, with the function $f$ given by its argument, $f(\xi[\sigma]) = \xi[\sigma]$, but with a new metric on the data space, as $G_{AB}(\sigma)$ is replaced by
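$$G_{AB}(\sigma) \rightarrow \frac{1}{2}\left[ G_{AC}(\sigma)\,\partial_B\rho^C(\sigma) + G_{BC}(\sigma)\,\partial_A\rho^C(\sigma) + \rho^C(\sigma)\,\partial_C G_{AB}(\sigma) \right], \quad \text{where } \partial_A = \frac{\partial}{\partial\sigma^A(\theta)}. \qquad \text{Eqn. 4}$$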
The transformation in Eqn. 2 is a diffeomorphism of the space of signatures. If we have a metric on the space of equivalence signatures that is preserved by such a diffeomorphism, then the distances of the points in that space that differ by the change in Eqn. 1 will be the same. Distances in Euclidean space are preserved under the full set of isometries of the space. Thus SDDs whose equivalence signatures are at points in the space of equivalence signatures that are equidistant from the origin of that space are similar. Euclidean spaces are the best-known example of maximally symmetric spaces. The metrics of, and hence distances on, maximally symmetric spaces are preserved under the full set of isometries of the space. Here we parameterize such metrics in terms of a constant, $K$, and a constant $r \times r$ matrix $K_{\alpha\beta}$ as [Weinberg]
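$$C_{\alpha\beta}(\xi[\sigma]) \equiv K_{\alpha\beta} + K\,\frac{K_{\gamma\alpha}\,\xi^\gamma[\sigma]\,K_{\delta\beta}\,\xi^\delta[\sigma]}{1 - K\,K_{\mu\tau}\,\xi^\mu[\sigma]\,\xi^\tau[\sigma]}. \qquad \text{Eqn. 5}$$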
For simplicity we will choose $K_{\alpha\beta} = \delta_{\alpha\beta}$; the case $K = 0$ then yields Euclidean space. In terms of the metric $C_{\alpha\beta}$, the distance in the space of equivalence signatures is
$$D = \int d\tau\, \sqrt{C_{\alpha\beta}(\xi[\sigma])\,\frac{d\xi^\alpha[\sigma]}{d\tau}\,\frac{d\xi^\beta[\sigma]}{d\tau}}, \qquad \text{Eqn. 6}$$

in terms of a dummy path parameter $\tau$. This distance is invariant under an $r(r+1)/2$-parameter group of isometries (see Ref. [Weinberg] for examples):
        rigid rotations of the signatures about the origin in signature space, and
        restricted local translations of the signatures.
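As an illustrative sketch only, the distance of a signature from the origin can be evaluated under two assumptions that go beyond the text above: the choice $K_{\alpha\beta} = \delta_{\alpha\beta}$, and integration of Eqn. 6 along the straight radial line from the origin to the signature. Along that line the integrand reduces to $1/\sqrt{1 - K s^2}$, where $s$ is the Euclidean radius, giving $\arcsin(\sqrt{K}R)/\sqrt{K}$ for $K > 0$, $R$ for $K = 0$, and $\operatorname{arcsinh}(\sqrt{-K}R)/\sqrt{-K}$ for $K < 0$. The function names are hypothetical; the numerical routine is included only as a cross-check of the closed form.

```python
import math


def metric(xi, K):
    """Eqn. 5 with K_ab = delta_ab (assumed): C = delta + K xi xi^T / (1 - K |xi|^2)."""
    r = len(xi)
    denom = 1.0 - K * sum(x * x for x in xi)
    return [[(1.0 if a == b else 0.0) + K * xi[a] * xi[b] / denom
             for b in range(r)] for a in range(r)]


def distance_from_origin(xi, K=0.0):
    """Closed-form radial distance; depends on xi only through its norm R."""
    R = math.sqrt(sum(x * x for x in xi))
    if K == 0.0:
        return R
    if K > 0.0:
        a = math.sqrt(K) * R
        if a >= 1.0:
            raise ValueError("signature lies outside the region where Eqn. 5 is regular")
        return math.asin(a) / math.sqrt(K)
    return math.asinh(math.sqrt(-K) * R) / math.sqrt(-K)


def radial_distance_numeric(xi, K, steps=2000):
    """Midpoint-rule evaluation of Eqn. 6 along the path xi(tau) = tau * xi."""
    r = len(xi)
    total = 0.0
    for n in range(steps):
        tau = (n + 0.5) / steps
        C = metric([tau * x for x in xi], K)
        # d(xi)/d(tau) = xi along this path.
        speed_sq = sum(C[a][b] * xi[a] * xi[b]
                       for a in range(r) for b in range(r))
        total += math.sqrt(speed_sq) / steps
    return total


d_a = distance_from_origin([3.0, 4.0], K=0.01)
d_b = distance_from_origin([5.0, 0.0], K=0.01)   # a rigid rotation of [3, 4]
d_num = radial_distance_numeric([3.0, 4.0], K=0.01)
```

Because the closed form depends on the signature only through its norm, signatures related by a rigid rotation about the origin receive the same distance, consistent with the first class of isometries listed above.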
Let the SDDs in a corpus be such that their points in the space of equivalence signatures are uniformly distributed. Furthermore, let the distances of those points from the origin fall between $D_{\min}$ and $D_{\max}$. Then the number of candidate similar SDDs is reduced from the size of the corpus by a factor of

$$\frac{1}{1 + D_{\max} - D_{\min}}.$$

Similar SDDs lie on the surface of a sphere whose radius is the length of any representative member of the set of similar SDDs.
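The reduction factor above can be illustrated with a small sketch. It assumes, as an interpretation not stated in the text, that distances are grouped into shells of unit width, so that the range from $D_{\min}$ to $D_{\max}$ spans $1 + D_{\max} - D_{\min}$ shells and a target is compared only with SDDs in its own shell. The function names and the `width` parameter are hypothetical.

```python
import math
from collections import defaultdict


def build_shell_index(distances, width=1.0):
    """Group SDD ids by the shell (distance bin) that contains them."""
    index = defaultdict(list)
    for sdd_id, d in enumerate(distances):
        index[math.floor(d / width)].append(sdd_id)
    return index


def shell_candidates(index, target_distance, width=1.0):
    """Return the ids of SDDs lying in the same shell as the target."""
    return index.get(math.floor(target_distance / width), [])


def reduction_factor(d_min, d_max):
    """The factor 1 / (1 + D_max - D_min) quoted above."""
    return 1.0 / (1.0 + d_max - d_min)


# Ten SDDs whose distances from the origin are spread evenly, one per shell.
distances = [0.5 + i for i in range(10)]
index = build_shell_index(distances)

candidates = shell_candidates(index, target_distance=3.5)
observed = len(candidates) / len(distances)
predicted = reduction_factor(min(distances), max(distances))
```

For this evenly spread corpus, the observed fraction of surviving candidates matches the quoted factor. A corpus whose distances cluster in a few shells would see less reduction.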