1. Field of the Invention
This invention relates to the identification and retrieval of digital data by a computing device.
2. Prior Art
This invention provides a method for discovering sets of digital data (SDDs), such as text, binaries, audio channels, and the like, that are organized for point-wise presentation in one dimension and that are similar to a target SDD. Formulae for the dynamics of the paths swept out by the data are used as signatures that characterize equivalence classes of SDDs with the same or numerically close data. The method leverages these “equivalence signatures” to find SDDs that are similar to target SDDs and, separately and alternatively, to find SDDs that are dissimilar from the target SDDs.
The definition of “similarity”, and thus the features and method used to compute it, is idiosyncratic to the retrieval application [O'Connor]. In the case of image retrieval [Gonzalez], methods using entropy, moments, etc., as signatures have been invented [U.S. Pat. Nos. 5,933,823; 5,442,716]. Another invention [U.S. Pat. No. 7,246,314] uses closeness to a Gaussian model as a similarity measure for identifying similar videos.
The cost of implementing these methods is typically proportional to the product of the number of SDDs in the database with the cost of computing the distance between the target SDD and another SDD. The latter often [Raghavan] involves the computation of the projection angle between two vectors that represent the features (e.g., histogram of the text elements) of the SDDs. For large databases, this process can be expensive in both resources and time. A two-step method is therefore required wherein, during the retrieval phase, definitely dissimilar SDDs are first weeded out, thereby significantly reducing the number of candidates for similarity. This first step should be computationally inexpensive, thus significantly reducing the resource requirements and latency in computing the results of the second step, the application of traditional features.
Intuitively, if two SDDs are similar, then they should be locally deformable into each other. For example, if two audio channels are rescalings of each other, then the audio channels are similar.
This invention leverages results from Classical Mechanics [Abraham] and the differential geometry of symmetric spaces [Helgason] to address this problem. In particular, we appeal to field theory representations for the functional for the motion of a point-particle in the space swept out by the SDD when stepping through the presentation space. By construction, these lengths are invariant under reparameterizations of the presentation space and thus characterize equivalence classes of length preserving maps between the presentation and data spaces.
We interpret each SDD as a sampling of maps from a one-dimensional space, $N$, with coordinate $\theta$, to an $m$-dimensional space, $M$, with coordinates $\sigma^A(\theta)$, for $A = 1, \ldots, m$, and seek length-preserving equivalence classes of such maps. We label the length of the presentation space dimension as $L$.
Let the raw data, $\tilde{\sigma}^A(\theta)$, of each SDD be organized into $m$ data planes, e.g., two PCM channels of stereo audio, for presentation, and let each plane have a maximum and minimum value for the data in that plane, $\tilde{\sigma}^A_{\max}$ and $\tilde{\sigma}^A_{\min}$, respectively. The maximum and minimum values of each of the $m$ planes are used to normalize their data to new maximum and minimum values, $\sigma^A_{\max}$ and $\sigma^A_{\min}$ respectively, through the expression:

$$\sigma^A(\theta) = \left[\frac{\sigma^A_{\max} - \sigma^A_{\min}}{\tilde{\sigma}^A_{\max} - \tilde{\sigma}^A_{\min}}\right]\left[\tilde{\sigma}^A(\theta) - \tilde{\sigma}^A_{\max}\right] + \sigma^A_{\max} \qquad \text{Eqn. 1}$$
Additional normalizations of the SDD, such as scaling to a fixed length and the like, may also be performed.
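By way of illustration only, the normalization of Eqn. 1 may be sketched in Python with NumPy as follows. The function name `normalize_planes`, the array layout (one row per data plane, one column per sample), and the default target range of -1 to 1 are assumptions made for this sketch and are not prescribed by the method.

```python
import numpy as np

def normalize_planes(raw, new_min=-1.0, new_max=1.0):
    """Rescale each data plane (row) of `raw` to [new_min, new_max] using Eqn. 1.

    raw: array of shape (m, n), m planes with n samples each (assumed layout).
    """
    raw = np.asarray(raw, dtype=float)
    raw_min = raw.min(axis=1, keepdims=True)   # per-plane raw minimum
    raw_max = raw.max(axis=1, keepdims=True)   # per-plane raw maximum
    span = raw_max - raw_min
    if np.any(span == 0):
        # Eqn. 1 divides by the raw range, so a constant plane has no rescaling.
        raise ValueError("a constant plane cannot be rescaled with Eqn. 1")
    scale = (new_max - new_min) / span
    # Eqn. 1: scale * (raw - raw_max) + new_max
    return scale * (raw - raw_max) + new_max

# Two hypothetical planes with different raw ranges.
planes = normalize_planes([[0.0, 5.0, 10.0], [-2.0, 0.0, 2.0]])
```

After this step, every plane spans the same range, so that planes recorded with different gains or bit depths contribute on a comparable scale.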
If objects have been segmented from the SDD, then the data for these objects are themselves SDDs. We henceforth refer to each segmented portion as an “SDD section” with its own map, $\sigma$.
The equivalence signature is the functional for the motion of a point-particle in a space with metric $G_{AB}(\sigma)$ [Weinberg]:

$$\xi[\sigma] \equiv \int_0^L d\theta\, e^{-1} \sum_{A,B=1}^{m} G_{AB}(\sigma)\, \frac{d\sigma^A}{d\theta}\, \frac{d\sigma^B}{d\theta} \qquad \text{Eqn. 2}$$

where we are free to choose $G_{AB}(\sigma)$ as any metric for the data space, as well as the einbein, $e(\theta)$, on the presentation space. Once the choice of the metric is made, however, the chosen metric must be used in all computations of equivalence signatures that are to be compared to deduce the degree of similarity of their respective data. The metric used in the primary embodiment of this invention is defined in terms of a constant, $K$, and a constant $m \times m$ matrix $C_{AB}$ as
$$G_{AB}(\sigma) \equiv C_{AB} + \frac{K\, C_{CA}\sigma^C\, C_{DB}\sigma^D}{1 - K\, C_{EF}\sigma^E\sigma^F} \qquad \text{Eqn. 3}$$
For simplicity we will later choose $C_{AB} = \delta_{AB}$ and consider the cases where $K = 0$ as well as $K = -1$. We will also take $e(\theta) = 1$, thus making the presentation space Euclidean.
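As a non-limiting sketch, the equivalence signature of Eqn. 2 with the metric of Eqn. 3 and e(θ) = 1 may be approximated from sampled data by finite differences. The function name `equivalence_signature`, the uniform sampling of θ over [0, L], and the midpoint discretization are assumptions of this sketch; the method itself does not prescribe a numerical scheme.

```python
import numpy as np

def equivalence_signature(sigma, length=1.0, K=0.0, C=None):
    """Approximate Eqn. 2 (with e = 1) for samples `sigma` of shape (m, n).

    Assumes the n samples are uniformly spaced over [0, length].
    C defaults to the identity matrix (C_AB = delta_AB).
    """
    sigma = np.asarray(sigma, dtype=float)
    m, n = sigma.shape
    if n < 2:
        raise ValueError("at least two samples are needed")
    C = np.eye(m) if C is None else np.asarray(C, dtype=float)
    dtheta = length / (n - 1)
    mid = 0.5 * (sigma[:, 1:] + sigma[:, :-1])     # sigma at interval midpoints
    vel = np.diff(sigma, axis=1) / dtheta          # d(sigma^A)/d(theta)
    # First term of Eqn. 3 contracted with the velocities: C_AB v^A v^B
    flat = np.einsum("ai,ab,bi->i", vel, C, vel)
    # Second term: K (C_CA sigma^C v^A)(C_DB sigma^D v^B) / (1 - K C_EF sigma^E sigma^F)
    u = C.T @ mid                                  # u_A = C_CA sigma^C
    denom = 1.0 - K * np.einsum("ai,ab,bi->i", mid, C, mid)
    if np.any(denom <= 0):
        raise ValueError("metric of Eqn. 3 is singular for these data and this K")
    curved = K * np.einsum("ai,ai->i", u, vel) ** 2 / denom
    return float(np.sum(flat + curved) * dtheta)

# A single plane rising linearly from 0 to 1 over a presentation length of 1.
ramp = np.linspace(0.0, 1.0, 1001)[None, :]
xi_flat = equivalence_signature(ramp, K=0.0)     # K = 0 case
xi_curved = equivalence_signature(ramp, K=-1.0)  # K = -1 case
```

For this ramp the integrand is 1 when K = 0 and 1/(1 + σ²) when K = −1, so the two signatures can be checked against the closed-form values 1 and π/4.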
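Consider two SDD sections, $\sigma'^A(\theta)$ and $\sigma^A(\theta)$, such that at each point the difference between the values of the maps is $\varepsilon^A(\theta)$:

$$\varepsilon^A(\theta) = \sigma'^A(\theta) - \sigma^A(\theta) \qquad \text{Eqn. 4}$$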
For the two SDD sections to be similar we take $\varepsilon^A(\theta)$ to be small compared with $\sigma^A(\theta)$ so that terms of order $\varepsilon^2(\theta)$ can be neglected. With this as a quantitative measure of similarity, we can assign bounds on the differences of the equivalence signatures via the functional difference:

$$\Delta\xi[\sigma; \varepsilon] \equiv \left|\xi[\sigma + \varepsilon] - \xi[\sigma]\right| \qquad \text{Eqn. 5}$$
As $\varepsilon^A(\theta)$ is small, to a first approximation $\Delta\xi[\sigma; \varepsilon]$ is a linear functional of $\varepsilon^A$. We will exploit this henceforth. For example, suppose we are interested in finding audio channels whose amplitude values differ by no more than $P$ percent at each sample; then $\varepsilon^A(\theta) = p\,\sigma^A(\theta)$, with $p = P/100$, is used in the computation of $\Delta\xi[\sigma; \varepsilon]$. Retrieval of similarity candidates proceeds by finding those audio channels with values of $\xi[\sigma]$, denoted as $\xi[\sigma_{\text{similar}}]$, for which the following inequality holds:

$$\left|\xi[\sigma_{\text{target}}] - \xi[\sigma_{\text{similar}}]\right| \le \Delta\xi[\sigma_{\text{target}}; \varepsilon] \qquad \text{Eqn. 6}$$
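The retrieval test of Eqn. 6 may be sketched as follows for the simplest case, K = 0 with an identity matrix C and e(θ) = 1. The helper names `signature_flat`, `signature_tolerance`, and `similarity_candidates`, the finite-difference approximation, and the assumption that corpus signatures have been precomputed with the same metric are all illustrative choices rather than part of the method as stated.

```python
import numpy as np

def signature_flat(sigma, length=1.0):
    """Eqn. 2 with K = 0, C_AB = delta_AB, e = 1, by finite differences."""
    sigma = np.asarray(sigma, dtype=float)          # shape (m, n)
    dtheta = length / (sigma.shape[1] - 1)          # uniform sampling assumed
    vel = np.diff(sigma, axis=1) / dtheta
    return float(np.sum(vel ** 2) * dtheta)

def signature_tolerance(sigma, p, length=1.0):
    """Delta xi of Eqn. 5 for a perturbation equal to the fraction p of sigma."""
    sigma = np.asarray(sigma, dtype=float)
    shifted = signature_flat((1.0 + p) * sigma, length)
    return abs(shifted - signature_flat(sigma, length))

def similarity_candidates(target, corpus_signatures, p, length=1.0):
    """Indices of corpus signatures that satisfy the inequality of Eqn. 6."""
    xi_target = signature_flat(target, length)
    tol = signature_tolerance(target, p, length)
    xi = np.asarray(corpus_signatures, dtype=float)
    return np.flatnonzero(np.abs(xi - xi_target) <= tol)

# Hypothetical corpus: the target itself, a 3% louder copy, and a 2x louder copy.
theta = np.linspace(0.0, 1.0, 101)
target = np.sin(2.0 * np.pi * theta)[None, :]
corpus = [signature_flat(s * target) for s in (1.0, 1.03, 2.0)]
hits = similarity_candidates(target, corpus, p=0.05)
```

Only the entries returned in `hits` would be passed on to the second, more expensive feature comparison; the remaining entries are discarded after a single scalar comparison each.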
As an example of the reduction in the number of CPU cycles and other resources required in finding similar sections of SDDs in a corpus, assume for simplicity that the equivalence signatures of the SDD sections in the corpus are uniformly distributed in $[\xi_{\min}, \xi_{\max}]$. If, for a target SDD section, the choice of similarity leads to $\Delta\xi[\sigma; \varepsilon]$, the reduction factor for the number of secondary features to be compared is
$$f_r = \frac{2\,\Delta\xi[\sigma; \varepsilon] + 1}{\xi_{\max} - \xi_{\min} + 1} \qquad \text{Eqn. 7}$$
In state-of-the-art information retrieval methodologies, the feature vector which is used for each SDD section would have to be compared to all $N_c$ feature vectors computed for the SDD sections in the corpus. Upon employing the method invented here as a precursor to the feature vector comparison, the number of feature vectors to be compared would be reduced to $f_r N_c$.
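For concreteness, the arithmetic of Eqn. 7 may be sketched as below. The function names and all numerical values (a tolerance of 12, signatures spread over 0 to 999, a corpus of one million sections) are hypothetical and chosen only to make the calculation easy to follow.

```python
def reduction_factor(delta_xi, xi_min, xi_max):
    """Eqn. 7: fraction of the corpus surviving the signature pre-filter,
    under the stated assumption of uniformly distributed signatures."""
    return (2.0 * delta_xi + 1.0) / (xi_max - xi_min + 1.0)

def expected_comparisons(n_corpus, delta_xi, xi_min, xi_max):
    """Expected number of feature-vector comparisons after the pre-filter."""
    return reduction_factor(delta_xi, xi_min, xi_max) * n_corpus

# Hypothetical figures: tolerance 12, signatures in [0, 999], one million sections.
f_r = reduction_factor(12.0, 0.0, 999.0)                          # 25 / 1000
remaining = expected_comparisons(1_000_000, 12.0, 0.0, 999.0)
```

With these illustrative figures, roughly 2.5 percent of the corpus, or about 25,000 of the one million sections, would remain for the feature-vector comparison.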
SDD sections that have the same value for the equivalence signature will be related by
A. rigid translations and rotations within the presentation space,
B. reparameterizations of the presentation space,
C. reversing the signs of the data values,
D. rigid rotations of the $\sigma^A$ into each other about the origin, and
E. local translations in the data space of the form

$$\varepsilon^A(\sigma) = \varepsilon^A \sqrt{1 - K\, C_{CD}\sigma^C\sigma^D} \qquad \text{Eqn. 8}$$

separately and collectively. Proofs of the invariance of the functional in Eqn. 2 under these symmetries are recounted in works such as Ref. [Weinberg]. For certain types of data, a subset of these symmetries is required for similarity, whereas the remaining symmetries account for the presence of non-similar data with the same values for the equivalence signatures, i.e., false positives. For example, for audio, we would like to include symmetry D as part of the realization of similarity so as to account for different linear combinations of the audio channels.
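Symmetries C and D above can be checked numerically on sampled data. The sketch below assumes two data planes, an identity matrix C, e(θ) = 1, K = −1, and a midpoint finite-difference approximation of Eqn. 2; the names `signature` and `rotate_planes` and the test signal are hypothetical.

```python
import numpy as np

def signature(sigma, K=-1.0, length=1.0):
    """Eqn. 2 with C_AB = delta_AB and e = 1, by midpoint finite differences."""
    sigma = np.asarray(sigma, dtype=float)          # shape (m, n)
    dtheta = length / (sigma.shape[1] - 1)          # uniform sampling assumed
    mid = 0.5 * (sigma[:, 1:] + sigma[:, :-1])
    vel = np.diff(sigma, axis=1) / dtheta
    v2 = np.sum(vel * vel, axis=0)                  # delta_AB v^A v^B
    sv = np.sum(mid * vel, axis=0)                  # delta_AB sigma^A v^B
    s2 = np.sum(mid * mid, axis=0)                  # delta_AB sigma^A sigma^B
    return float(np.sum(v2 + K * sv ** 2 / (1.0 - K * s2)) * dtheta)

def rotate_planes(sigma, angle):
    """Symmetry D for m = 2: rotate the two data planes into each other."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ np.asarray(sigma, dtype=float)

# Hypothetical two-channel signal.
theta = np.linspace(0.0, 1.0, 200)
sigma = np.vstack([0.5 * np.sin(2 * np.pi * theta), 0.3 * np.cos(6 * np.pi * theta)])

base = signature(sigma)
flipped = signature(-sigma)                      # symmetry C: sign reversal
rotated = signature(rotate_planes(sigma, 0.7))   # symmetry D: rigid rotation
rescaled = signature(2.0 * sigma)                # not a symmetry of Eqn. 2
```

Under these assumptions `flipped` and `rotated` agree with `base` to within floating-point rounding, whereas `rescaled` does not, which illustrates why a rescaling must be handled through the tolerance of Eqn. 6 rather than through a symmetry.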