1. Field of Invention
This invention relates to the identification and retrieval of digital data by a computing device.
2. Prior Art
This invention provides a method for discovering sets of digital data (SDDs), such as text, binaries, audio channels, and the like, that are organized for point-wise presentation in one dimension and that are similar to a target SDD. Formulae for the lengths of the paths swept out by the data are used as signatures that characterize equivalence classes of SDDs with the same or numerically close data. The method leverages these “equivalence signatures” to find SDDs that are similar to target SDDs and, separately and alternatively, to find SDDs that are dissimilar from the target SDDs.
The definition of “similarity”, and thus the features and method used to compute it, is idiosyncratic to the retrieval application [O'Connor]. In the case of image retrieval [Gonzalez], methods using entropy, moments, etc. as signatures have been invented [U.S. Pat. Nos. 5,933,823; 5,442,716]. Another invention [U.S. Pat. No. 7,246,314] uses closeness to a Gaussian model as a similarity measure for identifying similar videos.
The cost of implementing these methods is typically proportional to the product of the number of SDDs in the database and the cost of computing the distance between the target SDD and another SDD. The latter often [Raghavan] involves the computation of the projection angle between two vectors that represent the features (e.g., histogram of the text elements) of the SDDs. For large databases, this process can be expensive in both resources and time. A two-step method is required wherein, during the retrieval phase, definitely dissimilar SDDs are first weeded out, thereby significantly reducing the number of candidates for similarity. This first step should be computationally inexpensive, so that it significantly reduces the resource requirements and latency of the second step, the application of traditional features.
Intuitively, if two SDDs are similar, then they should be locally deformable into each other. For example, if two audio channels are rescalings of each other, then the audio channels are similar. This invention leverages results from Classical Mechanics to address this problem. In particular, we appeal to field theory representations for the lengths of curves swept out by the SDD when stepping through the presentation space. By construction, these lengths are invariant under reparameterizations of the presentation space and thus characterize equivalence classes of length preserving maps between the presentation and data spaces.
We interpret each SDD as a sampling of maps from a one-dimensional space, N, with coordinate θ, to an m-dimensional space, M, with coordinates σA(θ), for A=1, . . . , m, and seek length preserving equivalence classes of such maps. We label the length of the presentation space dimension as L.
Let the raw data, σ̃A(θ), of each SDD be organized into m data planes, e.g., two PCM channels of stereo audio, for presentation and let each plane have a maximum and minimum value for the data in that plane, σ̃maxA and σ̃minA, respectively. The maximum and minimum values of each plane are used to normalize its data to new maximum and minimum values, σmaxA and σminA respectively, through the expression:
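$$\sigma^A(\theta) = \left[\frac{\sigma_{\max}^A - \sigma_{\min}^A}{\tilde{\sigma}_{\max}^A - \tilde{\sigma}_{\min}^A}\right]\left[\tilde{\sigma}^A(\theta) - \tilde{\sigma}_{\max}^A\right] + \sigma_{\max}^A \qquad \text{Eqn. 1}$$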
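The normalization of Eqn. 1 can be sketched for a single data plane as follows. This is a minimal illustration, not the patented implementation: the function name `normalize_plane`, the default target range, and the handling of a constant plane (where Eqn. 1 is undefined) are assumptions made for the example.

```python
def normalize_plane(raw, new_min=0.0, new_max=1.0):
    """Rescale one data plane per Eqn. 1 so its values span [new_min, new_max]."""
    raw_min, raw_max = min(raw), max(raw)
    if raw_max == raw_min:
        # Eqn. 1 divides by zero for a constant plane; returning new_max
        # everywhere is an assumption made only for this sketch.
        return [new_max for _ in raw]
    scale = (new_max - new_min) / (raw_max - raw_min)
    # Eqn. 1: shift by the raw maximum, rescale, then add the new maximum.
    return [scale * (value - raw_max) + new_max for value in raw]


# Hypothetical 16-bit PCM samples mapped to the range [-1, 1].
pcm = [-32768, 0, 16384, 32767]
normalized = normalize_plane(pcm, new_min=-1.0, new_max=1.0)
```

The raw maximum maps to the new maximum and the raw minimum maps to the new minimum, which is the property Eqn. 1 is constructed to have.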
Additional normalizations of the SDD, such as scaling to a fixed length and the like, may also be performed.
If objects have been segmented from the SDD then the data for these objects are themselves SDDs. We henceforth refer to each segmented portion as an “SDD section” with its own map, σ.
For SDD sections for one-dimensional presentation, such as text, binaries, audio, and the like, the expression for the equivalence signature resolves to the length of the path represented by the data [Abraham], namely
$$\xi[\sigma] = \int_0^L d\theta \, \sqrt{\sum_{A=1}^{m} \frac{d\sigma^A}{d\theta}\frac{d\sigma^A}{d\theta}} \qquad \text{Eqn. 2}$$
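For sampled data, the path length of Eqn. 2 can be approximated by summing the straight-line distances between consecutive samples in the m-dimensional data space. The sketch below assumes uniformly indexed samples and uses a hypothetical function name, `equivalence_signature`; the discretization itself is an assumption, since the text specifies only the continuous integral.

```python
import math


def equivalence_signature(planes):
    """Approximate Eqn. 2 as the summed length of segments between consecutive samples.

    `planes` is a list of m equally long sequences, one per data plane.
    """
    n = len(planes[0])
    if any(len(plane) != n for plane in planes):
        raise ValueError("all data planes must have the same number of samples")
    total = 0.0
    for i in range(n - 1):
        # Euclidean distance between sample i and sample i + 1 across all planes.
        total += math.sqrt(sum((plane[i + 1] - plane[i]) ** 2 for plane in planes))
    return total


# Two hypothetical planes: the path visits (0, 0), (3, 4), (3, 0).
left = [0.0, 3.0, 3.0]
right = [0.0, 4.0, 0.0]
xi = equivalence_signature([left, right])  # segment lengths 5 and 4
```

In this discretization the step in θ cancels between the derivative and the integration measure, so the result does not depend on the sample spacing.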
SDD sections that have the same value for the equivalence signature will belong to the same equivalence classes under:
    reparametrizations of the presentation space (e.g., jumbling of the characters in text);
    rescaling of the presentation space;
    offsets in the data values (e.g., change in the amplitudes of audio PCM data);
    replacing the data values with their mirrored values;
    global, orthogonal rotations of the planes into each other,
separately and collectively. Proofs of these symmetries are recounted in works such as Ref. [Abraham].
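Three of the listed symmetries (offsets, mirroring, and orthogonal rotation of the planes) can be checked numerically on a discretized path length. The sketch below is illustrative only: the helper names `path_length` and `rotate`, the sample values, and the segment-sum discretization are assumptions, and the reparametrization and rescaling symmetries are not exercised here.

```python
import math


def path_length(planes):
    """Sum of Euclidean distances between consecutive samples (discrete Eqn. 2)."""
    n = len(planes[0])
    return sum(
        math.sqrt(sum((plane[i + 1] - plane[i]) ** 2 for plane in planes))
        for i in range(n - 1)
    )


def rotate(plane_a, plane_b, angle):
    """Rotate two data planes into each other by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    rot_a = [c * a - s * b for a, b in zip(plane_a, plane_b)]
    rot_b = [s * a + c * b for a, b in zip(plane_a, plane_b)]
    return rot_a, rot_b


# Hypothetical two-plane SDD section.
a = [0.0, 0.5, 0.2, 0.9, 0.4]
b = [1.0, 0.3, 0.8, 0.1, 0.6]
base = path_length([a, b])

shifted = path_length([[v + 0.25 for v in a], [v - 0.1 for v in b]])  # constant offsets
mirrored = path_length([[-v for v in a], b])                          # mirror one plane
rotated = path_length(list(rotate(a, b, math.pi / 6)))                # rotate the planes
```

Each transformed signature agrees with `base` up to floating-point rounding, because every transformation preserves the distance between consecutive samples.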
Consider two SDD sections, σ1A(θ) and σA(θ), such that at each point the difference between the values of the maps is εA(θ):
$$\varepsilon^A(\theta) = \sigma_1^A(\theta) - \sigma^A(\theta) \qquad \text{Eqn. 3}$$
For the two SDD sections to be similar we take εA(θ) to be small compared with σA(θ) so that terms of order ε²(θ) can be neglected. With this as a quantitative measure of similarity, we can assign bounds on the differences of the equivalence signatures via the functional difference:
$$\Delta\xi[\sigma;\varepsilon] \equiv \left|\xi[\sigma+\varepsilon] - \xi[\sigma]\right| \qquad \text{Eqn. 4}$$
As εA(θ) is small, to a first approximation, Δξ[σ;ε] is a linear functional of εA. We will exploit this henceforth.
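The linearity can be made explicit by expanding the integrand of Eqn. 2 to first order in ε. The following is a sketch under the assumption that the derivative of σ does not vanish anywhere on [0, L]; the prime, denoting d/dθ, is a shorthand introduced here and does not appear in the original text.

```latex
\xi[\sigma+\varepsilon] - \xi[\sigma]
  = \int_0^L d\theta \,
    \frac{\sum_{A=1}^{m} \sigma'^{A}(\theta)\,\varepsilon'^{A}(\theta)}
         {\sqrt{\sum_{B=1}^{m} \sigma'^{B}(\theta)\,\sigma'^{B}(\theta)}}
  + O(\varepsilon^{2})
```

The leading term contains ε only through its first derivative and only to the first power, so its absolute value gives Δξ[σ;ε] as a linear functional of ε at this order.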
For example, suppose we are interested in finding audio channels whose amplitude values differ from those of the target by no more than P percent at each sample. Then εA(θ)=pσA(θ), with p=P/100, is used in the computation of Δξ[σ;ε]. Retrieval of similarity candidates proceeds by finding those audio channels with values of ξ[σ], denoted as ξ[σsimilar], for which the following inequality holds:
$$\left|\xi[\sigma_{\text{target}}] - \xi[\sigma_{\text{similar}}]\right| \le \left|\Delta\xi[\sigma_{\text{target}};\varepsilon]\right| \qquad \text{Eqn. 5}$$
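The first-stage filter of Eqn. 5 can be sketched as below. Because the path length is homogeneous of degree one in σ, choosing ε = pσ with p ≥ 0 gives ξ[σ+ε] = (1+p)ξ[σ], so Δξ[σ;ε] = pξ[σ] for this particular choice. The function name `similarity_candidates`, the precomputed list of corpus signatures, and the sample numbers are assumptions for illustration.

```python
def similarity_candidates(target_signature, corpus_signatures, p):
    """Return indices of corpus entries that satisfy Eqn. 5 for eps = p * sigma."""
    # For eps = p * sigma, xi[(1 + p) * sigma] - xi[sigma] = p * xi[sigma].
    delta = abs(p) * target_signature
    return [
        index
        for index, signature in enumerate(corpus_signatures)
        if abs(target_signature - signature) <= delta
    ]


# Hypothetical precomputed signatures for five SDD sections in a corpus.
corpus = [90.0, 97.5, 100.0, 104.0, 120.0]
candidates = similarity_candidates(100.0, corpus, p=0.05)  # 5 percent tolerance
```

Only the entries returned by this filter would be passed on to the more expensive feature-vector comparison; entries outside the band are treated as definitely dissimilar.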
As an example of the reduction in the number of CPU cycles and other resources required to find similar SDD sections in a corpus, assume for simplicity that the equivalence signatures of the SDD sections in the corpus are uniformly distributed in [ξmin,ξmax]. If, for a target SDD section, the choice of similarity leads to Δξ[σ;ε], the factor by which the number of secondary features to be compared is reduced is
$$f_r = \frac{2\left|\Delta\xi[\sigma;\varepsilon]\right| + 1}{\xi_{\max} - \xi_{\min} + 1} \qquad \text{Eqn. 6}$$
In state of the art information retrieval methodologies, the feature vector which is used for each SDD section would have to be compared to all Nc feature vectors computed for the SDD sections in the corpus. Upon employing the method invented here as a precursor to the feature vector comparison, the number of feature vectors to be compared would be reduced to frNc.
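The arithmetic of Eqn. 6 can be illustrated with a short sketch. The function name `reduction_factor`, the signature range, the tolerance, and the corpus size below are hypothetical values chosen only to show the calculation.

```python
def reduction_factor(delta_xi, xi_min, xi_max):
    """Fraction of the corpus that survives the first-stage filter, per Eqn. 6."""
    return (2 * abs(delta_xi) + 1) / (xi_max - xi_min + 1)


# Hypothetical numbers: signatures uniform in [0, 1000], tolerance of 5.
f_r = reduction_factor(delta_xi=5, xi_min=0, xi_max=1000)

# Hypothetical corpus of one million SDD sections.
corpus_size = 1_000_000
remaining = f_r * corpus_size
```

With these assumed values f_r is 11/1001, roughly 1.1 percent, so about eleven thousand of the one million feature vectors would remain for the second-stage comparison.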