It has long been pointed out that neural networks can compute mappings of the kind $f: u \mapsto v$, where $u$ and $v$ are usually interpreted as activity pattern vectors of the presynaptic and postsynaptic neuron populations, respectively. Indeed, it has been proven that multilayer neural networks can approximate arbitrary functions $f$ with arbitrary precision (although this may require a very large number of neurons in the hidden layers). Neural associative memory networks are a special case where the mapping is between discrete sets of address patterns $u^\mu$ and content patterns $v^\mu$, $\mu = 1, \ldots, M$. Given a noisy address pattern $\tilde{u}$, the usual memory task is then to find the most similar $u^\mu$, i.e., the one minimizing $\|\tilde{u} - u^\mu\|$, and to return the corresponding content $v^\mu$. This task is essentially the best match or nearest-neighbor problem and can be extended to implement continuous mappings $f$ again by interpolating between several sample points. Efficient solutions to the best match problem have widespread applications such as object recognition, clustering or information retrieval in large databases.
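For illustration only, the best match task described above may be sketched as a brute-force search over all stored address patterns. The function name `best_match` and the toy patterns are assumptions made for this sketch; they are not part of the described model, which replaces the exhaustive search by an associative memory matrix.

```python
def best_match(query, addresses, contents):
    """Return the content pattern whose address is closest to the query.

    Minimizing the squared Euclidean distance is equivalent to minimizing
    the norm of (query - address); for binary vectors it equals the
    Hamming distance.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best_mu = min(range(len(addresses)),
                  key=lambda mu: squared_distance(query, addresses[mu]))
    return contents[best_mu]


# Hypothetical toy data: two address/content pairs.
addresses = [[1, 1, 0, 0], [0, 0, 1, 1]]
contents = [[1, 0], [0, 1]]

# A noisy version of the first address (one one-entry is missing).
result = best_match([1, 0, 0, 0], addresses, contents)
```

The search cost of this sketch grows linearly with the number of stored patterns, which is the motivation for the more efficient memory structures discussed in the following.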
STORING AND RETRIEVING PATTERNS IN THE WILLSHAW MODEL
An attractive model of neural associative memory, both for biological modeling and for applications, is the so-called Willshaw or Steinbuch model with binary neurons and synapses, illustrated in FIG. 1. Each address pattern $u^\mu$ is a binary vector of length $m$ that, on average, contains $k$ one-entries and $m-k$ zero-entries. Similarly, each target pattern $v^\mu$ is a binary vector of length $n$ that contains $l$ one-entries and $n-l$ zero-entries. Typically, the patterns are sparse, i.e., $k \ll m$ and $l \ll n$.
FIG. 1 shows the working principle of the binary Willshaw model (for hetero-association). On the left-hand side, during the learning phase, $M$ associations between address patterns $u^\mu$ and content patterns $v^\mu$ are stored in the binary memory matrix $A$ representing binary synaptic weights of the connection from address population $u$ to content population $v$. Initially all synapses are inactive. During learning of pattern associations, the synapses are activated according to Hebbian coincidence learning (eq. 1). On the right-hand side, for retrieval, an address pattern $\tilde{u}$ is propagated through the network. Vector-matrix multiplication yields the membrane potentials $x = \tilde{u}A$. To obtain the retrieval result $\hat{v}$ (here equal to $v^1$), a threshold $\Theta$ is applied. For pattern part retrieval, $\tilde{u} \subset u^\mu$, one may simply choose the Willshaw threshold $\Theta = |\tilde{u}|$, yielding a superset $\hat{v} \supset v^\mu$ of the original pattern (i.e., no missing one-entries).
The $M$ pattern pairs are stored hetero-associatively in a binary memory matrix $A_a \in \{0,1\}^{m \times n}$, where
$$A_{a,ij} = \min\Bigl(1, \sum_{\mu=1}^{M} u_i^\mu \cdot v_j^\mu\Bigr) \in \{0, 1\}. \tag{1}$$
The model may be extended slightly by allowing noisy one-entries and zero-entries in the matrix. Let $\tilde{A}_1$ and $\tilde{A}_0$ be two binary random $m \times n$ matrices, where each matrix entry is one independently with probability $\tilde{p}_1$ and $\tilde{p}_0$, respectively. Interpreting $\tilde{A}_1$ as the set of noisy one-synapses, and $\tilde{A}_0$ as the set of noisy zero-synapses, the resulting synaptic matrix is
$$A = (A_a \vee \tilde{A}_1) \wedge \neg\tilde{A}_0 \in \{0,1\}^{m \times n}. \tag{2}$$
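The learning rule of eq. (1) and the noise model of eq. (2) may be sketched as follows. This is a minimal illustration using plain Python lists; the function names and the toy patterns are assumptions of the sketch. It further assumes that eq. (2) combines the matrices element-wise as (A_a OR noisy ones) AND NOT (noisy zeros), which is the reading that reproduces the memory load given later in eq. (5).

```python
import random


def store_patterns(addresses, contents):
    """Clipped Hebbian learning, eq. (1): A_a[i][j] = min(1, sum_mu u_i * v_j)."""
    m, n = len(addresses[0]), len(contents[0])
    A_a = [[0] * n for _ in range(m)]  # initially all synapses are inactive
    for u, v in zip(addresses, contents):
        for i in range(m):
            if u[i]:
                for j in range(n):
                    if v[j]:
                        A_a[i][j] = 1  # coincidence of pre- and postsynaptic activity
    return A_a


def add_synaptic_noise(A_a, p1_noise, p0_noise, rng):
    """Assumed reading of eq. (2): A = (A_a OR A1_noise) AND NOT A0_noise."""
    A = []
    for row in A_a:
        new_row = []
        for a in row:
            noisy_one = rng.random() < p1_noise   # entry of the noisy one-synapse matrix
            noisy_zero = rng.random() < p0_noise  # entry of the noisy zero-synapse matrix
            new_row.append(int((a or noisy_one) and not noisy_zero))
        A.append(new_row)
    return A


# Hypothetical toy example with m = 6, n = 5, M = 2, k = l = 2.
U = [[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0]]
V = [[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]]
A_a = store_patterns(U, V)
A = add_synaptic_noise(A_a, 0.0, 0.0, random.Random(0))  # no noise: A equals A_a
```

With both noise probabilities set to zero the sketch reduces to the original Willshaw matrix.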
After learning, the stored information can be retrieved by applying an address pattern $\tilde{u}$. Vector-matrix multiplication yields the neural potentials $x = \tilde{u}A$ of the target population, and imposing a threshold $\Theta$ gives the (one-step) retrieval result $\hat{v}$,
$$\hat{v}_j = \begin{cases} 1, & x_j = \Bigl(\sum_{i=1}^{m} \tilde{u}_i A_{ij}\Bigr) \ge \Theta \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
A critical prerequisite for high retrieval quality is the right choice of the threshold value $\Theta$: values that are too low lead to high rates of false one-entries (add-errors), whereas values that are too high result in high rates of false zero-entries (miss-errors).
This threshold choice can be simple or rather difficult, depending on the types of errors present in the address. In the following, $c$ and $f$ denote the number of correct and false one-entries in the address, respectively.
For $\tilde{p}_0 = 0$ and error-free addresses ($c = k$ and $f = 0$) or pattern part retrieval, that is, when the address contains miss errors only ($c = \lambda k$ and $f = 0$ with $0 < \lambda < 1$), the optimal threshold value is a simple function of the address pattern, $\Theta = |\tilde{u}|$. This threshold value was used in the original Willshaw model and will therefore be referred to as the Willshaw threshold. A detailed description of the Willshaw model may be found in the original article by D. J. Willshaw et al. ("Non-holographic associative memory", Nature, 222:960-962, 1969).
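One-step retrieval according to eq. (3) with the Willshaw threshold may be sketched as follows. The function name `retrieve` and the hard-coded toy matrix (two stored pattern pairs, no synaptic noise) are assumptions made for illustration.

```python
def retrieve(address, A, threshold=None):
    """One-step retrieval, eq. (3).

    If no threshold is given, the Willshaw threshold |address| (the number
    of one-entries in the address pattern) is used.
    """
    m, n = len(A), len(A[0])
    if threshold is None:
        threshold = sum(address)
    # Potentials x = address * A (vector-matrix multiplication).
    potentials = [sum(address[i] * A[i][j] for i in range(m)) for j in range(n)]
    retrieved = [int(x >= threshold) for x in potentials]
    return retrieved, potentials


# Hypothetical memory matrix storing u1=(1,1,0,0,0,0) -> v1=(1,0,1,0,0)
# and u2=(0,0,1,1,0,0) -> v2=(0,1,0,0,1).
A = [
    [1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Pattern part retrieval: only one of the two one-entries of u1 is present.
v_hat, x = retrieve([1, 0, 0, 0, 0, 0], A)
```

In this toy case the partial address already recovers the stored content exactly; in a more heavily loaded matrix the result would be a superset of the stored content, since the Willshaw threshold cannot produce miss-errors when the address contains no false one-entries and the matrix has no noisy zero-entries.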
MEMORY LOAD
First the memory load $p_1$ is determined, that is, the fraction of one-entries in the memory matrix. The probability that a synapse gets activated during learning of $M$ pattern pairs is
$$p_{1a} = 1 - \Bigl(1 - \frac{kl}{mn}\Bigr)^{M} \approx 1 - e^{-Mkl/mn}, \tag{4}$$
and therefore, considering the noisy synapses, the probability that a synapse is active is
$$p_1 = 1 - \tilde{p}_0 - (1 - \tilde{p}_0)(1 - \tilde{p}_1)(1 - p_{1a}) \;\ge\; p_{1,\min} := 1 - \tilde{p}_0 - (1 - \tilde{p}_0)(1 - \tilde{p}_1) \tag{5}$$
$$p_{1a} = 1 - \frac{1 - \tilde{p}_0 - p_1}{(1 - \tilde{p}_0)(1 - \tilde{p}_1)} \tag{6}$$
$$M = \frac{\ln \dfrac{1 - \tilde{p}_0 - p_1}{(1 - \tilde{p}_0)(1 - \tilde{p}_1)}}{\ln(1 - kl/mn)} \approx -\frac{mn}{kl} \ln \frac{1 - \tilde{p}_0 - p_1}{(1 - \tilde{p}_0)(1 - \tilde{p}_1)}. \tag{7}$$
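The relations between the number of stored patterns and the memory load, eqs. (4) to (7), may be checked numerically with a short sketch. The function names and the parameter values in the example are assumptions chosen for illustration; the exact (not the exponential) forms of eqs. (4) and (7) are used.

```python
import math


def memory_load(M, k, l, m, n, p1_noise=0.0, p0_noise=0.0):
    """Return (p1a, p1) after storing M pattern pairs, eqs. (4) and (5)."""
    p1a = 1.0 - (1.0 - k * l / (m * n)) ** M
    p1 = 1.0 - p0_noise - (1.0 - p0_noise) * (1.0 - p1_noise) * (1.0 - p1a)
    return p1a, p1


def minimal_load(p1_noise=0.0, p0_noise=0.0):
    """Load of the empty memory (M = 0), the lower bound p1_min in eq. (5)."""
    return 1.0 - p0_noise - (1.0 - p0_noise) * (1.0 - p1_noise)


def patterns_for_load(p1, k, l, m, n, p1_noise=0.0, p0_noise=0.0):
    """Number of stored patterns M that yields memory load p1, eq. (7)."""
    ratio = (1.0 - p0_noise - p1) / ((1.0 - p0_noise) * (1.0 - p1_noise))
    return math.log(ratio) / math.log(1.0 - k * l / (m * n))


# Hypothetical example: m = n = 1000, k = l = 10, M = 1000 pattern pairs.
p1a, p1 = memory_load(1000, 10, 10, 1000, 1000, p1_noise=0.01, p0_noise=0.02)
M_back = patterns_for_load(p1, 10, 10, 1000, 1000, p1_noise=0.01, p0_noise=0.02)
```

Inverting the load with `patterns_for_load` should recover the number of stored patterns up to floating-point error, which is a convenient consistency check of eqs. (5) and (7).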
RETRIEVAL ERRORS AND HIGH FIDELITY
Let the noisy address pattern $\tilde{u}$ contain $c$ correct one-entries of $u^\mu$ and additional noise in the form of $f$ false one-entries. Then, for a given threshold $\Theta$, the probability $p_{01}$ of an add-error and the probability $p_{10}$ of a miss-error are
$$p_{01}(\Theta) = p(\hat{v}_i = 1 \mid v_i^\mu = 0) = \sum_{x=\Theta}^{c+f} p_{WP,lo}(x; k, l, m, n, M-1, \tilde{p}_1, \tilde{p}_0, c, f) \tag{8}$$
$$p_{10}(\Theta) = p(\hat{v}_i = 0 \mid v_i^\mu = 1) = \sum_{x=0}^{\Theta-1} p_{WP,hi}(x; k, l, m, n, M-1, \tilde{p}_1, \tilde{p}_0, c, f) \tag{9}$$
where $p_{WP,lo}$ is the potential distribution of the "low" units (corresponding to the zero-entries of $v^\mu$) and $p_{WP,hi}$ is the potential distribution of the "high" units (corresponding to the one-entries of $v^\mu$).
The retrieval or output noise $\varepsilon$ may be defined according to the expected Hamming distance $d_H$ between $\hat{v}$ and $v^\mu$. Normalizing to the pattern activity $l$, one obtains
$$\varepsilon(\Theta) := \frac{E\,d_H(\hat{v}, v^\mu)}{l} = \frac{n-l}{l}\, p_{01}(\Theta) + p_{10}(\Theta). \tag{10}$$
An alternative definition based on the information $I$ and transinformation (or mutual information) $T$ of a binary channel is
$$\varepsilon_T(\Theta) := 1 - \frac{T(l/n, p_{01}, p_{10})}{I(l/n)}. \tag{11}$$
The latter definition is particularly useful when analyzing low quality retrieval for sparse patterns with $l \ll n$. The former definition is useful because it allows an elegant analysis of pattern capacity and storage capacity. Now the optimal threshold $\Theta_{opt} \in \{0, 1, \ldots, c+f\}$ can be defined by minimizing $\varepsilon$ (or $\varepsilon_T$). For $\varepsilon \ll 1$, "high-fidelity" retrieval is obtained.
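The two noise measures of eqs. (10) and (11) may be sketched as follows. The text does not spell out $T$ and $I$ at this point; the sketch assumes the standard definitions for a binary channel, i.e., $I$ is the binary Shannon information and $T(p, p_{01}, p_{10})$ is the mutual information between a binary input with one-probability $p$ and its output after add-errors and miss-errors. The function names and example values are illustrative assumptions.

```python
import math


def binary_information(p):
    """Shannon information (in bits) of a binary variable with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def transinformation(p, p01, p10):
    """Mutual information of a binary channel (assumed standard definition).

    p   : probability of a one at the channel input
    p01 : probability that an input 0 is read out as 1 (add-error)
    p10 : probability that an input 1 is read out as 0 (miss-error)
    """
    q = p * (1.0 - p10) + (1.0 - p) * p01  # probability of a one at the output
    return (binary_information(q)
            - p * binary_information(p10)
            - (1.0 - p) * binary_information(p01))


def output_noise(p01, p10, l, n):
    """Expected Hamming distance normalized to the pattern activity l, eq. (10)."""
    return (n - l) / l * p01 + p10


def output_noise_T(p01, p10, l, n):
    """Information-based output noise, eq. (11)."""
    return 1.0 - transinformation(l / n, p01, p10) / binary_information(l / n)


# Hypothetical example: n = 1000, l = 10, small add-error and miss-error rates.
eps = output_noise(0.001, 0.1, 10, 1000)
eps_T = output_noise_T(0.001, 0.1, 10, 1000)
```

Both measures vanish for error-free retrieval; given the error probabilities of eqs. (8) and (9) as functions of the threshold, the optimal threshold could be found by evaluating either measure for every candidate threshold from 0 to c + f.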
STORAGE CAPACITY
Now, the maximal number of patterns $M_\varepsilon$ may be computed such that the retrieval noise is at most $\varepsilon$. Since the minimal Hamming distance (optimizing $\Theta$) obviously increases with $M$, the pattern capacity $M_\varepsilon$ (on quality level $\varepsilon$) may be defined as
$$M_\varepsilon := \max\{M : (n-l)\,p_{01} + l\,p_{10} \le \varepsilon l\}. \tag{12}$$
Considering the Shannon information of individual content patterns, one obtains the normalized network storage capacity in bits per synapse,
$$C_\varepsilon := \frac{\text{stored information}}{\#\text{synapses}} = \frac{M_\varepsilon\, T(v^\mu; \hat{v}^\mu)}{mn} \approx \frac{M_\varepsilon\, T(l/n, p_{01}, p_{10})}{m} \tag{13}$$
where $T(v^\mu; \hat{v}^\mu)$ is the transinformation (or mutual information) between learned and retrieved content pattern.
From the network capacity, further performance measures may be derived, such as the synaptic capacity $C^S$ and the information capacity $C^I$, which make use of the compressibility of the memory matrix for a memory load $p_1 \neq 0.5$,
$$C_\varepsilon^S := \frac{\text{stored information}}{\#\text{non-silent synapses}} = \frac{C_\varepsilon}{\min(p_{1\varepsilon}, 1 - p_{1\varepsilon})}, \tag{14}$$
$$C_\varepsilon^I := \frac{\text{stored information}}{\#\text{bits of required physical memory}} = \frac{C_\varepsilon}{I(p_{1\varepsilon})}, \tag{15}$$
where $p_{1\varepsilon}$ is the memory matrix load after storing $M_\varepsilon$ pattern pairs, and $I(p_{1\varepsilon}) := -p_{1\varepsilon}\,\mathrm{ld}\,p_{1\varepsilon} - (1 - p_{1\varepsilon})\,\mathrm{ld}(1 - p_{1\varepsilon})$ is the Shannon information of a memory matrix component (ld denotes the binary logarithm). Essentially, $C^S$ normalizes the stored information to the number of non-silent synapses (instead of all synapses). The variable $C^S$ is relevant for VLSI implementations of the Willshaw model where each synapse, similar to biology, must be realized in hardware. Similarly, $C^I$ normalizes the stored information to the amount of physical memory that is actually required to represent the memory matrix. This is relevant mainly for technical applications where the Willshaw matrix must be maintained in the memory of digital computers.
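The normalizations of eqs. (13) to (15) may be illustrated with a brief sketch. The function names and the numeric values of the network capacity and memory load are hypothetical and serve only to show how the three measures relate; the memory load is assumed to lie strictly between 0 and 1.

```python
import math


def binary_information(p):
    """Shannon information (in bits) of a matrix component with load p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def network_capacity(M, T_component, m):
    """Right-hand side of eq. (13): bits of stored information per synapse.

    T_component is the transinformation per content pattern component.
    """
    return M * T_component / m


def synaptic_capacity(C, p1):
    """Eq. (14): stored information per non-silent synapse."""
    return C / min(p1, 1.0 - p1)


def information_capacity(C, p1):
    """Eq. (15): stored information per bit of required physical memory."""
    return C / binary_information(p1)


# Hypothetical values: network capacity 0.2 bit/synapse at memory load 0.1.
C = 0.2
C_S = synaptic_capacity(C, 0.1)
C_I = information_capacity(C, 0.1)
```

At a memory load of exactly 0.5 the matrix is incompressible, so the information capacity coincides with the network capacity; for sparser or denser matrices both derived measures exceed it.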
APPROXIMATIVE ANALYSIS OF RETRIEVAL WITH THE WILLSHAW THRESHOLD
For retrieval with the maximal "Willshaw" threshold $\Theta = |\tilde{u}|$ there is an elegant analysis based on a binomial approximation of the error probabilities. Unfortunately, this is useful only for $\tilde{p}_0 = 0$, i.e., no noisy zero-entries in the memory matrix, and $f = 0$, i.e., no false one-entries in the address pattern $\tilde{u}$ (pattern part retrieval). In that case, when $c = |\tilde{u}| = \lambda k$ and $f = 0$, the probability of miss noise is $p_{10} = 0$ and the probability of add-noise can be approximated by
$$p_{01} \approx p_1^{\lambda k} \tag{16}$$
assuming independently generated one-entries in the memory matrix and $\lambda := c/k$. This assumption becomes correct in the limit of large networks for many cases. Requiring high fidelity level $\varepsilon$, one obtains from eq. (10) with $p_{10} = 0$ the bound $p_{01} \le \varepsilon l/(n-l)$. Solving for $p_1$, the high fidelity memory load $p_{1\varepsilon}$ is obtained, and from this the high fidelity capacities,
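$$p_{1\varepsilon} \approx \Bigl(\frac{\varepsilon l}{n-l}\Bigr)^{\frac{1}{\lambda k}} \overset{!}{\ge} p_{1,\min} \quad\Leftrightarrow\quad k \approx \frac{\mathrm{ld}\,\frac{\varepsilon l}{n-l}}{\lambda\,\mathrm{ld}\,p_{1\varepsilon}} \overset{!}{\ge} k_{\min} := \frac{\mathrm{ld}\,\frac{\varepsilon l}{n-l}}{\lambda\,\mathrm{ld}\,p_{1,\min}} \tag{17}$$
$$M_\varepsilon \approx -\lambda^2 \cdot (\mathrm{ld}\,p_{1\varepsilon})^2 \cdot \ln\frac{1 - p_{1\varepsilon}}{1 - \tilde{p}_1} \cdot \frac{k}{l} \cdot \frac{mn}{\bigl(\mathrm{ld}\,\frac{n-l}{\varepsilon \cdot l}\bigr)^2} \quad \text{for } p_{1\varepsilon} \ge p_{1,\min} \tag{18}$$
$$C_\varepsilon \approx \lambda \cdot \mathrm{ld}\,p_{1\varepsilon} \cdot \ln\frac{1 - p_{1\varepsilon}}{1 - \tilde{p}_1} \cdot \eta \quad \text{for } p_{1\varepsilon} \ge p_{1,\min} \tag{19}$$
where $\eta$ is usually near 1,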
                    η        :=                                            T              ⁡                              (                                                      l                    n                                    ,                                                            ε                      ⁢                                                                                          ⁢                      l                                                              n                      -                      l                                                        ,                  0                                )                                                                    -                                  l                  n                                            ⁢              ld              ⁢                                                ε                  ⁢                                                                          ⁢                  l                                                  n                  -                  l                                                              ≈                                    1                              1                +                                                      ln                    ⁢                                                                                  ⁢                    ε                                                        ln                    ⁡                                          (                                              l                        /                        n                                            )                                                                                            .                                              (        20        )            
Note that equation 17 implies a minimal assembly size kmin for the address population. For smaller assemblies it is impossible to achieve high-fidelity ε, even when storing only a single association.
SPARSE AND DENSE POTENTIATION The main conclusions from the binomial approximative analysis are that a very high storage capacity of almost 0.7 bits per synapse can be achieved for sparse patterns with k˜log n and memory load p1=0.5. Then, one may store on the order of M˜mn/(log n)² pattern associations with high retrieval quality. From equations 17 and 19 it is easy to see that asymptotically

$$C > 0 \;\Leftrightarrow\; k \sim \log n \;\Leftrightarrow\; 0 < p_{1\varepsilon} < 1. \qquad (21)$$

Thus, the analysis suggests that neural associative memory is efficient (C>0) only for logarithmically sparse patterns. For sub-logarithmic sparse patterns with k/log n→0 one has p1→0 and for super-logarithmic sparse patterns with k/log n→∞ one has p1→1, both cases implying vanishing network storage capacity Cε→0.
In the following, the three cases p1→0, p1→c (with 0<c<1), and p1→1 will be referred to as sparse, balanced, and dense synaptic potentiation, respectively.
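The figure of almost 0.7 bits per synapse can be checked numerically from eqs. 19 and 20. The following is a minimal sketch, not part of the described method: the function names are hypothetical, the default values λ=1, p̃1=0 and η=1 are illustrative assumptions, and the validity condition p1ε≥p1,min of eq. 19 is not checked. The helper `assembly_size` uses the relation between k and p1ε given in eq. 22, solved for k.

```python
import math


def ld(x):
    """Binary logarithm, written 'ld' in the text."""
    return math.log2(x)


def eta_approx(eps, l, n):
    """Right-hand approximation of eq. 20: 1 / (1 + ln(eps) / ln(l/n))."""
    return 1.0 / (1.0 + math.log(eps) / math.log(l / n))


def capacity(p1_eps, lam=1.0, p1_tilde=0.0, eta=1.0):
    """Eq. 19: C ~ lam * ld(p1_eps) * ln((1 - p1_eps) / (1 - p1_tilde)) * eta.

    The defaults lam=1, p1_tilde=0, eta=1 are assumptions for illustration.
    """
    return lam * ld(p1_eps) * math.log((1.0 - p1_eps) / (1.0 - p1_tilde)) * eta


def assembly_size(p1_eps, eps, l, n, lam=1.0):
    """Address assembly size k for a given memory load (eq. 22 solved for k)."""
    return ld(eps * l / (n - l)) / (lam * ld(p1_eps))


if __name__ == "__main__":
    # Balanced potentiation: the capacity peaks at p1_eps = 0.5 with ln 2 bits.
    for p1 in (0.1, 0.5, 0.9):
        print(p1, capacity(p1))
    # Example parameters (assumed): n = 10**6 content neurons, l = 20, eps = 0.01.
    print(assembly_size(0.5, 0.01, 20, 10**6), eta_approx(0.01, 20, 10**6))
```

With these assumptions, `capacity(0.5)` equals ln 2 ≈ 0.69, and the capacity falls off towards both p1ε→0 and p1ε→1, which is the behavior summarized in eq. 21.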
It should be noted that this conclusion may be biased by the definition of network storage capacity; alternative definitions of storage capacity that take the compressibility of the network into account lead to different conclusions. For example, in technical implementations of the Willshaw model the memory matrix can be compressed for p1→0 or p1→1, and the (information) storage capacity improves by the factor 1/I(p1), where I(p1):=−p1 ld p1−(1−p1) ld(1−p1). This leads to the definition of “compression” capacities CI:=C/I(p1) and CS:=C/min(p1, 1−p1) (see eqs. 14, 15, 19).
Interestingly, and in contrast to network capacity C, optimizing CI and CS reveals highest capacities for p1→0 and p1→1. As a result, the regimes with ultrasparse and moderately sparse patterns (or cell assemblies) have gained increased attention.
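The contrast between C and the compression capacities CI and CS can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the network capacity is taken from eq. 19 with λ=1, p̃1=0 and η=1, which is a simplification chosen only for illustration.

```python
import math


def information(p1):
    """I(p1) = -p1 ld p1 - (1 - p1) ld(1 - p1), for 0 < p1 < 1."""
    return -p1 * math.log2(p1) - (1.0 - p1) * math.log2(1.0 - p1)


def network_capacity(p1):
    """Eq. 19 with lam = 1, p1_tilde = 0, eta = 1 (assumptions for illustration)."""
    return math.log2(p1) * math.log(1.0 - p1)


def compression_capacities(C, p1):
    """Return (CI, CS) with CI = C / I(p1) and CS = C / min(p1, 1 - p1)."""
    return C / information(p1), C / min(p1, 1.0 - p1)


if __name__ == "__main__":
    for p1 in (0.001, 0.1, 0.5, 0.9, 0.999):
        C = network_capacity(p1)
        CI, CS = compression_capacities(C, p1)
        print(f"p1={p1}: C={C:.4f}  CI={CI:.4f}  CS={CS:.4f}")
```

Under these assumptions C is largest at p1=0.5, whereas CI and CS take larger values at p1=0.001 and p1=0.999 than at p1=0.5.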
However, the convergence of the binomial approximations towards the exact values may be questionable. In particular, for dense potentiation with p0ε=1−p1ε→0, super-logarithmic sparseness, k/log n→∞, and

$$p_{1\varepsilon} = \left(\frac{\varepsilon l}{n-l}\right)^{1/(\lambda k)} = e^{\frac{\ln(\varepsilon l/(n-l))}{\lambda k}} \approx 1 - \frac{\ln\frac{n-l}{\varepsilon l}}{\lambda k}, \qquad (22)$$

$$M_\varepsilon = \frac{\ln(1-p_{1\varepsilon})}{\ln\left(1-\frac{kl}{mn}\right)} \approx \frac{mn}{kl}\left(\ln(\lambda k) - \ln\ln\frac{n-l}{\varepsilon l}\right) \approx \frac{mn \ln k}{kl}, \qquad (23)$$

numerical simulations of small networks reveal that the capacities can be massively overestimated by the binomial approximative analysis. Nevertheless, the inventor has shown in an asymptotic analysis for large networks that the binomial approximative analysis from above becomes exact at least for k=O(n/log² n). That means that the regime of dense potentiation actually leads to high-performance networks for a very broad range of pattern activities.
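The expressions of eqs. 22 and 23 can be evaluated directly to see how close the successive approximations are for a given parameter set. The sketch below only evaluates the closed-form expressions and does not reproduce the numerical simulations mentioned above. The function names and the example parameters are assumptions chosen for illustration.

```python
import math


def p1_eps_exact(eps, l, n, lam, k):
    """Eq. 22, exact form: (eps * l / (n - l)) ** (1 / (lam * k))."""
    return (eps * l / (n - l)) ** (1.0 / (lam * k))


def p1_eps_linear(eps, l, n, lam, k):
    """Eq. 22, first-order approximation for large k."""
    return 1.0 - math.log((n - l) / (eps * l)) / (lam * k)


def m_exact(eps, l, n, m, lam, k):
    """Eq. 23, left-hand form: ln(1 - p1_eps) / ln(1 - k l / (m n))."""
    # 1 - p1_eps is computed with expm1 to avoid cancellation near p1_eps = 1.
    one_minus_p1 = -math.expm1(math.log(eps * l / (n - l)) / (lam * k))
    return math.log(one_minus_p1) / math.log1p(-k * l / (m * n))


def m_approx(eps, l, n, m, lam, k):
    """Eq. 23, middle form: (m n / (k l)) * (ln(lam k) - ln ln((n - l) / (eps l)))."""
    return (m * n) / (k * l) * (
        math.log(lam * k) - math.log(math.log((n - l) / (eps * l)))
    )


def m_coarse(l, n, m, k):
    """Eq. 23, right-hand form: m n ln(k) / (k l)."""
    return m * n * math.log(k) / (k * l)


if __name__ == "__main__":
    # Assumed example: m = n = 10**6, k = l = 10**4, eps = 0.01, lam = 1.
    args = dict(eps=0.01, l=10**4, n=10**6, lam=1.0, k=10**4)
    print(p1_eps_exact(**args), p1_eps_linear(**args))
    print(m_exact(m=10**6, **args), m_approx(m=10**6, **args))
    print(m_coarse(l=10**4, n=10**6, m=10**6, k=10**4))
```

For this parameter set the first two forms of eq. 23 agree closely, while the last form drops the ln ln term and is therefore only an order-of-magnitude estimate.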
THE (EXCITATORY) DILUTED WILLSHAW MODEL In the following, it will be assumed that, on average, only a fraction p of the mn synapses are actually realized in the network. More precisely, it is assumed that a potential synapse is realized with probability p independently of other synapses. In other words, it is assumed that the network is incompletely connected.
Obviously, this case is already included in the theoretical framework presented above by using p̃0=1−p. For the sake of brevity, the matrix of realized synapses will be denoted as

$$A_p := 1 - \tilde{A}_0 \qquad (24)$$

Unfortunately, as already mentioned, for p<1 there is no obvious optimal threshold anymore. Instead, the threshold needs to be adjusted “by hand”. For example, if the statistics of the address pattern (c and ƒ) are known, one can optimize the threshold in order to minimize the expected Hamming distance (eqs. 10, 11). However, numerical evaluations indicate that the performance of this naive implementation of the excitatory Willshaw model becomes quite bad for p<1. In particular, the synaptic storage capacity vanishes, C→0, for p→0.
Previous works have focused on finding optimal threshold strategies for diluted networks. It has been suggested to choose thresholds Θj individually for each content neuron j as a function of the so-called input activity

$$\theta_j := \sum_{i=1}^{m} \tilde{u}_i A_{p,ij}, \qquad (25)$$

defined as the number of active address neurons connected to neuron j (see FIG. 2, left panel). Simply using Θj=θj corresponds to the previous Willshaw threshold and again implies p10=0, i.e., no miss-noise in the retrieval result. For pattern part retrieval with ƒ=0 (no add-noise in the address pattern), this threshold strategy is optimal if, besides input activity, no further information is used. However, for an excitatory interpretation this strategy has a severe disadvantage: for technical applications, the matrix Ap of realized synapses must be represented explicitly, which inevitably decreases storage capacity. This becomes particularly severe for p1→0 or p1→1 when A can be compressed. Then memory requirements are dominated by Ap and therefore storage capacity vanishes.
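The threshold strategy Θj=θj can be sketched in a few lines. The code below is an illustration under stated assumptions, not the implementation of the described method: the function names are hypothetical, matrices are plain nested lists, and the effective synaptic matrix of the diluted network is assumed to be the elementwise product of A and Ap (an unrealized synapse contributes nothing).

```python
import random


def learn(pairs, m, n):
    """Hebbian coincidence learning: A[i][j] = 1 if u[i] = v[j] = 1 for some pair."""
    A = [[0] * n for _ in range(m)]
    for u, v in pairs:
        for i in range(m):
            if u[i]:
                for j in range(n):
                    if v[j]:
                        A[i][j] = 1
    return A


def random_mask(m, n, p, rng):
    """Matrix Ap of realized synapses: each entry is 1 with probability p."""
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(m)]


def input_activity(u_tilde, Ap, j):
    """Eq. 25: number of active address neurons with a realized synapse onto j."""
    return sum(u_tilde[i] * Ap[i][j] for i in range(len(Ap)))


def retrieve(u_tilde, A, Ap):
    """Retrieval with the individual thresholds Theta_j = theta_j."""
    m, n = len(A), len(A[0])
    v_hat = []
    for j in range(n):
        # Membrane potential over realized AND potentiated synapses (assumption).
        x_j = sum(u_tilde[i] * A[i][j] * Ap[i][j] for i in range(m))
        v_hat.append(1 if x_j >= input_activity(u_tilde, Ap, j) else 0)
    return v_hat


if __name__ == "__main__":
    rng = random.Random(0)
    m, n = 12, 10
    u1 = [1, 1, 1] + [0] * 9
    v1 = [1, 1] + [0] * 8
    A = learn([(u1, v1)], m, n)
    Ap = random_mask(m, n, 0.5, rng)
    print(retrieve(u1, A, Ap))
```

Note that the retrieval needs Ap at query time to compute θj, which is the disadvantage discussed above. A content neuron with θj=0 receives no realized input from the active address neurons and fires under this rule, which contributes add-noise but never miss-noise.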
In the following, the two model variants with the naive threshold control (described at the beginning of this section) and the optimal threshold control (eq. 25) will be referred to as the naive and optimal excitatory diluted Willshaw model, respectively.
In the light of these analyses, it is an object of the present invention to provide an improved method and device for realizing an associative memory on the basis of a neural network.
This object is achieved according to the invention by a method and a device for forming an associative memory according to the independent claims. Advantageous embodiments are defined in the dependent claims.
The associative memory is a digital computer memory that may be formed or realized by building up a corresponding matrix or structure in the digital memory of a general-purpose digital computer or physically implemented on a parallel (VLSI, FPGA) hardware architecture.
The method and device according to the invention have the following advantages:
        The threshold strategy becomes very simple, even when using the input activity in diluted networks: For example, assume that the optimal threshold of the excitatory model is Θj=θj+c for an offset c. Then the optimal threshold for the inhibitory model is simply c, i.e., again independent of the individual input activities.
        Consequently, it is not necessary to explicitly represent Ap, which can dramatically increase the storage capacity if A can be compressed.
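The equivalence of the two thresholds stated in the first advantage can be checked numerically. The sketch below rests on an assumption that is not spelled out in the passage above: the inhibitory matrix is taken to have an entry −1 wherever a synapse is realized in Ap but not potentiated in A, and 0 elsewhere. With that construction the inhibitory membrane potential equals the excitatory potential minus the input activity θj, so a threshold θj+c in the excitatory model corresponds to the global threshold c in the inhibitory model. All function names are hypothetical.

```python
import random


def excitatory_retrieval(u_tilde, A, Ap, c):
    """Excitatory model with individual thresholds Theta_j = theta_j + c."""
    m, n = len(A), len(A[0])
    v_hat = []
    for j in range(n):
        x_j = sum(u_tilde[i] * A[i][j] * Ap[i][j] for i in range(m))
        theta_j = sum(u_tilde[i] * Ap[i][j] for i in range(m))  # eq. 25
        v_hat.append(1 if x_j >= theta_j + c else 0)
    return v_hat


def inhibitory_matrix(A, Ap):
    """Assumed construction: -1 for realized but unpotentiated synapses, else 0."""
    return [
        [-1 if (Ap[i][j] and not A[i][j]) else 0 for j in range(len(A[0]))]
        for i in range(len(A))
    ]


def inhibitory_retrieval(u_tilde, AI, c):
    """Inhibitory model with the single global threshold c; Ap is not needed."""
    m, n = len(AI), len(AI[0])
    return [
        1 if sum(u_tilde[i] * AI[i][j] for i in range(m)) >= c else 0
        for j in range(n)
    ]


def random_binary_matrix(m, n, p, rng):
    """Helper for the demonstration: independent entries, 1 with probability p."""
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(m)]


if __name__ == "__main__":
    rng = random.Random(1)
    m, n = 20, 15
    A = random_binary_matrix(m, n, 0.4, rng)
    Ap = random_binary_matrix(m, n, 0.6, rng)
    AI = inhibitory_matrix(A, Ap)
    u_tilde = [1 if rng.random() < 0.3 else 0 for _ in range(m)]
    for c in (0, -1, -2):
        same = excitatory_retrieval(u_tilde, A, Ap, c) == inhibitory_retrieval(u_tilde, AI, c)
        print(c, same)
```

In this sketch only the single matrix AI is consulted during inhibitory retrieval, which reflects the second advantage: Ap does not have to be stored separately.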