1. Field of the Invention
The present invention relates to an associative memory in a form of a neural network utilized in the cognition and comprehension scheme.
2. Description of the Background Art
The cognition and comprehension scheme based on the so called connectionist model can be utilized for a translation word selection in a natural language translation, as well as for a homonym selection in a Japanese word processing.
In this connectionist model, it is necessary to carry out the learning of the associative memory using non-random patterns. However, conventional associative memories are suited only to learning with random patterns, where a random pattern is a pattern given as a sequence of N elements, each of which takes a value of either 0 or 1, in which the values of the i-th element and the j-th (j ≠ i) element are determined independently. Consequently, such a conventional associative memory has been inapplicable to the cognition and comprehension scheme based on the connectionist model.
This problem of the conventional cognition and comprehension scheme based on the connectionist model will now be described in detail, using a case of the kana-kanji conversion (conversion from Japanese syllabaries to Chinese characters) in the Japanese word processing as an illustrative example.
<<The kana-kanji conversion based on the connectionist model>>
In recent years, numerous studies of the cognition and comprehension scheme based on the connectionist model have been undertaken. (See, for example, D. L. Waltz and J. B. Pollack, "Massively Parallel Parsing: A Strongly Interactive Model of Natural Language Interpretation", Cognitive Science, Vol. 9, pp. 51-74, 1985.)
In this cognition and comprehension scheme, each symbol is represented by a node, and relationships among symbols are represented by a network connecting nodes together, in which a topic of the input information is recognized and semantically comprehended by propagating activation values assigned to the nodes through the network. This cognition and comprehension scheme is also applicable to the speech or letter recognition in addition to the natural language processing.
In particular, there has been much research on applying this cognition and comprehension scheme to the kana-kanji conversion in the Japanese word processor, as can be seen for example in Japanese Patent Application Laid Open (Kokai) No. 3-22167 (1991). In this reference, the learning of the network is carried out by using a large number of actual documents, and the activation values of the nodes are propagated in response to the input information entered by the user as a key, such that the topic of the input document can be comprehended. Then, using the comprehended topic of the input document, the accuracy of the kana-kanji conversion is improved as follows.
Namely, in a case of carrying out the kana-kanji conversion in the Japanese word processor, a series of kanas (Japanese syllabaries) representing a reading of desired words is entered, and an appropriate letter series using both kanas and kanjis (Chinese characters) with appropriate portions of the entered kana series converted into kanjis is returned in response. In this process, it is necessary to select an appropriate kanji among a number of candidate kanjis which have the reading represented by the entered kana series.
In order to enable this selection, there is provided a network indicating proximities of words defined in terms of their usages, in which a link with a large positive link weight value is formed between two nodes corresponding to two words which are frequently used together. Then, when a particular word related to a current topic is selected, an activation value of the node corresponding to the selected word is raised such that the raised activation value is propagated from this node to the other nodes linked with this node, and the activation values of the nodes corresponding to the words related to the current topic are also raised as a result. Here, the propagation of the activation value is carried out through the links, where a link with a larger link weight value can propagate a larger part of the activation value.
Thus, as the input sentences are entered into the Japanese word processor, the activation values are propagated in the network to raise the activation values of the nodes corresponding to those words which are strongly related to a current topic of the input sentences. Then, at a time of the kana-kanji conversion, a homonym having the highest activation value among the homonyms is regarded as the most appropriate for the current topic, and selected as the primary candidate for the kana-kanji conversion.
For example, in a case of the kana-kanji conversion of the input sentences as shown in FIG. 1, the selections of the words A1 ("clock"), A2 ("signal"), and A3 ("processor"), all of which belong to the computer hardware vocabulary A0, have already been made, so that the activation values of the nodes corresponding to the computer hardware vocabulary have been raised. In this state, for the kana-kanji conversion of a reading B0, there are three candidate homonyms C1 ("synchronization"), C2 ("palpitation"), and C3 ("motivation") which share this same reading B0. Here, however, the candidate homonym C1 ("synchronization") has the highest activation value as it belongs to the computer hardware vocabulary A0, so that the kana sequence of the reading B0 will be converted into the kanjis of the homonym candidate C1.
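The homonym selection described above can be sketched as follows. This is a minimal illustration only, not the procedure of the cited reference: the function names `propagate` and `select_homonym`, the propagation rate, the link weight values, and the extra words "heart" and "study" are all hypothetical, and the links are treated as one-directional for brevity.

```python
def propagate(activations, links, rate=0.5):
    """One propagation step: each link passes rate * weight * source activation."""
    updated = dict(activations)
    for (src, dst), weight in links.items():
        updated[dst] = updated.get(dst, 0.0) + rate * weight * activations.get(src, 0.0)
    return updated


def select_homonym(candidates, activations):
    """Pick the candidate whose node currently has the highest activation value."""
    return max(candidates, key=lambda word: activations.get(word, 0.0))


# Hypothetical link weights; a larger weight means the two words co-occur more often.
links = {
    ("clock", "synchronization"): 0.8,
    ("signal", "synchronization"): 0.7,
    ("processor", "synchronization"): 0.6,
    ("heart", "palpitation"): 0.9,
    ("study", "motivation"): 0.8,
}

# The words A1 to A3 have already been selected, so their nodes are activated.
activations = {"clock": 1.0, "signal": 1.0, "processor": 1.0}
activations = propagate(activations, links)

best = select_homonym(["synchronization", "palpitation", "motivation"], activations)
print(best)
```

Under these assumed weights, only the node for "synchronization" receives activation from the already selected words, so it is chosen as the primary candidate.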
This network has a function of associating a topic with the input information, so that it can be considered as equivalent to the associative memory in a form of a neural network.
<<Patterns used in the associative memory>>
In the cognition and comprehension scheme based on the connectionist model such as the kana-kanji conversion described above, the learning and the association operations of the associative memory are carried out by using the patterns defined as follows. (See Japanese Patent Application Laid Open (Kokai) No. 3-179560 (1991) for further detail.)
Namely, the words are assigned the word codes given in terms of consecutive integers from 1 to N, in one to one correspondence, and each pattern is given as a bit pattern in a form of a sequence of N elements, each of which takes a value of either 0 or 1. Here, the bit pattern can be determined by the following procedure.
(1) Divide sentences in units of paragraphs.
(2) Decompose each paragraph into words, and convert each word into its word code.
(3) For each paragraph, a bit of the bit sequence corresponding to the word code of the word contained in this paragraph is set to 1, while a bit of the bit sequence corresponding to the word code of the word not contained in this paragraph is set to 0.
For example, for a very brief paragraph of "The weather is fine today.", let the correspondences between the words and the word codes be as follows.
"The"=1034
"weather"=22378
"is"=123
"fine"=2876
"today"=10120
Then, this paragraph can be represented by the following set of the word codes.
{123, 1034, 2876, 10120, 22378}
Thus, when the total number of words N is equal to 100,000, the bit pattern for this paragraph has 100,000 bits of which 123-th, 1034-th, 2876-th, 10120-th, and 22378-th bits are set to 1, while all the other bits are set to 0.
Here, the bit pattern ignores any redundant appearance of the words, so that even when the same word appears more than once in the same paragraph, the bit pattern is unaffected.
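The construction of the bit pattern from the word codes can be sketched as follows. This is an illustration under stated assumptions: the function name `paragraph_to_bit_pattern` is hypothetical, the paragraph is assumed to be already decomposed into words, and the word codes are those of the example above.

```python
def paragraph_to_bit_pattern(paragraph_words, word_codes, n_words):
    """Return a list of n_words bits; bit (code - 1) is 1 if the word occurs."""
    pattern = [0] * n_words
    for word in paragraph_words:
        code = word_codes[word]   # word codes run from 1 to N
        pattern[code - 1] = 1     # repeated words leave the bit unchanged
    return pattern


word_codes = {"The": 1034, "weather": 22378, "is": 123, "fine": 2876, "today": 10120}
words = ["The", "weather", "is", "fine", "today"]

pattern = paragraph_to_bit_pattern(words, word_codes, 100_000)

# Recover the set of word codes whose bits are set to 1.
active_codes = [index + 1 for index, bit in enumerate(pattern) if bit == 1]
print(active_codes)
```

Because each bit is only ever set to 1, a word that appears twice in the paragraph produces the same pattern as a word that appears once.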
Now, it is quite inevitable for the patterns so determined to have some noise or non-randomness. For example, in the sentence "I bought a book yesterday.", there is no direct semantic relationship between the words "book" and "yesterday", yet the bit pattern for this sentence will have the bits corresponding to these words "book" and "yesterday" set to 1 as they both appear in the same sentence, so as to introduce the noise or non-randomness into the patterns.
Consequently, as already mentioned above, the conventional associative memories which are suited only for the learning using random patterns are inapplicable to the network used in the kana-kanji conversion.
<<Characteristics of the patterns in the connectionist model>>
Now, in the connectionist model, the above described patterns, to be used in the learning of the network dealing with the natural language using the actual documents, must have the following characteristics.
P1: There is no noiseless, exactly correct memorized pattern (in which 1/0 bit is allocated to a node representing a symbol in the network).
There are many patterns (actual documents) that can be used as the learning patterns, but almost all of these contain some noise (words unrelated to the topic). For this reason, it is necessary to learn generalized patterns in which the effect of the noise is removed, by using a very large number of the patterns. In other words, it is not absolutely necessary to memorize the exact patterns.
P2: Each pattern contains only a very small fraction of all the available words.
There are several hundred thousand words used in Japanese, and any individual may use several thousand of them, but one paragraph usually contains only several hundred words. Such a pattern containing only a small fraction of all the available words is called a sparsely encoded pattern.
P3: There are large differences among the frequencies of appearances for different words.
For example, a demonstrative pronoun "this" appears very frequently, regardless of the topic of the sentence, but a specialized word "postscript" appears only very rarely in the specialized context alone. The frequency of appearance of any word can be heavily dependent on the topic of the input sentences, but there are those words having very high frequencies of appearances as well as those words having low frequencies of appearance, regardless of the topic of the sentences.
P4: It must be possible to carry out additional learnings.
In the connectionist model, a large number of learning patterns (actual documents) are going to be given one after another. In order to cope with this situation, it must be possible to carry out the additional learnings. Namely, the learning of the additional patterns must be made easily.
P5: There is a non-randomness in the frequencies of appearances for patterns.
It is impossible to collect the patterns uniformly for all the possible topics, without specifying any particular topic. For example, when the patterns are collected from a newspaper, it is likely that one hundred patterns related to politics are collected while only ten patterns related to science and technology are collected. In such a case of having a non-randomness in the frequencies of appearances, there is actually no way of telling which topic appeared how frequently.
P6: There is a non-randomness in the correlations among the patterns.
In general, the topics are not totally independent, and which topics are strongly related with each other depends on the situations of the sentences. In particular, it is necessary to note that there is a non-randomness in the correlations depending on the topics. For example, politics and science and technology may not be so strongly correlated, but politics and the economy can be quite strongly correlated.
<<Conventionally available associative memory>>
The associative memory suitable for the connectionist model should be capable of grasping the topic from a group of the words that are frequently appearing together in a large number of sentences, such that the group of the words related to a topic of an input sentence can be presented.
Now, for the associative memory in a form of neural network, there are two manners of learning including an orthogonal learning and a correlational learning. However, the orthogonal learning requires the learning patterns without any noise, and it is also not suitable for the additional learning. As already mentioned above, the associative memory for the connectionist model should be able to deal with noise, and learn the generic pattern from a number of learning patterns containing noise by generalization, so that the correlational learning is more appropriate for this purpose.
In the correlational learning of the associative memory, a matrix to be used for representing the network can be chosen from a correlation matrix and a covariance matrix. However, it is known that the associative memory using the correlation matrix cannot memorize the sparsely encoded patterns.
Thus, it can be concluded that, for the associative memory to be used in the connectionist model, the associative memory using the covariance matrix is most appropriate among the conventionally available associative memories.
<<The associative memory using the covariance matrix>>
The associative memory using the covariance matrix has been proposed for the purpose of memorizing the random sparsely encoded patterns. (See S. Amari, "Neural Theory of Association and Concept-Formation", Biological Cybernetics, Vol. 26, pp. 175-185, 1977; S. Amari, "Characteristics of Sparsely Encoded Associative Memory", Neural Networks, Vol. 2, pp. 451-457, 1989; and C. J. Perez-Vicente, "Finite-Size Capacity of Sparse-Coding Models", Europhysics Letters, Vol. 10, pp. 627-631, 1989, for further details.) This associative memory sequentially selects one node at random from N nodes, and updates the activation value according to the following expression (1).
##EQU1##
where V_j is an activation value of the j-th node, a_i is an activation probability for the activation value V_i of the i-th node to be 1, I_j is a threshold for the j-th node, f is a threshold function which can be expressed by the following equation (2):
##EQU2##
and W_ji is a link weight value for a link between the j-th node and the i-th node which is updated according to the following expression (3):
##EQU3##
where Δ is a small constant 0 < Δ << 1.
This updating of the activation value is repeated until the activation values of all the N nodes become stable.
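The repeated updating of the activation values can be sketched as follows. Since the expressions (1) and (2) are not reproduced in this text, the sketch assumes the form commonly used for a covariance-matrix associative memory, namely V_j ← f(Σ_i W_ji (V_i - a_i) - I_j) with f a unit step function; both forms, the function names, and the example values are assumptions for illustration. The sketch also visits every node once per sweep in a random order, which is a simplification of selecting one node at random at a time.

```python
import random


def threshold(x):
    """Assumed form of the threshold function f: a unit step."""
    return 1 if x > 0 else 0


def recall(V, W, a, I, max_sweeps=100, rng=None):
    """Update nodes in random order until a full sweep changes no activation value."""
    rng = rng or random.Random(0)
    V = list(V)
    n = len(V)
    for _ in range(max_sweeps):
        changed = False
        for j in rng.sample(range(n), n):      # random visiting order
            net = sum(W[j][i] * (V[i] - a[i]) for i in range(n)) - I[j]
            new_value = threshold(net)
            if new_value != V[j]:
                V[j] = new_value
                changed = True
        if not changed:                        # all N nodes are stable
            break
    return V


# Hypothetical 4-node network: nodes 0 and 1 support each other, as do nodes 2 and 3.
W = [
    [0, 1, -1, -1],
    [1, 0, -1, -1],
    [-1, -1, 0, 1],
    [-1, -1, 1, 0],
]
a = [0.5, 0.5, 0.5, 0.5]
I = [0.0, 0.0, 0.0, 0.0]

recalled = recall([1, 0, 0, 0], W, a, I)
print(recalled)
```

In this small example, activating node 0 alone is enough for the network to settle into the state in which nodes 0 and 1 are both active.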
More specifically, the learning of links in the conventional associative memory using the covariance matrix will be described in detail with reference to FIG. 2 and FIG. 3, where FIG. 2 shows a configuration of an apparatus for learning of the links in the associative memory, and FIG. 3 shows a flow chart for the learning operation of this apparatus of FIG. 2.
In this apparatus of FIG. 2, the link weight values of the network and the activation probabilities at the nodes are learned. To this end, the apparatus comprises: a pattern presentation unit 41 for presenting each pattern to be learned; an activation probability updating unit 42 for updating the activation probability of each node according to the pattern presented by the pattern presentation unit 41 and the former activation probabilities; an activation probability storing unit 43 for storing the activation probabilities updated by the activation probability updating unit 42 and supplying formerly stored activation probabilities as the former activation probabilities to the activation probability updating unit 42 at a time of updating; an activation probability read out unit 44 for reading out the activation probabilities stored in the activation probability storing unit 43; a link weight value learning unit 45 for learning the link weight value of each link according to the pattern presented by the pattern presentation unit 41, the former link weight values, and the former activation probabilities supplied from the activation probability storing unit 43; a link weight value storing unit 46 for storing the link weight values learned by the link weight value learning unit 45 and supplying formerly stored link weight values as the former link weight values to the link weight value learning unit 45 at a time of learning; a link weight value read out unit 47 for reading out the link weight values stored in the link weight value storing unit 46; and an initialization commanding unit 48 for issuing an initialization command for resetting the activation probabilities stored in the activation probability storing unit 43 and the link weight values stored in the link weight value storing unit 46 to initial values 0.
More specifically, in this apparatus of FIG. 2, the learning operation is carried out according to the flow chart of FIG. 3 as follows.
Namely, indices j and i are initialized to 1 at the steps 501 and 502, respectively, and the link weight value W_ji is initialized to 0 at the step 503.
Next, the index i is incremented by one at the step 504, and whether the incremented index i is less than or equal to N or not is determined at the step 505, and then the steps 503 and 504 are repeated until the index i reaches N.
After the steps 503 and 504 are repeated for the index i equal to N, an activation probability a_j is initialized to 0 at the step 506 and the index j is incremented by one at the step 507. Then, whether the incremented index j is less than or equal to N or not is determined at the step 508, and then the steps 502 to 508 are repeated until the index j reaches N so as to complete the initialization routine of the steps 501 to 508.
Next, at the step 509, whether there is any pattern to be learned or not is determined such that the learning routine of the following steps 510 to 522 is repeated as long as there is a pattern to be learned.
At the step 510, the pattern V to be learned is entered, and the indices j and i are initialized to 1 again at the steps 511 and 512, respectively.
Then, whether the indices i and j are equal to each other or not is determined at the step 513. In a case these indices i and j are not equal, next at the step 514, the link weight value W_ji is updated according to the expression (3) for a case of i ≠ j, described above. This step 514 is skipped in a case these indices i and j are equal, as W_ii is to remain at zero according to the expression (3).
Next at the step 515, the index i is incremented by one, and whether the incremented index i is less than or equal to N or not is determined at the step 516, and then the steps 513 to 516 are repeated until the index i reaches N.
After the steps 513 to 516 are repeated for the index i equal to N, the index j is incremented by one at the step 517, and whether the incremented index j is less than or equal to N or not is determined at the step 518, and then the steps 512 to 518 are repeated until the index j reaches N so as to complete the updating of the link weight value W_ji.
Next, at the step 519, the index j is initialized to 1 again, and then, at the step 520, the activation probability a_j is updated according to the following expression (4).
a_j ← (1 - Δ) a_j + Δ V_j   (4)
Next, at the step 521, the index j is incremented by one, and whether the incremented index j is less than or equal to N or not is determined at the step 522, and then the steps 520 to 522 are repeated until the index j reaches N so as to complete the updating of the activation probability a_j.
After the steps 520 to 522 are repeated for the index j equal to N, the operation returns to the step 509 described above, so as to proceed to the learning routine of the steps 510 to 522 for a next pattern to be learned.
When the learning routine is completed for all the patterns to be learned, the link weight values no longer change so that the operation proceeds to the step 523 at which the updated link weight value W_ji and the updated activation probability a_j are stored in the link weight value storing unit 46 and the activation probability storing unit 43, respectively, in correspondence with each other, and the learning operation terminates.
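The learning operation of FIG. 3 can be sketched as follows. The update of the activation probability follows the expression (4). The expression (3) is not reproduced in this text, so the sketch assumes the form W_ji ← (1 - Δ) W_ji + Δ (V_j - a_j)(V_i - a_i) for i ≠ j, with W_ii kept at 0; the update term and the zero diagonal follow the description, while the decay factor (1 - Δ) is an assumption made by analogy with the expression (4). The function name `learn` and the example patterns are hypothetical.

```python
def learn(patterns, n, delta=0.01):
    """Learn link weights W and activation probabilities a from 0/1 patterns."""
    # Initialization routine (steps 501 to 508): all values start at 0.
    W = [[0.0] * n for _ in range(n)]
    a = [0.0] * n

    for V in patterns:                         # steps 509 and 510
        # Steps 511 to 518: update every off-diagonal link weight, using the
        # former activation probabilities.
        for j in range(n):
            for i in range(n):
                if i != j:                     # step 513: W_ii remains at zero
                    # Assumed form of expression (3); the decay term is a guess.
                    W[j][i] = (1 - delta) * W[j][i] + delta * (V[j] - a[j]) * (V[i] - a[i])
        # Steps 519 to 522: update the activation probabilities by expression (4).
        for j in range(n):
            a[j] = (1 - delta) * a[j] + delta * V[j]

    return W, a                                # step 523: values to be stored


# Hypothetical patterns over 3 nodes: nodes 0 and 1 tend to appear together.
W, a = learn([[1, 1, 0], [1, 1, 0], [0, 0, 1]], n=3, delta=0.1)
print(W[0][1], W[0][2], a)
```

Because each pattern only adjusts the stored values slightly, further patterns can be learned later by continuing the same loop from the stored W and a, which is the additional learning referred to above.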
Now, it can be verified that, in the above described learning scheme of the conventional associative memory using the covariance matrix, the first four of the six characteristics P1 to P6 of the patterns in the connectionist model described above can be accounted for. Namely, the memorized pattern is learned as the link weight value so that the characteristic P1 is accounted for. Also, this associative memory using the covariance matrix was originally devised for the memorization of the random sparsely encoded patterns, so that it is certainly suitable for the sparsely encoded patterns as required by the characteristic P2. Also, the characteristic P3 is accounted for by the introduction of the activation probability a_i for the activation value. Also, the additional learning of the memorization pattern is possible as can be seen in the expression (3), so that the characteristic P4 is also accounted for.
However, the last two of the six characteristics P1 to P6 of the patterns in the connectionist model cannot be accounted for in the above described learning scheme of the conventional associative memory using the covariance matrix.
First, this associative memory is not suitable for a case in which the frequencies of appearances for the patterns are non-random as required by the characteristic P5 for the following reason. Namely, this associative memory has a tendency to memorize only the frequently appearing patterns. This fact can be seen from the expression (3) for updating the link weight value in which an updating value (V_j - a_j)(V_i - a_i) for the newly presented pattern is added to the former link weight value W_ji every time a new pattern is presented. As a consequence, the more frequently appearing pattern will have the larger link weight value, so that the more frequently appearing pattern is more likely to be recalled, while the less frequently appearing pattern is less likely to be recalled because the less frequently appearing pattern has been learned only very rarely.
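This tendency can be illustrated numerically with the following sketch. It is a simplified batch calculation rather than the incremental learning described above: the updating value (V_j - a_j)(V_i - a_i) is simply summed over all the patterns, and the activation probabilities are fixed to the appearance frequencies of the nodes. The function name `covariance_weights`, the three patterns, and their numbers of appearances are hypothetical.

```python
def covariance_weights(patterns):
    """Sum the updating value (V_j - a_j)(V_i - a_i) over all presented patterns."""
    n = len(patterns[0])
    count = len(patterns)
    # Activation probability of each node, taken as its appearance frequency.
    a = [sum(p[i] for p in patterns) / count for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for V in patterns:
        for j in range(n):
            for i in range(n):
                if i != j:
                    W[j][i] += (V[j] - a[j]) * (V[i] - a[i])
    return W


pattern_a = [1, 1, 0, 0, 0, 0]   # frequently appearing pattern
pattern_b = [0, 0, 1, 1, 0, 0]   # rarely appearing pattern
other = [0, 0, 0, 0, 1, 1]       # stands in for all the remaining topics

patterns = [pattern_a] * 100 + [pattern_b] * 10 + [other] * 890
W = covariance_weights(patterns)

weight_a = W[0][1]   # link inside the frequent pattern
weight_b = W[2][3]   # link inside the rare pattern
print(weight_a, weight_b)
```

Under these assumed numbers, the link inside the pattern presented one hundred times accumulates roughly nine times the weight of the link inside the pattern presented ten times, so the frequent pattern dominates the recall.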
This fact that this associative memory has a tendency to memorize only the frequently appearing patterns can be explained in terms of the so called energy surfaces for the patterns. Here, the energy E of a pattern V is defined by the following equation (5). ##EQU4## where k is assumed to be a negative constant coefficient in the following. Namely, as shown in FIG. 4A in which the energy surfaces (reduced to two dimensions for simplicity) for the patterns V are plotted, the energy is lower for the more frequently appearing pattern-A compared with the less frequently appearing pattern-B. The associative memory is more likely to fall into the lower energy state, so that only the more frequently appearing pattern-A having the lower energy will most likely be recalled while the less frequently appearing pattern-B having the higher energy will be highly unlikely to be recalled.
Next, this associative memory is also not suitable for a case in which the correlations among the patterns are non-random as required by the characteristic P6. In fact, this associative memory is suited only to a case of a uniform correlation among the patterns such as that of the random sparsely encoded patterns.
In this case, the problem is actually twofold. In the first place, there is a problem that an intermediate pattern of a plurality of strongly correlated learning patterns is recalled rather than the desired pattern. However, this problem is not so serious because this problem itself is absent when the correlation is not so strong, while the intermediate pattern will not be largely different from the desired pattern when the correlation is indeed strong. Far more serious is the problem that a pattern which is only weakly correlated with the others cannot be recalled. This fact can also be explained in terms of the energy surfaces for the patterns. Namely, when three patterns-1, -2, and -3 have appeared at the same frequency, the energies of these patterns-1, -2, and -3 are as shown in FIG. 5A. Consequently, the intermediate pattern of the strongly correlated patterns-1 and -2 will be recalled very easily as it is located at a very low energy, but the pattern-3 which is very weakly correlated with the patterns-1 and -2 will be very difficult to recall.
Thus, in the conventional associative memory, it has been difficult to make the appropriate learning for facilitating the desired association function when the patterns have the non-random frequencies of appearances or the non-random correlations.
In other words, the association by an artificial neural network is achieved by updating (propagating) the activation values of the nodes (neurons) such that the energy E is minimized. For this reason, the activation value of each node usually falls into the lower energy states. When the patterns are random, the energies of the patterns are almost identical, so that the activation values of the nodes will fall into the appropriate patterns. However, when the patterns are non-random, the energies of the patterns are diversified, so that the activation values of the nodes will most likely fall into the patterns with the lower energies and the patterns with the higher energies will be difficult to recall.
It is to be noted here that this situation of the conventional associative memories remains the same even when a constant coefficient k is set to be positive.