One example of a similar text search system according to the related art is described in Patent Literature 1. That system groups the similar example sentences needed for translating an input sentence and displays them in an easy-to-see manner. The similar text search system described in Patent Literature 1 calculates the similarity between an input sentence and pre-accumulated example sentences by using a technique called “DP matching.” The system then outputs similar example sentences based on the calculated similarities. The DP matching technique is described in Non-Patent Literature 1.
Patent Literature 1: Japanese Patent Laying-Open No. 2006-106474 publication (paragraphs 0016-0024, FIGS. 3-4)
Non-Patent Literature 1: “Acoustic and Audio Engineering” by Sadaoki Furui, Kindai Kagakusha
If the technique described in Non-Patent Literature 1 is used, a similarity can be obtained through DP matching by calculating the expression (1) (refer to the expression (14.14) on p. 184 in Non-Patent Literature 1).
g (i, j) = min{ g (i, j−1) + d (i, j), g (i−1, j−1) + 2d (i, j), g (i−1, j) + d (i, j) }    Expression (1)
Here, in the expression (1), d (i, j) is the distance between the i-th element x [i] (1≦i≦I) in the sequence X and the j-th element y [j] (1≦j≦J) in the sequence Y (hereinafter also referred to as the “local distance”). Suppose, for example, that X is a string “SHI KYU KA SHI TE KU DA SA I” (Lend me as soon as possible) and Y is a string “KA SHI TE KU DA SA I” (Lend me). FIG. 5 shows an example of local distances d (i, j) between the elements of the sequence X and the elements of the sequence Y. In the example shown in FIG. 5, d (i, j)=0 if x [i] and y [j] are the same character, and d (i, j)=1 otherwise.
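The local distance rule above can be sketched in a few lines of Python. This is only an illustration: the function name `local_distances` is hypothetical, and the sketch assumes that each space-separated romanized syllable is one element of the sequence.

```python
def local_distances(x, y):
    """Return d as a list of rows, where d[i][j] is 0 if x[i] == y[j] and 1 otherwise.

    Indices here are 0-based, whereas the text uses 1-based indices.
    """
    return [[0 if xi == yj else 1 for yj in y] for xi in x]


# Assumption: elements are the space-separated syllables of the example strings.
X = "SHI KYU KA SHI TE KU DA SA I".split()
Y = "KA SHI TE KU DA SA I".split()
d = local_distances(X, Y)

for xi, row in zip(X, d):
    print(f"{xi:>3}", row)
```

The result is an I-by-J table with one row per element of X and one column per element of Y.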
Given the local distances d (i, j), g (I, J) can be obtained by calculating the distance g (i, j) sequentially, starting from g (1, 1), according to the expression (1) (hereinafter the distance g (i, j) will also be referred to as the “cumulative distance”). The value of g (I, J) thus obtained indicates the similarity between the two sequences X and Y; because it is a distance, a smaller value indicates that the sequences are more similar.
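A minimal sketch of the cumulative distance calculation follows. The text does not state the initial value, so the sketch assumes g (1, 1) = d (1, 1) and simply skips terms of the expression (1) that would fall outside the table; the function name `cumulative_distance` is hypothetical.

```python
import math


def cumulative_distance(x, y):
    """Return g(I, J) for two non-empty sequences, using expression (1)."""
    I, J = len(x), len(y)
    g = [[math.inf] * J for _ in range(I)]
    for j in range(J):
        for i in range(I):
            dij = 0 if x[i] == y[j] else 1  # local distance d(i, j)
            if i == 0 and j == 0:
                g[i][j] = dij  # assumed boundary condition: g(1, 1) = d(1, 1)
                continue
            candidates = []
            if j > 0:
                candidates.append(g[i][j - 1] + dij)          # g(i, j-1) + d(i, j)
            if i > 0 and j > 0:
                candidates.append(g[i - 1][j - 1] + 2 * dij)  # g(i-1, j-1) + 2d(i, j)
            if i > 0:
                candidates.append(g[i - 1][j] + dij)          # g(i-1, j) + d(i, j)
            g[i][j] = min(candidates)
    return g[I - 1][J - 1]


print(cumulative_distance(list("ABC"), list("AC")))   # 1
print(cumulative_distance(list("ABC"), list("ABC")))  # 0
```

Identical sequences give 0, and each element that has no counterpart in the other sequence adds to the result.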
Next, an example configuration of a similar text search system will be described with reference to the attached drawings. FIG. 1 is a block diagram showing an example configuration of a similar sentence search system. As shown in FIG. 1, the similar text search system includes a similarity calculation unit 1 which calculates the similarity between an input sentence and an example sentence, an example sentence storage unit 2 which stores example sentences to be searched and a similarity storage unit 3 which stores similarities calculated by the similarity calculation unit 1.
The similarity calculation unit 1 has functions to calculate the similarity between an input sentence and each of the example sentences stored by the example sentence storage unit 2 and to pass (or output) the resultant similarity to the similarity storage unit 3. The example sentence storage unit 2 has a function to pass (or output) the example sentences that it stores, one by one, to the similarity calculation unit 1. The similarity storage unit 3 has a function to store the similarities calculated by the similarity calculation unit 1. The similarity storage unit 3 also has a function to output example sentences with high stored similarities.
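The interplay of the three units can be sketched as a simple loop: each stored example sentence is scored against the input sentence, the scores are kept, and the closest examples are returned. The names `dp_distance` and `search`, the `top_n` parameter, and the boundary condition g (1, 1) = d (1, 1) are assumptions made for this illustration, as is the third example sentence, which merely duplicates the input.

```python
import math


def dp_distance(x, y):
    """DP matching distance g(I, J), assuming g(1, 1) = d(1, 1)."""
    I, J = len(x), len(y)
    g = [[math.inf] * J for _ in range(I)]
    for j in range(J):
        for i in range(I):
            dij = 0 if x[i] == y[j] else 1
            if i == 0 and j == 0:
                g[i][j] = dij
                continue
            left = g[i][j - 1] + dij if j > 0 else math.inf
            diag = g[i - 1][j - 1] + 2 * dij if i > 0 and j > 0 else math.inf
            up = g[i - 1][j] + dij if i > 0 else math.inf
            g[i][j] = min(left, diag, up)
    return g[I - 1][J - 1]


def search(input_sentence, examples, top_n=3):
    """Score every example (similarity storage) and return the closest ones."""
    query = input_sentence.split()
    scored = [(dp_distance(example.split(), query), example) for example in examples]
    scored.sort(key=lambda pair: pair[0])  # smaller distance = more similar
    return scored[:top_n]


examples = [  # stands in for the example sentence storage unit
    "SHI KYU U KA SHI TE KU DA SA I",
    "A SU WA KYU U KA KU DA SA I",
    "KYU U KA KU DA SA I",  # hypothetical entry identical to the input
]
results = search("KYU U KA KU DA SA I", examples)
for distance, sentence in results:
    print(distance, sentence)
```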
Next, the configuration of the similarity calculation unit 1 will be described. FIG. 9 is a block diagram showing an example configuration of the similarity calculation unit. As shown in FIG. 9, the similarity calculation unit 1 of the similar text search system includes an input string storage unit 911, a local distance calculation unit 912, a local distance storage unit 913, a cumulative distance calculation unit 915 and a cumulative distance storage unit 916.
The input string storage unit 911 stores an input sentence and an example sentence to be subjected to similarity calculation. The local distance calculation unit 912 has a function to calculate local distances d (i, j) based on the string stored by the input string storage unit 911. The local distance storage unit 913 stores the local distances d (i, j) calculated by the local distance calculation unit 912. The cumulative distance calculation unit 915 has a function to calculate a new g (i, j) based on the d (i, j) stored by the local distance storage unit 913 and the g (i, j) stored by the cumulative distance storage unit 916. The cumulative distance storage unit 916 stores the g (i, j) value calculated by the cumulative distance calculation unit 915.
In the example shown in FIG. 9, the input string storage unit 911 stores an input sentence and an example sentence to be subjected to similarity calculation, and the local distance calculation unit 912 calculates for all of the points (i, j) the local distance d (i, j) between each element of the input sentence and each element of the example sentence stored in the input string storage unit 911. The local distance storage unit 913 also stores all the d (i, j) values calculated by the local distance calculation unit 912. The cumulative distance calculation unit 915 sequentially calculates a new g (i, j) value based on the d (i, j) values stored by the local distance storage unit 913 and the g (i, j) values stored by the cumulative distance storage unit 916. The cumulative distance storage unit 916 then stores the g (i, j) values calculated by the cumulative distance calculation unit 915 and, on completion of calculating all of the g (i, j) values, outputs g (I, J) as the similarity between the input sentence and example sentence.
Next, the operation of the similarity calculation unit 1 of the similar sentence search system will be described. FIG. 10 is a flow chart which shows an example of the similarity calculation process performed by the similarity calculation unit 1 to calculate the similarity between an input sentence and an example sentence. This example assumes that the similar text search system has a string “KYU U KA KU DA SA I” (Let me take a leave of absence) as an input sentence Y and a string “SHI KYU U KA SHI TE KU DA SA I” (Lend me as soon as possible) as an example sentence X. When the sentences X and Y have been passed (inputted), the similarity calculation unit 1 temporarily stores the sentences X and Y in the input string storage unit 911.
Next, the local distance calculation unit 912 calculates the local distance d (i, j) between each element x [i] of X and each element y [j] of Y (Step S91 in FIG. 10). FIG. 5 is an illustrative diagram which shows examples of the calculation results of local distances obtained by the local distance calculation unit 912. The similar text search system stores all the calculation results as shown in FIG. 5 in the local distance storage unit 913. The local distance calculation unit 912 performs the local distance calculation on all of the points (i, j) which satisfy 1≦i≦I and 1≦j≦J. In other words, the local distance calculation unit 912 repeats the process of Step S91 until all of the points have been calculated (Step S92 in FIG. 10).
Next, the cumulative distance calculation unit 915 calculates g (i, j) based on the expression (1) (Step S94 in FIG. 10). The similar text search system stores the results of calculating g (i, j) in the cumulative distance storage unit 916.
The path obtained by tracing back from g (I, J), at each point, the term selected as the minimum in the expression (1) is referred to as a “DP path.” A DP path indicates the partial correspondence between the sequences X and Y identified during the calculation of similarity. FIG. 11 is an illustrative diagram which shows an example of a DP path. In the example of FIG. 11, the path indicated by the arrows within the figure represents the DP path. For example, “KYU U” (“soon” in “as soon as”) in X corresponds to “KYU U” (“leave” in “leave of absence”) in Y, while “KA SHI TE” (lend me) in X corresponds to “KA” (“absence” in “leave of absence”) in Y.
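The trace-back can be sketched by recording, for each point, which term of the expression (1) was selected as the minimum. The sketch below assumes g (1, 1) = d (1, 1) as the starting value and reports the path in the 1-based (i, j) indices used in the text; the function name `dp_path` is hypothetical.

```python
import math


def dp_path(x, y):
    """Return (g(I, J), DP path) where the path is a list of 1-based (i, j) points."""
    I, J = len(x), len(y)
    g = [[math.inf] * J for _ in range(I)]
    back = [[None] * J for _ in range(I)]  # predecessor chosen by the min in (1)
    for j in range(J):
        for i in range(I):
            dij = 0 if x[i] == y[j] else 1
            if i == 0 and j == 0:
                g[i][j] = dij  # assumed boundary condition
                continue
            options = []
            if j > 0:
                options.append((g[i][j - 1] + dij, (i, j - 1)))
            if i > 0 and j > 0:
                options.append((g[i - 1][j - 1] + 2 * dij, (i - 1, j - 1)))
            if i > 0:
                options.append((g[i - 1][j] + dij, (i - 1, j)))
            g[i][j], back[i][j] = min(options, key=lambda option: option[0])
    path = []
    cell = (I - 1, J - 1)
    while cell is not None:  # follow predecessors back to (1, 1)
        path.append((cell[0] + 1, cell[1] + 1))
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return g[I - 1][J - 1], path


X = "SHI KYU U KA SHI TE KU DA SA I".split()
Y = "KYU U KA KU DA SA I".split()
distance, path = dp_path(X, Y)
for i, j in path:
    print(X[i - 1], "<->", Y[j - 1])
```

Under these assumptions, the printed pairs show “KA”, “SHI” and “TE” of X all paired with “KA” of Y, which matches the correspondence described for FIG. 11.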
In the calculation at Step S94 above, the cumulative distance calculation unit 915 starts from g (1, 1) and calculates each g (i, 1) value by sequentially incrementing “i” by 1. When g (I, 1) has been calculated, the cumulative distance calculation unit 915 increments “j” by 1, returns to i=1, and calculates each g (i, 2) value by incrementing “i” by 1 from g (1, 2). Thereafter, the cumulative distance calculation unit 915 repeats the process of Step S94 until all of the g (i, j) values have been calculated (Step S95 in FIG. 10).
FIG. 12 is an illustrative diagram which shows examples of the calculation results of g (i, j) obtained by the cumulative distance calculation unit 915. After all of the values of g (i, j) have been calculated, the cumulative distance storage unit 916 outputs the value “3” of g (I, J) (i.e., the value in the bottom-right box shown in FIG. 12) as the similarity between “KYU U KA KU DA SA I” (Let me take a leave of absence) and “SHI KYU U KA SHI TE KU DA SA I” (Lend me as soon as possible).
Similarly, it is assumed that the similar text search system has a string “KYU U KA KU DA SA I” (Let me take a leave of absence) as an input sentence Y and a string “A SU WA KYU U KA KU DA SA I” (Let me take a leave of absence tomorrow) as an example sentence X. When the sentences X and Y have been passed (inputted), the local distance calculation unit 912 calculates the local distances d (i, j), as shown in FIG. 7. The cumulative distance calculation unit 915 then calculates the cumulative distances g (i, j) as shown in FIG. 13, based on the d (i, j) values shown in FIG. 7. The cumulative distance storage unit 916 then outputs the value “3” of g (I, J) as the similarity between the sentences X and Y described above.
However, even with a similarity search using the DP matching technique, it may not always be possible to properly determine the similarity between two sentences. For example, when “KYU U KA KU DA SA I” (Let me take a leave of absence) and “SHI KYU U KA SHI TE KU DA SA I” (Lend me as soon as possible) are matched with each other as described above (FIG. 11), “KA” in “KYU U KA” (“absence” in “leave of absence”) is matched with “KA SHI TE” (Lend me), which is an unnatural correspondence from the perspective of semantic or grammatical delimitation. Therefore, cases may often occur in which the similarity between two relatively similar sentences, such as “KYU U KA KU DA SA I” (Let me take a leave of absence) and “A SU WA KYU U KA KU DA SA I” (Let me take a leave of absence tomorrow), is the same as the similarity between two non-similar sentences, such as “KYU U KA KU DA SA I” (Let me take a leave of absence) and “SHI KYU U KA SHI TE KU DA SA I” (Lend me as soon as possible). In this example, both sentence pairs have a similarity of 3. In such cases, sufficient search accuracy cannot be achieved by performing a similar sentence search using the similar text search system.
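The tie described above can be reproduced with a short sketch. It assumes that each space-separated syllable is one element and that the calculation starts from g (1, 1) = d (1, 1), which the text does not state explicitly; the function name `dp_matching` is hypothetical. The sketch keeps only the previous column of cumulative distances, which is enough when the DP path itself is not needed.

```python
import math


def dp_matching(x, y):
    """Return g(I, J) from expression (1), keeping one column of g at a time."""
    I = len(x)
    prev = None  # column j-1 of g
    for j, yj in enumerate(y):
        col = [math.inf] * I
        for i, xi in enumerate(x):
            dij = 0 if xi == yj else 1
            if i == 0 and j == 0:
                col[i] = dij  # assumed boundary condition
                continue
            left = prev[i] + dij if j > 0 else math.inf
            diag = prev[i - 1] + 2 * dij if i > 0 and j > 0 else math.inf
            up = col[i - 1] + dij if i > 0 else math.inf
            col[i] = min(left, diag, up)
        prev = col
    return prev[I - 1]


query = "KYU U KA KU DA SA I".split()
similar = "A SU WA KYU U KA KU DA SA I".split()
dissimilar = "SHI KYU U KA SHI TE KU DA SA I".split()

g_similar = dp_matching(similar, query)
g_dissimilar = dp_matching(dissimilar, query)
print(g_similar, g_dissimilar)  # 3 3
```

Under these assumptions both pairs yield 3, so a ranking based only on g (I, J) cannot tell the two example sentences apart.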
An object of the present invention is to provide a similar text search method, a similar text search system and a similar text search program which enable accurate search for similar sentences. Another object of the present invention is to provide a similar text calculation method, a similar text calculation system and a similar text calculation program which enable higher accuracy calculation of similarity.