Word alignment is an essential task in Statistical Machine Translation (SMT). FIG. 1 shows an example of word alignment.
Referring to FIG. 1, consider a bilingual sentence pair 20 of Japanese and English. Each sentence of bilingual sentence pair 20 is segmented word by word in advance. Bilingual sentence pair 20 includes a Japanese sentence 30 (“watashi|ga|riyou|ryoukin|wo|harau” (“|” represents word segmentation)), and an English sentence 32 (“I pay usage fees.”). Word alignment is a process of estimating, for example, for each word forming Japanese sentence 30, the word (or word group) of English sentence 32 into which it is translated; in other words, which word of Japanese sentence 30 corresponds to which word (or word group) of English sentence 32. While FIG. 1 shows word alignment from Japanese to English, word alignment from English to Japanese is done in a similar manner.
In SMT, such word alignment plays a very important role. SMT prepares a bilingual corpus including a large number of bilingual pairs such as described above. Word alignment is done for each bilingual pair. Based on the word alignment, a translation model is built through a statistical process. This process is referred to as translation model training. In short, the translation model represents, in the form of probability, which word of one language would be translated to which word of the other language. In SMT, when a sentence of a source language is given, a number of candidate sentences of the translation target language (target language) are prepared; the probability that the source language sentence is generated from each candidate sentence of the target language is computed, and the target language sentence that has the highest probability is estimated to be the translation of the source language sentence. In this process, the translation model mentioned above is used.
Clearly, a translation model with higher precision is necessary to improve SMT performance. For this purpose, it is necessary to improve the word alignment precision of the bilingual corpus used in translation model training. Therefore, in order to improve SMT performance, it is desired to improve the performance of a word alignment apparatus that performs word alignment of bilingual pairs.
Prevalently used word alignment methods include the IBM model (see Non-Patent Literature 1 below) and the HMM model (see Non-Patent Literature 2). These are generative models: they assume that word alignment is generated in accordance with a certain probability distribution, and estimate (learn) that probability distribution from actually observed word alignment. Given a source language sentence f1J=f1, . . . , fJ and a target language sentence e1I=e1, . . . , eI, the sentence f1J of the source language is generated from the sentence e1I of the target language via the word alignment a1J, and the probability of this generation is computed in accordance with Equation (1) below. In Equation (1), each aj is a hidden variable indicating that the source language word fj is aligned to the target language word ea_{j}. In the following text, an underscore “_” indicates that a subscript itself carries a further subscript, and the braces “{ }” following the underscore delimit that nested subscript. Specifically, the notation “ea_{j}” indicates that the subscript of “e” is aj; the notation “ea_{j}−1” indicates that the subscript of “e” is aj minus 1, that is, the position immediately before aj; and the notation “ea_{j−1}” indicates that the subscript of “e” is the alignment variable a whose own subscript is j−1.
$$p(f_1^J \mid e_1^I) = \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I) \tag{1}$$
$$p(f_1^J, a_1^J \mid e_1^I) = \prod_{j=1}^{J} p_a(a_j \mid a_{j-1}, j) \cdot p_t(f_j \mid e_{a_j}) \tag{2}$$
In Equation (2), pa is the alignment probability, and pt is the lexical translation probability.
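As an illustration of Equations (1) and (2), the following minimal Python sketch evaluates the joint probability of one given alignment and then obtains the marginal by brute-force enumeration of all alignments. The function names, the start state a0, the uniform alignment probability and the numbers in the toy table are assumptions made only for this example; they are not taken from the cited literature, and null alignment is omitted for brevity.

```python
from itertools import product

def joint_prob(f, e, a, p_a, p_t, a0=0):
    """Equation (2): product over j of p_a(a_j | a_{j-1}, j) * p_t(f_j | e_{a_j})."""
    prob = 1.0
    prev = a0  # assumed start state standing in for a_0
    for j, (f_j, a_j) in enumerate(zip(f, a), start=1):
        prob *= p_a(a_j, prev, j) * p_t(f_j, e[a_j - 1])  # positions are 1-based
        prev = a_j
    return prob

def marginal_prob(f, e, p_a, p_t):
    """Equation (1): sum of joint_prob over every alignment a_1^J in {1..I}^J."""
    positions = range(1, len(e) + 1)
    return sum(joint_prob(f, e, a, p_a, p_t)
               for a in product(positions, repeat=len(f)))

# Toy parameters (made-up numbers, not estimated from any corpus).
e = ["I", "pay"]
f = ["watashi", "harau"]
T = {("watashi", "I"): 0.9, ("harau", "I"): 0.1,
     ("watashi", "pay"): 0.2, ("harau", "pay"): 0.8}

def p_t(f_j, e_i):
    """Lexical translation probability looked up in the toy table T."""
    return T[(f_j, e_i)]

def p_a(a_j, a_prev, j):
    """Uniform alignment probability, used here for illustration only."""
    return 1.0 / len(e)

print(joint_prob(f, e, (1, 2), p_a, p_t))
print(marginal_prob(f, e, p_a, p_t))
```

With these toy values the alignment (1, 2) has a joint probability of approximately 0.18, and the marginal over all four alignments is approximately 0.2475. The enumeration grows as I to the power J, so it is practical only for very short sentences.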
For a bilingual sentence pair (f1J, e1I), these models specify a best alignment ^a (the symbol “^” is originally to be written immediately above the immediately following character) satisfying Equation (3) below, using, for example, the forward-backward algorithm. The best alignment is referred to as the Viterbi alignment.
$$\hat{a} = \operatorname*{arg\,max}_{a_1^J} \; p(f_1^J, a_1^J \mid e_1^I) \tag{3}$$
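Because the probability in Equation (2) factorizes over positions j with dependence only on the preceding alignment variable, the arg max of Equation (3) can be found without enumerating every alignment. The sketch below uses a max-product dynamic program with back pointers, which is one common way to do this; it is not necessarily the exact procedure of the cited literature. The function name, the start state a0 and the toy parameters are assumptions for illustration, and the source sentence is assumed to contain at least one word.

```python
def viterbi_alignment(f, e, p_a, p_t, a0=0):
    """Return (alignment, probability) maximizing Equation (2) over a_1^J."""
    I = len(e)
    # best[i]: highest probability of a partial alignment that ends with a_j = i
    best = {i: p_a(i, a0, 1) * p_t(f[0], e[i - 1]) for i in range(1, I + 1)}
    back = []  # one dict per position j >= 2: best predecessor for each a_j
    for j in range(2, len(f) + 1):
        new_best, pointers = {}, {}
        for i in range(1, I + 1):
            prev = max(best, key=lambda k: best[k] * p_a(i, k, j))
            new_best[i] = best[prev] * p_a(i, prev, j) * p_t(f[j - 1], e[i - 1])
            pointers[i] = prev
        best = new_best
        back.append(pointers)
    # Trace the back pointers from the best final position.
    last = max(best, key=best.get)
    alignment = [last]
    for pointers in reversed(back):
        alignment.append(pointers[alignment[-1]])
    alignment.reverse()
    return alignment, best[last]

# Toy parameters (made-up numbers, for illustration only).
e = ["I", "pay"]
f = ["watashi", "harau"]
T = {("watashi", "I"): 0.9, ("harau", "I"): 0.1,
     ("watashi", "pay"): 0.2, ("harau", "pay"): 0.8}

def p_t(f_j, e_i):
    return T[(f_j, e_i)]

def p_a(a_j, a_prev, j):
    return 1.0 / len(e)  # uniform alignment probability (assumption)

best_alignment, best_prob = viterbi_alignment(f, e, p_a, p_t)
print(best_alignment, best_prob)
```

For this toy pair the sketch selects the alignment [1, 2], that is, “watashi” to “I” and “harau” to “pay”. The cost is on the order of J×I² evaluations instead of I to the power J.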
Non-Patent Literature 3 proposes an alignment method in which a Context-Dependent Deep Neural Network for HMM, which is one type of Feed Forward Neural Network (FFNN), is applied to the HMM model of Non-Patent Literature 2, so that the alignment score corresponding to the alignment probability and the lexical score corresponding to the lexical selection probability are computed using the FFNN. Specifically, a score sNN (a1J|f1J, e1I) of alignment a1J for a bilingual sentence pair (f1J, e1I) is represented by Equation (4) below.
$$s_{NN}(a_1^J \mid f_1^J, e_1^I) = \prod_{j=1}^{J} t_a\bigl(a_j - a_{j-1} \mid c(e_{a_{j-1}})\bigr) \cdot t_t\bigl(f_j, e_{a_j} \mid c(f_j), c(e_{a_j - 1})\bigr) \tag{4}$$
In the method of Non-Patent Literature 3, normalization over all words is computationally prohibitively expensive and, therefore, a score is used in place of probability. Here, ta and tt correspond to pa and pt of Equation (2), respectively. sNN represents a score of alignment a1J, and c(w) represents context of word w. As in the HMM model, Viterbi alignment is determined by forward-backward algorithm in this model.
FIG. 3 shows a network structure (lexical translation model) of the neural network for computing a lexical translation score tt(fj, ea_{j}|c(fj), c(ea_{j}−1)) of Equation (4). A neural network 60 shown in FIG. 3 has: an input layer (Lookup layer) 70 receiving words fj−1, fj and fj+1 of the source language and words ea_{j}−1, ea_{j} and ea_{j}+1 of the target language and converting these to a vector z0; a hidden layer 72 receiving vector z0 and outputting an output vector z1 in accordance with Equation (5); and an output layer 74 receiving vector z1 and computing and outputting a lexical translation score 76 in accordance with Equation (6). These layers have weight matrices L, {H, BH} and {O, BO}, respectively. Though an example having one hidden layer is described here, two or more hidden layers may be used.
Weight matrix L is an embedding matrix that manages the word embedding of each word. Word embedding refers to a low dimensional real-valued vector representing syntactic and semantic properties of a word. When we represent a set of source language words by Vf, a set of target language words by Ve and the length of word embedding by M, the weight matrix L is an M×(|Vf|+|Ve|) matrix. It is noted, however, that <unk> representing an unknown word and <null> representing null are added to Vf and Ve, respectively.
The lexical translation model receives as inputs the source language word fj and the target language word ea_{j} as objects of computation, as well as their contextual words. The contextual words refer to words existing in a window of a predetermined size. Here, a window width of 3 is assumed, as shown in FIG. 3. Input layer 70 includes a source language input unit 80 and a target language input unit 82. Source language input unit 80 receives the source language word fj as the object of computation as well as the two words immediately preceding and succeeding it, fj−1 and fj+1, finds the corresponding columns of embedding matrix (L), and outputs the source language portion of the word embedding vector. Target language input unit 82 receives the target language word ea_{j} as the object of computation as well as the two words immediately preceding and succeeding it, ea_{j}−1 and ea_{j}+1, finds the corresponding columns of embedding matrix (L), and outputs the target language portion of the word embedding vector. The output of source language input unit 80 and the output of target language input unit 82 are concatenated to form a real-valued vector z0, which is supplied to the input of hidden layer 72. Then, hidden layer 72 captures non-linear features of real-valued vector z0 and outputs a vector z1. Finally, output layer 74 receives vector z1 output from hidden layer 72, and computes and outputs a lexical translation score 76 given by the following expression.
$$t_t\bigl(f_j, e_{a_j} \mid f_{j-1}^{j+1}, e_{a_j - 1}^{a_j + 1}\bigr)$$
Specific computations in hidden layer 72 and output layer 74 are as follows.
$$z_1 = f(H \times z_0 + B_H) \tag{5}$$
$$t_t = O \times z_1 + B_O \tag{6}$$
where H, BH, O and BO are |z1|×|z0|, |z1|×1, 1×|z1| and 1×1 matrices, respectively. Further, f(x) is a non-linear activation function, and htanh(x) is used here, which is represented as:
$$\mathrm{htanh}(x) = \begin{cases} -1 & (x < -1) \\ 1 & (x > 1) \\ x & (\text{otherwise}) \end{cases}$$
The alignment model for computing alignment score ta(aj−aj−1|c(ea_{j−1})) can be formed in the same way.
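The forward computation of the lexical translation network described above (lookup layer, hidden layer of Equation (5) and output layer of Equation (6)) can be sketched in plain Python as follows. The embedding length M=2, the hidden size of 4, the tiny vocabularies, the random initial weights and all function names are assumptions chosen only to keep the example small; the embedding matrix L is stored as a list of columns, one per word.

```python
import random

def htanh(x):
    """Hard tanh: -1 if x < -1, 1 if x > 1, otherwise x."""
    return max(-1.0, min(1.0, x))

def lookup(L, vocab, words):
    """Lookup layer: concatenate the embedding column of each word in the window."""
    z = []
    for w in words:
        z.extend(L[vocab.get(w, vocab["<unk>"])])
    return z

def lexical_score(z0, H, B_H, O, B_O):
    """Equations (5) and (6): z1 = htanh(H z0 + B_H), then t_t = O z1 + B_O."""
    z1 = [htanh(sum(h * x for h, x in zip(row, z0)) + b)
          for row, b in zip(H, B_H)]
    return sum(o * z for o, z in zip(O, z1)) + B_O

# Toy setup (all sizes and values are illustrative assumptions).
M, HIDDEN = 2, 4
src_words = ["<unk>", "watashi", "ga", "riyou"]
tgt_words = ["<unk>", "<null>", "I", "pay", "usage"]
# One shared index space, matching L being an M x (|Vf| + |Ve|) matrix.
src_vocab = {w: i for i, w in enumerate(src_words)}
tgt_vocab = {w: len(src_words) + i for i, w in enumerate(tgt_words)}

rng = random.Random(0)
L = [[rng.uniform(-0.1, 0.1) for _ in range(M)]
     for _ in range(len(src_words) + len(tgt_words))]

# Window of width 3 around f_j = "ga" and around e_{a_j} = "pay".
z0 = (lookup(L, src_vocab, ["watashi", "ga", "riyou"])
      + lookup(L, tgt_vocab, ["I", "pay", "usage"]))
H = [[rng.uniform(-0.1, 0.1) for _ in z0] for _ in range(HIDDEN)]
B_H = [0.0] * HIDDEN
O = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
B_O = 0.0

score = lexical_score(z0, H, B_H, O, B_O)
print(len(z0), score)
```

Here z0 has 2×3×M = 12 elements, one M-dimensional embedding for each of the six input words. The resulting value is an unnormalized score, not a probability.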
In the training of each model, weight matrices of each layer are trained using Stochastic Gradient Descent (SGD) so that the ranking loss represented by Equation (7) below is minimized. The gradients of weights are computed by back propagation.
$$\mathrm{loss}(\theta) = \sum_{(f,e) \in T} \max\bigl\{0,\; 1 - s_\theta(a^+ \mid f, e) + s_\theta(a^- \mid f, e)\bigr\} \tag{7}$$
where θ denotes the parameters to be optimized (weights of weight matrices), T is training data, sθ denotes the score of a1J computed by the model under parameters θ (see Equation (4)), a+ is the correct alignment, and a− is the incorrect alignment that has the highest score by the model under parameters θ.
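The ranking loss of Equation (7) can be illustrated with the short Python sketch below. For each sentence pair it takes the score of the correct alignment a+, finds the highest-scoring incorrect alignment a− among an explicitly listed candidate set, and applies the margin of 1. Listing the candidates explicitly, the function names and the toy scores are assumptions for illustration; in practice a− would be obtained by searching under the current model.

```python
def ranking_loss_term(score, gold, candidates):
    """One summand of Equation (7): max(0, 1 - s(a+) + s(a-))."""
    s_pos = score(gold)
    # a-: the incorrect alignment with the highest score under the model
    s_neg = max(score(a) for a in candidates if a != gold)
    return max(0.0, 1.0 - s_pos + s_neg)

def ranking_loss(examples):
    """Equation (7): sum of the per-pair terms over the training data."""
    return sum(ranking_loss_term(score, gold, candidates)
               for score, gold, candidates in examples)

# Toy scores for two sentence pairs (made-up numbers).
scores_1 = {(1, 2): 2.0, (1, 1): 1.5, (2, 2): 0.25, (2, 1): -1.0}
scores_2 = {(1, 2): 3.0, (1, 1): 1.5, (2, 2): 0.25, (2, 1): -1.0}
examples = [
    (scores_1.get, (1, 2), list(scores_1)),  # margin violated: contributes 0.5
    (scores_2.get, (1, 2), list(scores_2)),  # margin satisfied: contributes 0
]
total = ranking_loss(examples)
print(total)
```

The first pair contributes to the loss because the correct alignment outscores the best incorrect one by only 0.5, which is less than the margin of 1; the second pair contributes nothing.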