One known method for retrieving text is to create an inverted index table. When a word is used as the retrieval unit, the inverted index table is created in the following steps: (1) the document ID numbers assigned to the documents that include a target word are associated with the target word; and (2) a pair of the target word and the list of those document ID numbers is stored in a database. When an inverted index table is stored in a speech data retrieval apparatus in advance and a user inputs a target word into the speech data retrieval apparatus as a query, the speech data retrieval apparatus can refer to the inverted index table, immediately obtain the document ID numbers assigned to the documents that include the input target word, and provide those document ID numbers to the user.
As shown in FIG. 1, an inverted index table consists of plural data sets. Each data set includes a word and a list of document ID numbers. For example, a word “Osaka” is included in two documents “2” and “8”.
When a user inputs a word string into a speech data retrieval apparatus as a query, the speech data retrieval apparatus divides the input word string into words and retrieves the document ID numbers assigned to the documents that include all of those words. Then, in order to check adjacency between the words, the speech data retrieval apparatus refers to the inverted index table, obtains the document ID numbers assigned to the documents in which the words are arranged in the same order as in the input word string, and provides those document ID numbers to the user. It is noted that the appearance position of each word may be stored in the inverted index table together with the document ID number so that the speech data retrieval apparatus can easily check the adjacency between words.
As shown in FIG. 2, an inverted index table consists of plural data sets. Each data set includes a word and pairs of a document ID number and an appearance position of the word. For example, a word “Osaka” appears in an eleventh position in a document “2” and a fourteenth position in a document “8”. In a case of retrieving one or more documents each of which includes a word string “Tokyo Osaka” therein, the speech data retrieval apparatus retrieves one or more documents each of which includes both words “Tokyo” and “Osaka” therein, with reference to the inverted index table, and checks, with respect to each retrieved document, whether or not an appearance position of the word “Tokyo” immediately precedes an appearance position of the word “Osaka”. More specifically, the speech data retrieval apparatus retrieves the documents “2” and “8” each of which includes both words “Tokyo” and “Osaka” therein, with reference to the inverted index table. Then, the speech data retrieval apparatus determines that the document including the word string “Tokyo Osaka” therein is the document “2”, because the words “Tokyo” and “Osaka” appear in a tenth position and an eleventh position in the document “2”, which are adjacent and in query order, whereas they appear in a sixteenth position and a fourteenth position in the document “8”, which are neither adjacent nor in query order.
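The adjacency check described above can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the function names `build_positional_index` and `phrase_search` and the dictionary layout are assumptions made for the example, and the index `FIG2_INDEX` contains only the positions quoted in the text for “Tokyo” and “Osaka”.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to a list of (document ID, position) pairs (positions start at 1)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split(), start=1):
            index[word].append((doc_id, pos))
    return index

def phrase_search(index, phrase):
    """Return sorted IDs of documents in which the words of `phrase` appear consecutively."""
    words = phrase.split()
    if not words:
        return []
    # Candidate phrase starts: every occurrence of the first word.
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        postings = set(index.get(word, []))
        # Keep a start only if this word occurs exactly `offset` positions later.
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in postings}
    return sorted({d for d, _ in candidates})

# Only the entries quoted in the text for FIG. 2 (hypothetical dictionary layout).
FIG2_INDEX = {"Tokyo": [(2, 10), (8, 16)], "Osaka": [(2, 11), (8, 14)]}
print(phrase_search(FIG2_INDEX, "Tokyo Osaka"))  # -> [2]
```

Storing positions makes the adjacency check a simple set lookup; without them, each candidate document would have to be re-scanned.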
Non-patent document 1 (M. Saraclar and R. Sproat “Lattice-Based Search for Spoken Utterance Retrieval” Proc. HLT-NAACL, 2004) discloses a method for retrieving speech data by using an inverted index table. More specifically, the non-patent document 1 discloses a method for creating the inverted index table from lattices of spoken utterances obtained as a result of speech recognition, in order to retrieve the speech data quickly. In this method, a spoken utterance is treated as a document including a target word therein. Of course, in this method, a phoneme or a syllable may be set as the retrieval unit instead of a word.
A lattice is a graph that shows one or more words, phonemes or syllables to be one or more candidates forming a sentence of spoken utterance as a digraph (see FIG. 3). It is noted that numerals “0” to “11” and symbols “A” to “J” shown in FIG. 3 respectively represent node numbers and arc labels such as words, phonemes or syllables. A sequence of labels which are respectively assigned to arcs on one path from a leftmost starting node to a rightmost ending node indicates one hypothesis for a sentence of spoken utterance created as a result of speech recognition. In addition to the label, a weight representing likelihood of the label, a starting time and an ending time in a speech segment corresponding to the label are assigned to each arc. The lattice shown in FIG. 3 is stored in a speech data retrieval apparatus as tabular data shown in FIG. 4. A table shown in FIG. 4 represents the relation of connections between the nodes and the arcs in the lattice shown in FIG. 3.
In the method of the non-patent document 1, a speech data retrieval apparatus creates an inverted index table that includes, for each label, a pair of the label and a list of all arcs corresponding to the label (see FIG. 5). For example, if there is an arc e in a lattice, the speech data retrieval apparatus registers a data set (id[e], k[e], n[e], p(e|k[e]), f(k[e])) on the list of arcs corresponding to a label l[e] in the inverted index table. Here, id[e] represents the utterance ID number assigned to the lattice in which the arc e is included, k[e] represents the number of the source node of the arc e, n[e] represents the number of the destination node of the arc e, p(e|k[e]) represents the probability that the arc e is selected from among the arcs going out of the node k[e], and f(k[e]) represents the probability that a path of the lattice passes through the node k[e].
Values of the probabilities p(e|k[e]) and f(k[e]) are calculated based on weights of arcs in a lattice. In a case where a weight of path from one node to another node in a lattice is given by the product of weights of arcs on the path, the value of probability f(k[e]) is given by dividing the summation of weights of paths from a starting node to an ending node through the node k[e] in the lattice by the summation of weights of all paths from the starting node to the ending node. The value of probability p(e|k[e]) is given by dividing the summation of weights of paths from the node k[e] to the ending node through the arc e in the lattice by the summation of weights of all paths from the node k[e] to the ending node.
In a weighted directed acyclic graph, the summation α(v) of weights of all paths from a starting node to a given node v in the graph is efficiently calculated by the Forward algorithm. The summation β(v) of weights of all paths from a given node v to an ending node in the graph is efficiently calculated by the Backward algorithm.
First, we will describe the Forward algorithm below. According to the Forward algorithm, the speech data retrieval apparatus simultaneously calculates the summation α(v) of weights of all paths from a starting node to a given node v in a graph G and the summation α of weights of all paths from the starting node to an ending node in the graph G. The Forward algorithm is as follows,
Forward(G)
  S ← I
  Q ← I
  for each q ∈ I do
    α(q) ← 1
  while S ≠ ∅ do
    q ← HEAD(S)
    DEQUEUE(S)
    for each e ∈ E(q) do
      α(n[e]) ← α(n[e]) + α(q)*w[e]
      if not n[e] ∈ Q then
        Q ← Q ∪ {n[e]}
        ENQUEUE(S, n[e])
  α ← 0
  for each q ∈ F do
    α ← α + α(q)
where the graph G has a set V of nodes, a set E of arcs, a set I of starting nodes and a set F of ending nodes, a set E(v) of arcs is the set of arcs going out of a node v, k[e] is a source node of arc e, n[e] is a destination node of arc e, l[e] is a label of arc e, w[e] is a weight of arc e, HEAD(S) is a function for returning a head element of queue S, DEQUEUE(S) is a function for deleting a head element of queue S, and ENQUEUE(S, x) is a function for inserting an element x into an end position of queue S.
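As a minimal sketch of the forward computation, the following Python function calculates α(v) for every node and the total α. It rests on several assumptions that are not in the source: arcs are given as hypothetical (k, n, label, w) tuples, and nodes are visited in topological order (Kahn's algorithm) rather than in plain first-in-first-out order, so that α(q) is complete before it is propagated along the arcs leaving q. The lattice `ARCS` is a toy example, not the lattice of FIG. 3.

```python
from collections import defaultdict, deque

def forward(arcs, starts, finals):
    """Return (alpha, total): alpha[v] sums the weights of all start-to-v paths."""
    out_arcs = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set(starts)
    for k, n, _label, w in arcs:
        out_arcs[k].append((n, w))
        indegree[n] += 1
        nodes.update((k, n))
    alpha = defaultdict(float)
    for q in starts:
        alpha[q] = 1.0
    # Kahn's algorithm: a node is dequeued only after all of its incoming
    # arcs have been processed, so alpha[q] is final when it is used.
    queue = deque(v for v in nodes if indegree[v] == 0)
    while queue:
        q = queue.popleft()
        for n, w in out_arcs[q]:
            alpha[n] += alpha[q] * w
            indegree[n] -= 1
            if indegree[n] == 0:
                queue.append(n)
    total = sum(alpha[q] for q in finals)
    return dict(alpha), total

# Toy lattice (hypothetical): node 0 is the starting node, node 3 the ending node.
ARCS = [(0, 1, "A", 0.5), (0, 2, "B", 0.25), (1, 2, "C", 0.5),
        (1, 3, "D", 0.25), (2, 3, "E", 0.5)]
alpha, total = forward(ARCS, starts=[0], finals=[3])
print(alpha[2], total)
```

In this toy lattice the three complete paths A-D, A-C-E and B-E each have weight 0.125, so the total is 0.375.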
Next, we will describe the Backward algorithm below. According to the Backward algorithm, the speech data retrieval apparatus simultaneously calculates the summation β(v) of weights of all paths from a given node v to an ending node in a graph G and the summation β of weights of all paths from the starting node to an ending node in the graph G. The Backward algorithm is as follows,
Backward(G)
  S ← F
  Q ← F
  for each q ∈ F do
    β(q) ← 1
  while S ≠ ∅ do
    q ← HEAD(S)
    DEQUEUE(S)
    for each e ∈ H(q) do
      β(k[e]) ← β(k[e]) + w[e]*β(q)
      if not k[e] ∈ Q then
        Q ← Q ∪ {k[e]}
        ENQUEUE(S, k[e])
  β ← 0
  for each q ∈ I do
    β ← β + β(q)
where the graph G has a set V of nodes, a set E of arcs, a set I of starting nodes and a set F of ending nodes, a set H(v) of arcs is the set of arcs coming into a node v, k[e] is a source node of arc e, n[e] is a destination node of arc e, l[e] is a label of arc e, w[e] is a weight of arc e, HEAD(S) is a function for returning a head element of queue S, DEQUEUE(S) is a function for deleting a head element of queue S, and ENQUEUE(S, x) is a function for inserting an element x into an end position of queue S.
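The backward computation can be sketched in the same way. As before, this is only an illustration under stated assumptions: arcs are hypothetical (k, n, label, w) tuples, the toy lattice `ARCS` is not the lattice of FIG. 3, and nodes are visited in reverse topological order (Kahn's algorithm on out-degrees) instead of plain first-in-first-out order, so that β(q) is complete before it is propagated to the predecessors of q.

```python
from collections import defaultdict, deque

def backward(arcs, starts, finals):
    """Return (beta, total): beta[v] sums the weights of all v-to-end paths."""
    in_arcs = defaultdict(list)
    outdegree = defaultdict(int)
    nodes = set(finals)
    for k, n, _label, w in arcs:
        in_arcs[n].append((k, w))
        outdegree[k] += 1
        nodes.update((k, n))
    beta = defaultdict(float)
    for q in finals:
        beta[q] = 1.0
    # A node is dequeued only after all of its outgoing arcs have been
    # processed, so beta[q] is final when it is used.
    queue = deque(v for v in nodes if outdegree[v] == 0)
    while queue:
        q = queue.popleft()
        for k, w in in_arcs[q]:
            beta[k] += w * beta[q]
            outdegree[k] -= 1
            if outdegree[k] == 0:
                queue.append(k)
    total = sum(beta[q] for q in starts)
    return dict(beta), total

# Toy lattice (hypothetical): node 0 is the starting node, node 3 the ending node.
ARCS = [(0, 1, "A", 0.5), (0, 2, "B", 0.25), (1, 2, "C", 0.5),
        (1, 3, "D", 0.25), (2, 3, "E", 0.5)]
beta, total = backward(ARCS, starts=[0], finals=[3])
print(beta[1], total)
```

The total returned here equals the total of the forward pass over the same lattice, since both sum the weights of all complete paths.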
Thus, the summation of weights of paths from the starting node to the ending node through the node k[e] is given by α(k[e])*β(k[e]). The summation of weights of all paths from the starting node to the ending node is given by β(starting node). The summation of weights of paths from the node k[e] to the ending node through the arc e is given by w[e]*β(n[e]). The summation of weights of all paths from the node k[e] to the ending node is given by β(k[e]). Therefore, the values of the probabilities p(e|k[e]) and f(k[e]) are calculated by using the above-described values as follows,
f(k[e]) = α(k[e])*β(k[e])/β(starting node)
p(e|k[e]) = w[e]*β(n[e])/β(k[e])
If an arc string e1, e2, . . . , eM corresponding to a label string L1, L2, . . . , LM of query is found on a lattice of spoken utterance, an appearance probability is calculated as follows,
P(e1, e2, . . . , eM) = f(k[e1])*p(e1|k[e1])*p(e2|k[e2])* . . . *p(eM|k[eM])
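The two probabilities can be computed for every arc with a short sketch such as the one below. It assumes, purely for brevity, a single starting node and a single ending node, hypothetical (k, n, label, w) arc tuples, and node numbers that increase along every arc (k < n), so that sorting the arcs by source node yields a valid processing order for both passes. It also assumes that every node lies on at least one complete path, so that no β value used as a divisor is zero.

```python
from collections import defaultdict

def arc_posteriors(arcs, start, final):
    """Return a list of (label, f(k[e]), p(e|k[e])), one entry per arc."""
    alpha = defaultdict(float)
    alpha[start] = 1.0
    for k, n, _label, w in sorted(arcs, key=lambda a: a[0]):
        alpha[n] += alpha[k] * w            # forward pass
    beta = defaultdict(float)
    beta[final] = 1.0
    for k, n, _label, w in sorted(arcs, key=lambda a: a[0], reverse=True):
        beta[k] += w * beta[n]              # backward pass
    total = beta[start]                     # weight of all complete paths
    return [(label, alpha[k] * beta[k] / total, w * beta[n] / beta[k])
            for k, n, label, w in arcs]

# Toy lattice (hypothetical): node 0 is the starting node, node 3 the ending node.
ARCS = [(0, 1, "A", 0.5), (0, 2, "B", 0.25), (1, 2, "C", 0.5),
        (1, 3, "D", 0.25), (2, 3, "E", 0.5)]
for label, f, p in arc_posteriors(ARCS, start=0, final=3):
    print(label, round(f, 4), round(p, 4))
```

For each node, the values p(e|k[e]) of its outgoing arcs sum to one, which is a convenient sanity check on the normalization.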
where the appearance probability P(e1, e2, . . . , eM) is a probability that a node k[e1] appears and a path representing the arc string e1, e2, . . . , eM passes through an arc e1 from the node k[e1], an arc e2 from a node k[e2], . . . , and an arc eM from a node k[eM]. It is noted that, in the arc string e1, e2, . . . , eM, n[e_{m-1}] is k[e_m] (2≦m≦M) and l[e_m] is L_m (1≦m≦M). The summation of appearance probabilities of all arc strings corresponding to a label string of query on a lattice of spoken utterance becomes an appearance probability of the label string of the query in the spoken utterance.
In a process of retrieving an arc string corresponding to a label string of query, the speech data retrieval apparatus may assign to each utterance ID number in a list of utterance ID numbers the appearance probability of the arc string corresponding to the label string of the query in the spoken utterance associated with that utterance ID number, and then sort the list of utterance ID numbers with reference to the assigned appearance probabilities. Further, the speech data retrieval apparatus may delete from the list of utterance ID numbers an utterance ID number to which a relatively low appearance probability is assigned.
Next, with reference to FIG. 6, we will describe a method for creating an inverted index table based on N lattices G1, . . . , GN for all spoken utterances to be retrieved.
In step S1, the speech data retrieval apparatus assigns “1” to arguments i and j. In step S2, the speech data retrieval apparatus calculates the summation α(k[ej]) of weights of all paths from a starting node to a source node k[ej] of an arc ej in a lattice Gi according to the Forward algorithm, and calculates the summation β(k[ej]) of weights of all paths from the source node k[ej] to an ending node in the lattice Gi according to the Backward algorithm. In step S3, the speech data retrieval apparatus calculates a data set (id[ej], k[ej], n[ej], p(ej|k[ej]), f(k[ej])) for the arc ej in the lattice Gi. In step S4, the speech data retrieval apparatus registers the data set (id[ej], k[ej], n[ej], p(ej|k[ej]), f(k[ej])) on a list E(l[ej]) of arcs associated with a label l[ej] in an inverted index table of the lattice Gi. In step S5, the speech data retrieval apparatus determines whether or not the value of argument j is equal to the total number M of arcs in the lattice Gi. If the value of argument j is not equal to the total number M of arcs, the speech data retrieval apparatus carries out a process of step S6. If the value of argument j is equal to the total number M of arcs, the speech data retrieval apparatus carries out a process of step S7. In step S6, the speech data retrieval apparatus increments the value of argument j by one, and then returns to the process of step S2. In step S7, the speech data retrieval apparatus determines whether or not the value of argument i is equal to the total number N of lattices. If the value of argument i is not equal to the total number N of lattices, the speech data retrieval apparatus carries out a process of step S8. If the value of argument i is equal to the total number N of lattices, the speech data retrieval apparatus finishes the series of processes. In step S8, the speech data retrieval apparatus increments the value of argument i by one and assigns “1” to the argument j, and then returns to the process of step S2.
According to the above-described method, for example, the speech data retrieval apparatus creates the inverted index table shown in FIG. 5 from the lattice shown in FIG. 3.
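The index creation loop over lattices and arcs can be sketched as follows. The sketch is illustrative only and is not the flowchart of FIG. 6 itself: the function names `lattice_entries` and `build_index`, the (k, n, label, w) arc tuples and the toy lattice are assumptions, and α and β are computed once per lattice instead of once per arc. It further assumes a single starting node and ending node, node numbers that increase along every arc, and that every node lies on at least one complete path.

```python
from collections import defaultdict

def lattice_entries(utt_id, arcs, start, final):
    """Yield (label, (id, k, n, p, f)) for each arc of one lattice."""
    alpha = defaultdict(float)
    alpha[start] = 1.0
    for k, n, _label, w in sorted(arcs, key=lambda a: a[0]):
        alpha[n] += alpha[k] * w            # forward pass
    beta = defaultdict(float)
    beta[final] = 1.0
    for k, n, _label, w in sorted(arcs, key=lambda a: a[0], reverse=True):
        beta[k] += w * beta[n]              # backward pass
    total = beta[start]
    for k, n, label, w in arcs:
        f = alpha[k] * beta[k] / total      # f(k[e])
        p = w * beta[n] / beta[k]           # p(e|k[e])
        yield label, (utt_id, k, n, p, f)

def build_index(lattices):
    """lattices: dict mapping utterance ID -> (arcs, start node, ending node)."""
    index = defaultdict(list)
    for utt_id, (arcs, start, final) in lattices.items():
        for label, entry in lattice_entries(utt_id, arcs, start, final):
            index[label].append(entry)      # register on the list E(label)
    return index

# Toy lattice (hypothetical) registered under utterance ID 1.
ARCS = [(0, 1, "A", 0.5), (0, 2, "B", 0.25), (1, 2, "C", 0.5),
        (1, 3, "D", 0.25), (2, 3, "E", 0.5)]
INDEX = build_index({1: (ARCS, 0, 3)})
print(INDEX["C"])
```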
Next, with reference to FIGS. 7 and 8, we will describe a method for efficiently retrieving from an inverted index table a list of one or more utterance ID numbers assigned to one or more lattices in which an arc string matching a label string L1, . . . , LM of query is included, with respect to lattices G1, . . . , GN. First, with reference to FIG. 7, we will describe a method for retrieving one or more lattices in which all labels forming the label string L1, . . . , LM of the query are included, and retrieving a list of one or more utterance ID numbers assigned to the retrieved one or more lattices. It is noted that the order of appearances of labels is not considered in this method.
In step S11, the speech data retrieval apparatus assigns “1”, “1” and “2” to arguments i, j and k. In step S12, with respect to a label Li, the speech data retrieval apparatus obtains a list E(Li) of arcs from an inverted index table. In step S13, the speech data retrieval apparatus reads from the list E(Li) of arcs an utterance ID number id[ej] corresponding to an arc ej included in j-th data set, and registers the read utterance ID number id[ej] on a list Rij (1≦j≦S: S is the total number of data sets included in the list E(Li) of arcs). It is noted that the speech data retrieval apparatus deletes an utterance ID number duplicated in the list Rij. In step S14, the speech data retrieval apparatus determines whether or not the value of argument j is equal to the total number S of data sets included in the list E(Li) of arcs. If the value of argument j is not equal to the total number S of data sets, the speech data retrieval apparatus carries out a process of step S15. If the value of argument j is equal to the total number S of data sets, the speech data retrieval apparatus carries out a process of step S16. In step S15, the speech data retrieval apparatus increments the value of argument j by one, and then returns to the process of step S13. In step S16, the speech data retrieval apparatus determines whether or not the value of argument i is equal to the total number M of labels. If the value of argument i is not equal to the total number M of labels, the speech data retrieval apparatus carries out a process of step S17. If the value of argument i is equal to the total number M of labels, the speech data retrieval apparatus carries out a process of step S18. In step S17, the speech data retrieval apparatus increments the value of argument i by one and assigns “1” to the argument j, and then returns to the process of step S12.
In step S18, the speech data retrieval apparatus registers on an output list C one or more utterance ID numbers registered on the list R1j (1≦j≦S). In step S19, the speech data retrieval apparatus determines whether or not the value of argument i is “1”. If the value of argument i is “1”, the speech data retrieval apparatus finishes a series of processes. If the value of argument i is not “1”, the speech data retrieval apparatus carries out a process of step S20. In step S20, the speech data retrieval apparatus determines whether or not there are in the list Rkj (1≦j≦S) one or more utterance ID numbers identical to one or more utterance ID numbers included in the output list C. If there are not one or more utterance ID numbers identical to one or more utterance ID numbers included in the output list C, the speech data retrieval apparatus carries out a process of step S21. If there are one or more utterance ID numbers identical to one or more utterance ID numbers included in the output list C, the speech data retrieval apparatus carries out a process of step S22. In step S21, the speech data retrieval apparatus empties the output list C and finishes the series of processes. In step S22, the speech data retrieval apparatus deletes from the output list C an utterance ID number which is not identical to any utterance ID numbers included in the list Rkj. In step S23, the speech data retrieval apparatus determines whether or not the value of argument k is equal to the total number M of labels. If the value of argument k is not equal to the total number M of labels, the speech data retrieval apparatus carries out a process of step S24. If the value of argument k is equal to the total number M of labels, the speech data retrieval apparatus finishes the series of processes. In step S24, the speech data retrieval apparatus increments the value of argument k by one, and then returns to the process of step S20.
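The net effect of steps S11 to S24 is an intersection of the utterance ID lists obtained for the individual labels. A compact sketch of that effect, using set intersection rather than the step-by-step list manipulation of FIG. 7, is given below. The function name, the index layout (label mapped to a list of (id, k, n, p, f) data sets) and the sample entries are hypothetical.

```python
def utterances_with_all_labels(index, labels):
    """Return sorted utterance IDs whose lattice contains every label (order ignored)."""
    result = None
    for label in labels:
        # Duplicate utterance IDs for one label collapse in the set.
        ids = {entry[0] for entry in index.get(label, [])}
        result = ids if result is None else result & ids
        if not result:
            return []   # some label is missing everywhere: empty output list
    return sorted(result or [])

# Hypothetical index: label -> list of (id, k, n, p, f) data sets.
SAMPLE_INDEX = {
    "A": [(1, 0, 1, 0.5, 1.0), (2, 0, 1, 0.5, 1.0)],
    "B": [(2, 1, 2, 1.0, 0.5)],
    "C": [(3, 0, 1, 1.0, 1.0)],
}
print(utterances_with_all_labels(SAMPLE_INDEX, ["A", "B"]))  # -> [2]
```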
Next, with reference to FIG. 8, we will describe a method for determining, for each of the lattices in which all labels forming the label string L1, . . . , LM of the query are included, whether or not the order of appearances of the labels retrieved according to the procedure of FIG. 7 matches that of the labels of the query. It is noted that, in this method, the summation of appearance probabilities of the arc strings in which the order of appearances of labels matches that of the labels of the query is calculated in parallel. More specifically, this method uses the fact that the summation of appearance probabilities becomes “0” when no such arc string exists.
In step S31, the speech data retrieval apparatus assigns “1”, “1” and “1” to arguments i, j and m. In step S32, the speech data retrieval apparatus reads from the output list C a list Ej(Li) of arcs corresponding to a register number j, the register numbers being assigned to the utterance ID numbers included in the output list C as integers increasing from “1”. In step S33, the speech data retrieval apparatus determines whether or not the value of argument i is equal to the total number M of labels. If the value of argument i is not equal to the total number M of labels, the speech data retrieval apparatus carries out a process of step S34. If the value of argument i is equal to the total number M of labels, the speech data retrieval apparatus carries out a process of step S35. In step S34, the speech data retrieval apparatus increments the value of argument i by one and then returns to the process of step S32. In step S35, the speech data retrieval apparatus calculates the following equation Fm(ejm)=f(k[ejm])*p(ejm|k[ejm]), with respect to an arc ejm included in each data set in the list Ej(Lm) of arcs. In step S36, the speech data retrieval apparatus determines whether or not the value of argument m is equal to the total number M of labels. If the value of argument m is equal to the total number M of labels, the speech data retrieval apparatus carries out a process of step S37. If the value of argument m is not equal to the total number M of labels, the speech data retrieval apparatus carries out a process of step S39. In step S37, the speech data retrieval apparatus determines whether or not the value of argument j is equal to the maximum value T of register number. If the value of argument j is not equal to the maximum value T of register number, the speech data retrieval apparatus carries out a process of step S38. If the value of argument j is equal to the maximum value T of register number, the speech data retrieval apparatus carries out a process of step S42.
In step S38, the speech data retrieval apparatus increments the value of argument j by one and then returns to the process of step S32. In step S39, the speech data retrieval apparatus increments the value of argument m by one. In step S40, the speech data retrieval apparatus calculates the following equation
$$F_m(e_{jm}) = \sum_{e \in E_j(L_{m-1}) \,:\, n[e] = k[e_{jm}]} F_{m-1}(e) * p(e_{jm} \mid k[e_{jm}]),$$
with respect to an arc ejm included in each data set in a list Ej(Lm) of arcs. The speech data retrieval apparatus calculates the above-described equation as Fm-1(e)=0 when Fm-1 is not calculated. In step S41, the speech data retrieval apparatus determines whether or not the value of argument m is equal to the total number M of labels.
If the value of argument m is not equal to the total number M of labels, the speech data retrieval apparatus carries out the process of step S39. If the value of argument m is equal to the total number M of labels, the speech data retrieval apparatus carries out the process of step S42. In step S42, the speech data retrieval apparatus calculates a probability
$$P(L_1, \ldots, L_M) = \sum_{e \in E_j(L_m)} F_M(e)$$
that a label string L1, . . . , LM is included in an utterance j. In step S43, the speech data retrieval apparatus determines whether or not the probability P(L1, . . . , LM) is more than “0”. If the probability P(L1, . . . , LM) is more than “0”, the speech data retrieval apparatus carries out a process of step S44. If the probability P(L1, . . . , LM) is not more than “0”, the speech data retrieval apparatus carries out a process of step S45. In step S44, the speech data retrieval apparatus registers on a list S a pair of the utterance ID number and the probability P(L1, . . . , LM). In step S45, the speech data retrieval apparatus determines whether or not the value of argument j is equal to the maximum value T of register number. If the value of argument j is not equal to the maximum value T of register number, the speech data retrieval apparatus carries out the process of step S38. If the value of argument j is equal to the maximum value T of register number, the speech data retrieval apparatus finishes the series of processes.
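The recursion of steps S35 to S42 for a single utterance can be sketched as below. This is an illustration under assumptions, not the flowchart of FIG. 8: the function name and the index layout (label mapped to a list of (id, k, n, p, f) data sets) are hypothetical, the sample values belong to a toy lattice rather than to FIG. 5, and the partial sums of F are accumulated per destination node, which is equivalent because the next step depends on an arc e only through n[e].

```python
from collections import defaultdict

def label_string_probability(index, labels, utt_id):
    """Return P(L1, ..., LM) for one utterance; 0.0 when no matching arc string exists."""
    reach = None  # reach[v]: sum of F_{m-1}(e) over arcs e labelled L_{m-1} with n[e] == v
    for m, label in enumerate(labels):
        nxt = defaultdict(float)
        for uid, k, n, p, f in index.get(label, []):
            if uid != utt_id:
                continue
            # F_1(e) = f(k[e]) * p(e|k[e]); afterwards extend the arc strings ending at k.
            score = f * p if m == 0 else reach.get(k, 0.0) * p
            nxt[n] += score
        reach = nxt
    return sum(reach.values()) if reach is not None else 0.0

# Hypothetical index for one toy lattice (utterance ID 1): (id, k, n, p, f).
TOY_INDEX = {
    "A": [(1, 0, 1, 2 / 3, 1.0)],
    "B": [(1, 0, 2, 1 / 3, 1.0)],
    "C": [(1, 1, 2, 0.5, 2 / 3)],
    "D": [(1, 1, 3, 0.5, 2 / 3)],
    "E": [(1, 2, 3, 1.0, 2 / 3)],
}
print(label_string_probability(TOY_INDEX, ["A", "C"], 1))
print(label_string_probability(TOY_INDEX, ["B", "C"], 1))  # no arc string B-C: 0.0
```

An utterance would be registered on the result list only when the returned value is more than 0, which mirrors the test in step S43.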
In a conventional method for retrieving speech data, the speech data retrieval apparatus creates an inverted index table from lattices obtained as a result of speech recognition of plural pieces of speech data registered in a speech database. However, in this conventional method, there is a problem that the file size of the inverted index table increases because the lattice includes redundant arcs therein. Further, because the lattice includes, for each candidate word, only the adjacency between words which is permitted in the language model used in the speech recognition, the speech data retrieval apparatus cannot retrieve a word string including adjacency between words which is not permitted in the language model. Therefore, there is a problem that retrieval performance deteriorates in the conventional method.