Finite state networks find extensive use in diverse applications including, for example, natural language processing for automatic speech recognition, automatic speech generation, document spell-checking and spell-correction, morphological analysis and generation, and so forth. A finite state network is a directed graph consisting of a finite set of states connected by zero or more labeled transitions (also called “edges” or “arcs”); typically, one state is designated as the start state, and zero or more states are designated as final states. For useful applications, a finite state network typically includes at least one arc, and may include tens, hundreds, thousands, hundreds of thousands, or more arcs. By applying a suitable finite state network to an input (typically a string, i.e., a concatenation of symbols), various functions can be performed. For example, a single-level finite state network having a single label per arc can be used as an acceptor: if at least one path of the network, from the start state to a final state, traverses labeled arcs whose concatenated labels match the input string, then the input is accepted; otherwise, it is rejected. As another example, a two-level finite state network (wherein each arc has an upper-side label and a lower-side label) can be used as a transducer. Arbitrarily treating the upper side as the input side, matching an input string against the upper-side labels of one or more paths outputs, for each matching path, the concatenation of the lower-side labels along that path, thus mapping the input string of symbols to a set of output strings of symbols.
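By way of illustration only, the acceptor and transducer behavior described above may be sketched in Python as follows. The class name Network, its methods, and the example arcs are hypothetical and are not drawn from any particular library; the sketch assumes one symbol per upper-side label and omits epsilon (empty) labels.

```python
from collections import defaultdict


class Network:
    """Toy two-level finite state network (hypothetical, for illustration)."""

    def __init__(self, start, finals):
        self.start = start
        self.finals = set(finals)
        # state -> list of (upper_label, lower_label, target_state)
        self.arcs = defaultdict(list)

    def add_arc(self, src, upper, lower, dst):
        self.arcs[src].append((upper, lower, dst))

    def transduce(self, symbols):
        """Return the set of lower-side strings over all paths from the start
        state to a final state whose upper-side labels spell `symbols`."""
        results = set()
        stack = [(self.start, 0, "")]  # (state, input position, output so far)
        while stack:
            state, pos, out = stack.pop()
            if pos == len(symbols):
                if state in self.finals:
                    results.add(out)
                continue
            for upper, lower, dst in self.arcs[state]:
                if upper == symbols[pos]:
                    stack.append((dst, pos + 1, out + lower))
        return results

    def accepts(self, symbols):
        """Acceptor view: true if at least one matching path exists."""
        return bool(self.transduce(symbols))


# Hypothetical network mapping "cat" to "cat" and "cats" to "cat+PL".
net = Network(start=0, finals=[3, 4])
net.add_arc(0, "c", "c", 1)
net.add_arc(1, "a", "a", 2)
net.add_arc(2, "t", "t", 3)
net.add_arc(3, "s", "+PL", 4)

print(sorted(net.transduce("cats")))  # ['cat+PL']
print(net.accepts("dog"))             # False
```

The acceptor is obtained here simply by ignoring the lower-side output, which reflects the observation that a single-level network may be viewed as a two-level network used only for matching.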
In addition to having a label on each arc, weighted finite state networks further include a weight for each arc of the network, including an understood “exit arc” leaving each final state. In some cases, the weight of an arc may be a neutral or identity weight. The weight of a network path, from the start state to a final state, is computed by combining the weights on the arcs (including the weight of the exit arc of the final state) constituting the path using an extension operation, such as multiplication. The combined weight of a set of paths is computed by combining the weights of the paths using a collection operation, such as addition. A suitable weighting paradigm, represented for example as a semiring object, defines the set of allowable weights, the extension and collection operations, and so forth. In the standard notation, a semiring is an ordered five-tuple specifying the set of allowable weights, the collection and extension operations, and the identity values for the collection and extension operations, respectively.
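As a minimal sketch of the foregoing, a semiring may be represented in Python as an object holding the collection and extension operations together with their identity values. The names Semiring, path_weight, and combined_weight are hypothetical, the set of allowable weights is left implicit, and multiplication and addition are used as the example extension and collection operations.

```python
import operator
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    """Hypothetical semiring object; the weight set itself is left implicit."""
    collect: Callable[[Any, Any], Any]  # collection operation
    extend: Callable[[Any, Any], Any]   # extension operation
    zero: Any                           # identity value for collection
    one: Any                            # identity value for extension


def path_weight(sr, arc_weights, exit_weight):
    """Combine the arc weights of one path, including the exit arc weight,
    using the extension operation."""
    return reduce(sr.extend, list(arc_weights) + [exit_weight], sr.one)


def combined_weight(sr, path_weights):
    """Combine the weights of a set of paths using the collection operation."""
    return reduce(sr.collect, path_weights, sr.zero)


# Example with addition as collection and multiplication as extension.
plus_times = Semiring(collect=operator.add, extend=operator.mul, zero=0.0, one=1.0)

w1 = path_weight(plus_times, [0.5, 0.5], exit_weight=1.0)  # 0.25
w2 = path_weight(plus_times, [0.25], exit_weight=1.0)      # 0.25
total = combined_weight(plus_times, [w1, w2])              # 0.5
print(w1, w2, total)
```

Because the path and set computations refer only to the operations and identity values held by the semiring object, a different weighting paradigm can be substituted without changing those computations.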
One example semiring is: <R≧0, +, ×, 0, 1> where R≧0 denotes a set of weights corresponding to all real numbers greater than or equal to zero, “+” denotes the collection operation which is addition in this semiring, “×” denotes the extension operation which is multiplication in this semiring, zero is the identity value for collection (A+0=A), and 1 is the identity value for extension (A×1=A). This semiring is sometimes called the “real-plus-times” semiring. If weights (w) are kept in the range 0≦w≦1, a probabilistic interpretation of these weights may be applicable. A similar semiring, called the “integer-plus-times” semiring, is constructed by replacing the set R≧0 of real-valued (floating point) weights with the set of integers greater than or equal to zero.
As another illustrative example semiring, <R≧0∪{∞}, min, +, ∞, 0> is the real tropical semiring where min denotes the collection operation which selects the minimum weight from the weights of a set of paths, “+” denotes addition as the extension operation, ∞ (infinity) is the identity value for the collection operation (min{A, ∞}=A), and zero is the identity value for the extension operation (A+0=A). When using this real tropical semiring, the weights are typically negative logarithms of probabilities, interpreted as “costs,” so that the additive extension operation is functionally equivalent to multiplying probabilities, and the min collection operation selects the path with the minimum cost amongst those paths matching the input. An advantage of the tropical semiring is that performing addition operations is typically faster than performing multiplication operations, making the tropical semiring computationally efficient.
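The relationship between tropical costs and probabilities may be illustrated by the following Python sketch, in which the function names and the arc probabilities are hypothetical. It is noted that the min collection operation corresponds to choosing the most probable path, rather than to summing the probabilities of the paths.

```python
import math

INF = math.inf  # identity value for the min collection operation


def cost(probability):
    """Negative logarithm of a probability, interpreted as a cost."""
    return -math.log(probability)


def tropical_path_cost(arc_costs):
    """Extension in the tropical semiring: add costs, starting from 0."""
    total = 0.0
    for c in arc_costs:
        total += c
    return total


def tropical_collect(path_costs):
    """Collection in the tropical semiring: keep the minimum cost."""
    best = INF
    for c in path_costs:
        best = min(best, c)
    return best


# Two hypothetical paths, given by their arc probabilities.
paths = [[0.5, 0.5], [0.25, 0.5]]

path_costs = [tropical_path_cost([cost(p) for p in path]) for path in paths]
best_cost = tropical_collect(path_costs)

# Adding costs mirrors multiplying probabilities, so the minimum-cost path
# is the path whose probability product is largest.
best_probability = max(math.prod(path) for path in paths)
print(math.isclose(math.exp(-best_cost), best_probability))  # True
```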
To facilitate the use of weighted finite state networks, a weighted finite state network library is typically provided. The finite state network library includes various functions or components configured to create, store, minimize, apply, or otherwise process or manipulate finite state network objects stored in a selected network object storage format. Typically, the library includes functions to perform various network combining operations for selectively combining finite state networks. For example, a finite state network library typically includes: a union operation for performing a union of input finite state networks; a concatenation operation for concatenating input finite state networks; an intersection operation for determining an intersection of input finite state networks; and so forth. In addition, a typical library will include algorithms to apply finite state networks to input and retrieve the output or outputs. Useful applications such as tokenizers, spell-checkers, spell-correctors, morphological analyzers or so forth are constructed at least in part by invoking selected components of the weighted finite state network library.
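As one hedged illustration of a network combining operation, the following Python sketch computes the intersection of two unweighted acceptors using the well-known product construction. The tuple representation (start, finals, delta) and the function names are hypothetical and do not reflect the interface of any particular library; the sketch assumes deterministic acceptors without epsilon arcs.

```python
def intersect(a, b):
    """Product construction. Each acceptor is a tuple (start, finals, delta),
    where delta maps (state, symbol) to the next state."""
    a_start, a_finals, a_delta = a
    b_start, b_finals, b_delta = b
    start = (a_start, b_start)
    finals, delta = set(), {}
    seen, todo = {start}, [start]
    while todo:
        p, q = todo.pop()
        # A paired state is final only if both component states are final.
        if p in a_finals and q in b_finals:
            finals.add((p, q))
        for (src, sym), dst_a in a_delta.items():
            if src != p or (q, sym) not in b_delta:
                continue  # keep only arcs that both networks can follow
            nxt = (dst_a, b_delta[(q, sym)])
            delta[((p, q), sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return start, finals, delta


def accepts(net, symbols):
    """Apply a deterministic acceptor to an input string."""
    start, finals, delta = net
    state = start
    for sym in symbols:
        if (state, sym) not in delta:
            return False
        state = delta[(state, sym)]
    return state in finals


# Hypothetical inputs: A accepts zero or more "a" followed by one "b";
# B accepts every string of exactly two symbols over {a, b}.
A = (0, {1}, {(0, "a"): 0, (0, "b"): 1})
B = (0, {2}, {(0, "a"): 1, (0, "b"): 1, (1, "a"): 2, (1, "b"): 2})

both = intersect(A, B)
print(accepts(both, "ab"))   # True
print(accepts(both, "aab"))  # False
```

Union and concatenation operations can be sketched along similar lines, although those constructions typically introduce epsilon arcs or nondeterminism that this simplified representation does not cover.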
Existing implementations of weighted finite state networks typically employ numerical weights, such as integer weights or floating point weights. The art has recognized that non-numerical weights could also be useful for certain applications. For example, a proposed language-union-concatenation semiring is suitably defined by the five-tuple: <2^(Σ*), ∪, ▪, Ø, {ε}>, where 2^(Σ*) denotes the set of all languages over an alphabet Σ (where “language” is a term of art denoting a set of strings), ∪ denotes union of languages, ▪ denotes concatenation of languages, Ø denotes the empty language, and {ε} denotes the language containing only the empty string. The language-union-concatenation semiring uses languages as weights, and hence is applicable in natural language processing and similar applications. As another example, finite state networks employing feature sets as weights are disclosed in Amtrup et al., “Morphology in Machine Translation Systems: Efficient Integration of Finite State Transducers and Feature Structure Descriptions”, Machine Translation, vol. 18(3), pp. 217-238 (2003).
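For illustration, the operations of the language-union-concatenation semiring may be sketched in Python for the special case of finite languages, each represented as a frozenset of strings. The names below are hypothetical, and the finite-set representation is an assumption of the sketch: the weight set of the semiring also includes infinite languages, which such a representation cannot hold.

```python
EMPTY_LANGUAGE = frozenset()         # identity value for union (collection)
EPSILON_LANGUAGE = frozenset({""})   # identity value for concatenation (extension)


def lang_union(a, b):
    """Collection: union of two languages."""
    return a | b


def lang_concat(a, b):
    """Extension: every string of `a` followed by every string of `b`."""
    return frozenset(x + y for x in a for y in b)


# Weight of a hypothetical two-arc path whose arcs carry language weights.
arc1 = frozenset({"a", "b"})
arc2 = frozenset({"c"})
path = lang_concat(lang_concat(EPSILON_LANGUAGE, arc1), arc2)
print(sorted(path))  # ['ac', 'bc']

# Combined weight of two hypothetical paths.
other_path = frozenset({"d"})
combined = lang_union(lang_union(EMPTY_LANGUAGE, path), other_path)
print(sorted(combined))  # ['ac', 'bc', 'd']
```

In this sketch, concatenation with the empty language yields the empty language, so a path containing an arc weighted by Ø contributes nothing to the combined weight.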
Although weighted finite state networks including non-numerical weights have recognized applications, existing techniques for implementing such non-numerically weighted finite state networks have disadvantages. For example, Amtrup discloses storing feature sets as weights using a custom bit-based representation of the features. Weight combinations, in this case to perform feature unification, are implemented in a bitwise fashion on the custom bit-based representation. Such a custom representation of non-numerical weights has numerous disadvantages, such as requiring additional library components or functions to implement low-level weight processing in accordance with the selected weight representational format, consequent increase in the size and complexity of the weighted finite state network library, inflexibility in modifying the weightings for different applications, difficulty in extending existing weightings to other types of weighting (for example, there is no straightforward way to modify an implementation of Amtrup's feature sets weighting to provide a language-union-concatenation based languages weighting), and so forth.