This invention relates to pattern recognition systems. More particularly, the invention relates to determinization and minimization in natural language processing.
Finite-state machines have been used in many areas of computational linguistics. Their use can be justified by both linguistic and computational arguments. Linguistically, finite automata, which are directed graphs having states and transitions, are convenient because they allow one to describe easily most of the relevant local phenomena encountered in the empirical study of language. They often provide a compact representation of lexical rules, idioms, and cliches that appears natural to linguists. Graphic tools also allow one to visualize and modify automata, which helps in correcting and completing a grammar. Other, more general phenomena, such as parsing context-free grammars, can also be dealt with using finite-state machines. Moreover, the underlying mechanisms in most of the methods used in parsing are related to automata.
From the computational point of view, the use of finite-state machines is mainly motivated by considerations of time and space efficiency. Time efficiency is usually achieved by using deterministic algorithms that reduce redundancy in the automata. The time a deterministic machine needs to produce its output depends only on the input size, in general linearly, and can therefore be considered optimal from this point of view. Space efficiency is achieved with classical minimization algorithms that collapse some nodes in the determinized automata in a reverse determinizing manner. Applications such as compiler construction have shown deterministic finite-state automata to be very efficient in practice. Finite-state automata now also constitute a rich chapter of theoretical computer science.
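The time-efficiency argument above can be illustrated with a minimal sketch in Python. It uses the classical subset construction for an unweighted automaton without empty transitions; the function names, the dictionary representation, and the toy automaton are assumptions made for illustration and are not taken from the invention.

```python
from collections import deque

def determinize(nfa, start, finals):
    """Subset construction for a nondeterministic automaton without empty transitions.

    nfa maps (state, symbol) to the set of next states; finals is a set.
    Each state of the result is a frozenset of original states.
    """
    symbols = {sym for (_, sym) in nfa}
    dfa_start = frozenset([start])
    dfa, dfa_finals = {}, set()
    queue, seen = deque([dfa_start]), {dfa_start}
    while queue:
        subset = queue.popleft()
        if subset & finals:
            dfa_finals.add(subset)
        for sym in symbols:
            target = frozenset(q for s in subset for q in nfa.get((s, sym), ()))
            if not target:
                continue
            dfa[(subset, sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, dfa_start, dfa_finals

def accepts(dfa, dfa_start, dfa_finals, word):
    """One table lookup per input symbol: time is linear in len(word)."""
    state = dfa_start
    for sym in word:
        state = dfa.get((state, sym))
        if state is None:
            return False
    return state in dfa_finals

# Hypothetical toy automaton: two arcs labeled "a" leave state 0.
nfa = {(0, "a"): {1, 2}, (1, "b"): {3}, (2, "c"): {3}}
dfa, dfa_start, dfa_finals = determinize(nfa, 0, {3})
```

After determinization, at most one arc per symbol leaves each state, so recognizing a word never requires exploring alternative paths.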
Their recent applications in natural language processing, which range from the construction of lexical analyzers and the compilation of morphological and phonological rules to speech processing, show the practical efficiency of finite-state machines in many areas.
Both time and space concerns are important when dealing with language. Indeed, one of the trends that clearly emerges from new studies of language is a large increase in the size of data. Lexical approaches have been shown to be the most appropriate in many areas of computational linguistics, ranging from large-scale dictionaries in morphology to large lexical grammars in syntax. The effect of this size increase on time and space efficiency is probably the main computational problem one needs to face in language processing.
The use of finite-state machines in natural language processing is certainly not new. Indeed, sequential finite-state transducers are now used in all areas of computational linguistics. The limitations of the corresponding techniques, however, are pointed out far more often than their advantages. This might be explained by the lack of description of recent work in computer science manuals.
It is therefore an object of this invention to provide a pattern recognition system that applies determinization means to achieve time efficiency and minimization means to achieve space efficiency.
It is a more particular object of this invention to provide a speech recognition system that allows pruning of the automata by determinization without loss of information and with better control of the result, while taking into account the word content represented by weights applied to the automata.
It is another object of this invention to provide a speech recognition system that also allows pruning of the automata by minimization without loss of information, while using functions equivalent to those used in determinization.
It is yet another object of this invention to provide a speech recognition system that further allows a given automaton to be tested, by a determinizability test, to determine whether it can be determinized.
These and other objects of the invention are accomplished in accordance with the principles of the present invention by providing a system and method for optimal reduction of redundancy and size of a weighted and labeled graph. The system and method of the present invention include labeling the graph by assigning contents to arcs in the graph, evaluating the weight of each path between states by examining the relationship between the labels, assigning the weight to each arc, and determinizing the graph by removing redundant arcs with the same weights. The system and method may further include the step of minimizing the graph by collapsing a portion of the states in the graph in a reverse determinizing manner.
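As an illustration only, determinization of a weighted and labeled graph can be sketched in Python as follows. The sketch assumes that weights add along a path and that the smallest weight is kept among alternative paths carrying the same labels (a "min, plus" convention that the text above does not specify), and it uses integer weights. All names are hypothetical, and the loop is only guaranteed to terminate for graphs that can be determinized, such as acyclic ones.

```python
def determinize_weighted(arcs, start, finals):
    """arcs: list of (src, label, weight, dst); finals: dict state -> final weight.

    Each new state is a frozenset of (original state, residual weight) pairs.
    The residual is the weight still owed after the minimum has been emitted.
    """
    out = {}
    for src, label, w, dst in arcs:
        out.setdefault(src, []).append((label, w, dst))
    det_start = frozenset([(start, 0)])
    det_arcs, det_finals = [], {}
    stack, seen = [det_start], {det_start}
    while stack:
        subset = stack.pop()
        final_weights = [r + finals[q] for q, r in subset if q in finals]
        if final_weights:
            det_finals[subset] = min(final_weights)
        by_label = {}
        for q, r in subset:
            for label, w, dst in out.get(q, []):
                by_label.setdefault(label, []).append((r + w, dst))
        for label, candidates in by_label.items():
            w_min = min(w for w, _ in candidates)  # weight emitted on the single new arc
            residuals = {}
            for w, dst in candidates:
                residuals[dst] = min(residuals.get(dst, w - w_min), w - w_min)
            target = frozenset(residuals.items())
            det_arcs.append((subset, label, w_min, target))
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return det_arcs, det_start, det_finals

def weight_of(det_arcs, det_start, det_finals, word):
    """Follow the unique path for word and return its total weight, or None."""
    table = {(src, label): (w, dst) for src, label, w, dst in det_arcs}
    state, total = det_start, 0
    for sym in word:
        if (state, sym) not in table:
            return None
        w, state = table[(state, sym)]
        total += w
    return total + det_finals[state] if state in det_finals else None

# Hypothetical graph: two redundant paths labeled "a b" with weights 5 and 4.
arcs = [(0, "a", 1, 1), (0, "a", 3, 2), (1, "b", 4, 3), (2, "b", 1, 3)]
det_arcs, det_start, det_finals = determinize_weighted(arcs, 0, {3: 0})
```

In this toy graph the four original arcs reduce to two, and the single remaining path for "a b" carries the smaller of the two original path weights.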
In another aspect of the invention, the system and method of the present invention may be applied in language processing. Such a system and method include receiving speech signals, converting the speech signals into word sequences, evaluating the probability that each word sequence would be spoken, interpreting the word sequences in terms of the graph having its arcs labeled with the word sequences and its paths weighted with the evaluated probabilities, and determinizing the graph by removing redundant word sequences in the graph during the interpretation. The system and method may further include minimizing the graph by collapsing a portion of the states in a reverse determinizing manner during the interpretation.
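One classical way to collapse states by determinizing in reverse is Brzozowski's construction: reverse the graph, determinize it, reverse it again, and determinize once more. The Python sketch below applies that construction to an unweighted word graph. It is offered as an illustration under that assumption and not as the minimization procedure of the invention, which also handles weights; the function names and the toy word graph are hypothetical.

```python
def subset_construction(arcs, starts, finals):
    """Determinize a graph given as a set of (src, label, dst) arcs.

    Returns (arcs, starts, finals) with states renamed to integers; state 0 is the start.
    """
    out = {}
    for src, label, dst in arcs:
        out.setdefault(src, {}).setdefault(label, set()).add(dst)
    init = frozenset(starts)
    ids = {init: 0}
    stack = [init]
    new_arcs, new_finals = set(), set()
    while stack:
        subset = stack.pop()
        if subset & finals:
            new_finals.add(ids[subset])
        merged = {}
        for q in subset:
            for label, dsts in out.get(q, {}).items():
                merged.setdefault(label, set()).update(dsts)
        for label, dsts in merged.items():
            target = frozenset(dsts)
            if target not in ids:
                ids[target] = len(ids)
                stack.append(target)
            new_arcs.add((ids[subset], label, ids[target]))
    return new_arcs, {0}, new_finals

def reverse(arcs, starts, finals):
    """Flip every arc and swap the roles of start and final states."""
    return {(dst, label, src) for src, label, dst in arcs}, set(finals), set(starts)

def minimize(arcs, starts, finals):
    """Brzozowski's construction: reverse, determinize, reverse, determinize."""
    once = subset_construction(*reverse(arcs, starts, finals))
    return subset_construction(*reverse(*once))

# Hypothetical word graph: "call home" and "phone home" end in separate but equivalent states.
word_arcs = {(0, "call", 1), (0, "phone", 2), (1, "home", 3), (2, "home", 4)}
min_arcs, min_starts, min_finals = minimize(word_arcs, {0}, {3, 4})
min_states = {s for s, _, _ in min_arcs} | {d for _, _, d in min_arcs}
```

Here the five original states collapse to three, because the two states reached after "call" and "phone" accept exactly the same continuations.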
Further features of the present invention, its nature and various advantages will be more apparent from the accompanying drawings and the following detailed description of the preferred embodiments.