This invention relates to pattern recognition systems. More particularly, the invention relates to determinization and minimization in natural language processing.
Finite-state machines have been used in many areas of computational linguistics. Their use can be justified by both linguistic and computational arguments. Linguistically, finite automata, which are directed graphs having states and transitions, are convenient because they allow one to describe easily most of the relevant local phenomena encountered in the empirical study of language. They often provide a compact representation of lexical rules, or of idioms and cliches, that appears natural to linguists. Graphic tools also allow one to visualize and modify automata, which helps in correcting and completing a grammar. Other more general phenomena, such as parsing context-free grammars, can also be dealt with using finite-state machines. Moreover, the underlying mechanisms in most of the methods used in parsing are related to automata.
From the computational point of view, the use of finite-state machines is mainly motivated by considerations of time and space efficiency. Time efficiency is usually achieved by using deterministic algorithms that reduce redundancy in the automata. The time a deterministic machine takes to produce its output depends, in general linearly, only on the input size, and can therefore be considered optimal from this point of view. Space efficiency is achieved with classical minimization algorithms that collapse some nodes in the determinized automata in a reverse determinizing manner. Applications such as compiler construction have shown deterministic finite-state automata to be very efficient in practice. Finite-state automata now also constitute a rich chapter of theoretical computer science.
Their recent applications in natural language processing, which range from the construction of lexical analyzers and the compilation of morphological and phonological rules to speech processing, show the practical efficiency of finite-state machines in many areas.
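The time-efficiency argument above can be illustrated with a minimal sketch of the classical subset construction for unweighted automata. This is a textbook technique shown for orientation only, not the weighted determinization of the invention. The function names `determinize` and `accepts`, the dictionary encoding of transitions, and the example automaton are assumptions introduced for this illustration.

```python
from collections import deque

def determinize(nfa, start, finals):
    """Classical subset construction for an NFA without epsilon transitions.

    nfa    : dict mapping (state, symbol) -> set of next states (assumed encoding)
    start  : initial NFA state
    finals : set of final NFA states
    Returns (dfa, dfa_start, dfa_finals); each DFA state is a frozenset
    of NFA states.
    """
    symbols = {sym for (_, sym) in nfa}
    dfa_start = frozenset([start])
    dfa, dfa_finals = {}, set()
    queue, seen = deque([dfa_start]), {dfa_start}
    while queue:
        subset = queue.popleft()
        if subset & finals:
            dfa_finals.add(subset)
        for sym in symbols:
            # Union of all NFA moves on `sym` from the states in this subset.
            target = frozenset(t for s in subset for t in nfa.get((s, sym), ()))
            if not target:
                continue
            dfa[(subset, sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, dfa_start, dfa_finals

def accepts(dfa, dfa_start, dfa_finals, word):
    """Follow exactly one transition per input symbol: linear in len(word)."""
    state = dfa_start
    for sym in word:
        state = dfa.get((state, sym))
        if state is None:
            return False
    return state in dfa_finals

# Hypothetical example: a nondeterministic automaton over {a, b}
# accepting the strings that end in "ab".
EXAMPLE_NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
EXAMPLE_DFA, EXAMPLE_START, EXAMPLE_FINALS = determinize(EXAMPLE_NFA, 0, {2})
```

Once the automaton is deterministic, `accepts` performs a single dictionary lookup per input symbol, which is the sense in which the running time depends only on the input size. The subset construction itself can produce exponentially many states in the worst case, which is one reason a separate minimization step matters for space.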
Both time and space concerns are important when dealing with language. Indeed, one of the trends that clearly emerges from new studies of language is a large increase in the size of data. Lexical approaches have proved to be the most appropriate in many areas of computational linguistics, ranging from large-scale dictionaries in morphology to large lexical grammars in syntax. The effect of this size increase on time and space efficiency is probably the main computational problem one needs to face in language processing.
The use of finite-state machines in natural language processing is certainly not new. Indeed, sequential finite-state transducers are now used in all areas of computational linguistics. The limitations of the corresponding techniques, however, are pointed out far more often than their advantages. This might be explained by the lack of description of recent work in computer science manuals.
It is therefore an object of this invention to provide a pattern recognition system that applies determinization means to achieve time efficiency and minimization means to achieve space efficiency.
It is a more particular object of this invention to provide a speech recognition system that allows pruning of the automata by determinization without loss of information, and better control of the result, while taking into account the word content represented by weights applied to the automata.
It is another object of this invention to provide the speech recognition system that also allows pruning of the automata by minimization without loss of information while using functions equivalent to those used for determinization.
It is yet another object of this invention to provide the speech recognition system that further allows testing of a given automaton, by means of a determinizability test, to determine whether it can be determinized.