1. Technical Field
The present invention relates to an improved data processing system. In particular, the present invention relates to using fast semi-automatic semantic annotation to train an initial parser in a data processing system. Still more particularly, the present invention relates to using fast semi-automatic semantic annotation to train an initial parser in a statistical spoken dialog system or a statistical text processing system.
2. Description of Related Art
A natural language understanding system is a medium or tool that facilitates communication between humans and machines. For example, part of a natural language understanding system, such as a statistical spoken dialog system, includes conversations between two people and a collection of sentences necessary for a conversation. From these conversations, real application data may be collected.
Currently, there are two main approaches to building natural language understanding systems: the grammar-based approach and the corpus-driven approach. The grammar-based approach requires either a grammarian or a domain expert to handcraft a set of grammar rules. These grammar rules capture domain-specific knowledge, pragmatics, syntax, and semantics. The corpus-driven approach employs statistical methods to model the syntactic and semantic structure of sentences. The task of defining grammar rules is replaced by the simpler task of annotating the meaning of a set of sentences. The corpus-driven approach is more desirable because the induced grammar can model real data closely. Some grammar induction algorithms can automatically capture patterns in which syntactic structures and semantic categories interleave into a multitude of surface forms. In building natural language understanding systems, collection of a “mini-corpus” of 10,000 to 15,000 sentences is a necessary step with either the grammar-based approach or the corpus-driven approach.
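As a rough illustration of the corpus-driven approach described above, the following minimal sketch replaces handwritten rules with annotated sentences and derives a simple model from them by counting. The toy corpus, the tag names (CITY, DATE, NULL), and the functions train and tag are hypothetical and chosen only for illustration. A most-frequent-tag lookup is a deliberately simple stand-in and is not the statistical parser or grammar induction algorithm referred to in this description.

```python
from collections import Counter, defaultdict

# Hypothetical annotated mini-corpus: each sentence is a list of
# (word, semantic tag) pairs. A real corpus would hold thousands of sentences.
ANNOTATED = [
    [("fly", "NULL"), ("to", "NULL"), ("boston", "CITY")],
    [("fly", "NULL"), ("to", "NULL"), ("denver", "CITY"), ("monday", "DATE")],
    [("leave", "NULL"), ("boston", "CITY"), ("friday", "DATE")],
]

def train(corpus):
    """Count how often each word carries each tag in the annotated corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag_name in sentence:
            counts[word][tag_name] += 1
    return counts

def tag(counts, words, unknown="NULL"):
    """Assign each word its most frequent tag; unseen words get `unknown`."""
    result = []
    for word in words:
        if word in counts:
            result.append((word, counts[word].most_common(1)[0][0]))
        else:
            result.append((word, unknown))
    return result

model = train(ANNOTATED)
tagged = tag(model, ["leave", "denver", "friday"])
print(tagged)
```

In this sketch no rule is written by hand; the only manual effort is the annotation itself, which is why the size of the annotated mini-corpus dominates the cost of the corpus-driven approach.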