The present invention generally relates to a method and apparatus for generating data in a processor usable form from input data in the form of units in a natural language. In particular the present invention relates to an interface between a natural language input and a processor usable data output.
With the continuing advancement of computers and data processing equipment, the operation of which has become ever more complicated, there is a need to simplify the user interface to provide a more "user friendly" interface which can be used intuitively, thereby reducing the training required by the user.
When a user wishes a computer or data processor apparatus to carry out an operation, an instruction must be entered by the user. The use of natural language as a method of entering instructions or data has been investigated, for example, in EP-A-0118187, the content of which is hereby incorporated by reference.
When natural language is used instead of simply entering key words, it is necessary for the meaning of the entered natural language instructions to be determined. The interpretation of the meaning of natural language input requires the use of a parser. Parsers can be roughly divided into two types. Parsers which use a full linguistic analysis based on theoretical approaches were dominant in the 1980s because this approach appeared to offer the reward of extracting a much more detailed linguistic analysis. However, such systems have disadvantages: in particular, such parsers are difficult to design and maintain, and they require large computational resources. Examples of parsers which use full linguistic analysis are disclosed in GB-A-2269923 and EP-A-0737928, the contents of which are hereby incorporated by reference.
Another form of parser, used by that part of the natural language community concerned with building systems, relies on a simpler language processing technology which draws on finite state technology and extensions to it. In recent years finite state technology has reasserted itself in the natural language processing community, and recent research in the finite state language processing field is described in a paper entitled "Deterministic Part-of-Speech Tagging with Finite-State Transducers" (E. Roche and Y. Schabes; Computational Linguistics 21(2), pages 227 to 253), the content of which is hereby incorporated by reference. The emergence of finite state technologies has been driven partly by the limitations of the heavy-duty linguistic approach, partly by the need to process very large volumes of free text, and partly by a greater understanding of how to make effective finite-state language components.
A finite-state parser is described in a paper entitled "Partial Parsing via Finite-State Cascades" (S. Abney; Proceedings of the ESSLLI '96 Robust Parsing Workshop, 1996), the content of which is hereby incorporated by reference. It is acknowledged in this paper that the finite state parsing technique is able to extract and output the linguistic structure of the input text efficiently. However, although the parser is able to provide syntactic structure, it does not provide semantic information.
It is an object of the present invention to provide a parser which uses multi-level processing to extract syntactic and semantic information from input natural language and to output data in a processor usable form as indexed variables.
One aspect of the present invention thus provides a method and apparatus acting as an interface between a natural language input and a processor usable output. The input natural language is in the form of units which are categorized into a plurality of different categories. The category of each data unit is determined to generate unit category data. The unit category data is then input into a pipeline or cascaded processing array in which the unit category data of the input data is matched with patterns of unit category data. Where a pattern is matched, group category data is output. Subsequent processing stages in the pipeline or cascade can use unit category data which has not been matched at any previous stage, together with group category data which was generated at a previous stage and has not itself been matched at any previous stage, to match with a predetermined pattern of unit and/or group category data and so generate new group category data. In this way units and/or groups are grouped together at successive stages in the cascade. At each stage in the cascade when a match is found, variables are output corresponding to input data units. At least some of the variables are indexed by other variables in order to identify the modification relationship between the input data units as identified by the various stages of the parsing process.
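The cascaded matching described above can be illustrated with a minimal sketch. The category names (DET, ADJ, NOUN, PREP), the group names (NP, PP) and the three-stage rule set are hypothetical and chosen only for illustration; the sketch groups categories and omits the output of variables.

```python
from typing import List, Sequence, Tuple

Rule = Tuple[Tuple[str, ...], str]  # (pattern of categories, group category)

# Hypothetical rule set: one list of rules per stage of the cascade.
# Within a stage, longer patterns are listed first so they are tried first.
STAGES: List[List[Rule]] = [
    [(("DET", "ADJ", "NOUN"), "NP"), (("DET", "NOUN"), "NP"), (("NOUN",), "NP")],
    [(("PREP", "NP"), "PP")],
    [(("NP", "PP"), "NP")],
]

def run_stage(cats: Sequence[str], rules: List[Rule]) -> List[str]:
    """Scan left to right; replace each matched pattern by its group category."""
    out: List[str] = []
    i = 0
    while i < len(cats):
        for pattern, group in rules:
            if tuple(cats[i:i + len(pattern)]) == pattern:
                out.append(group)       # matched: emit group category data
                i += len(pattern)
                break
        else:
            out.append(cats[i])         # unmatched: pass through to later stages
            i += 1
    return out

def cascade(cats: Sequence[str]) -> List[str]:
    """Feed the output of each stage into the next stage."""
    current = list(cats)
    for rules in STAGES:
        current = run_stage(current, rules)
    return current

# "the big dog in the park" as unit category data
result = cascade(["DET", "ADJ", "NOUN", "PREP", "DET", "NOUN"])
```

In this sketch the first stage yields NP PREP NP, the second NP PP, and the third a single NP, showing how unmatched units and earlier groups both remain available to later stages.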
In accordance with this aspect of the present invention, by outputting indexed variables, as matches are found, it is possible to determine not only the syntactic structure of the input natural language, but also to determine semantic information in the form of the modification relationships between the input natural language units i.e. words. The indexing acts as pointers to identify natural language units modifying other natural language units.
In one embodiment the multi-level pipeline or cascaded processing is implemented using finite-state machines, where each stage comprises finite-state machines implementing a particular grammar rule as a set of transitions. The finite-state machines are preferably deterministic in order to reduce the computational resources required to implement them. However, non-deterministic finite-state machines can also be used within the scope of the present invention, although they require greater processing resources.
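By way of illustration only, a grammar rule expressed as a set of transitions of a deterministic finite-state machine might be sketched as follows. The rule (an optional determiner, any number of adjectives, then a noun), the state numbers and the function name longest_match are assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical transition table for the rule: DET? ADJ* NOUN
# Each (state, category) pair has at most one successor, so the machine
# is deterministic and needs no backtracking.
TRANSITIONS: Dict[Tuple[int, str], int] = {
    (0, "DET"): 1,
    (0, "ADJ"): 1,
    (0, "NOUN"): 2,
    (1, "ADJ"): 1,
    (1, "NOUN"): 2,
}
ACCEPTING = {2}

def longest_match(cats: List[str], start: int) -> Optional[int]:
    """Return the end position (exclusive) of the longest match beginning
    at `start`, or None if the rule does not match there."""
    state = 0
    end: Optional[int] = None
    for pos in range(start, len(cats)):
        nxt = TRANSITIONS.get((state, cats[pos]))
        if nxt is None:
            break                    # no transition defined: stop scanning
        state = nxt
        if state in ACCEPTING:
            end = pos + 1            # remember the latest accepting position
    return end
```

Because each input category selects a single successor state, the cost of matching is one table lookup per unit, which is the resource saving that motivates the preference for deterministic machines.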
The types of variables identify at least a head unit of a segment of input data corresponding to a segment which matches a predetermined pattern, where the head of a segment does not modify any other unit in the segment, and a modifier which modifies either a head or another modifier. To increase the level of semantic information available in the variables, different types of variables (modifiers) can be used to identify different modification relationships between units. Further, in order to identify the unit which is being modified by the modifier, at least some of the variables are indexed. The content of each variable is a corresponding data unit content and the indexing is achieved by indexing the variables by the data unit content of the data unit being modified.
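A minimal sketch of such head and modifier variables is given below. It assumes, purely for illustration, that the matched segment is a simple noun phrase whose last word is the head and whose preceding words all modify that head; the names Variable, kind and noun_phrase_variables are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Variable:
    kind: str                    # "head" or "mod" (hypothetical type names)
    content: str                 # content of the corresponding data unit
    index: Optional[str] = None  # content of the unit being modified, if any

def noun_phrase_variables(words: List[str]) -> List[Variable]:
    """Emit variables for a matched noun-phrase segment.

    Assumption for this sketch: the last word is the head and every
    preceding word modifies it directly.
    """
    head = words[-1]
    variables = [Variable("head", head)]
    for word in words[:-1]:
        # the modifier is indexed by the content of the unit it modifies
        variables.append(Variable("mod", word, index=head))
    return variables

variables = noun_phrase_variables(["big", "red", "car"])
```

For the segment "big red car" the sketch yields an unindexed head variable with content "car" and two modifier variables with contents "big" and "red", each indexed by "car".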
The indexing of the variables takes place at the stage of processing at which the variable is generated so long as there is no ambiguity in the natural language input. Where there is ambiguity, the natural language grammar rules can be written to accommodate such ambiguity, and this results in the use of indexed variables which are indexed using variables generated at an earlier stage in the processing, i.e. variables are generated at an initial stage without indexing and are indexed at a later stage.
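The deferred indexing described above might be sketched as follows. The example assumes that a prepositional phrase is the ambiguous element; the variable type "pmod" and the function names early_stage and later_stage are hypothetical and are used only to show a variable being generated first and indexed afterwards.

```python
from typing import Dict, List, Optional

Variable = Dict[str, Optional[str]]

def early_stage(pp_words: List[str]) -> Variable:
    """Emit a variable for a prepositional phrase without yet deciding
    which unit it modifies (index left empty)."""
    return {"kind": "pmod", "content": pp_words[-1], "index": None}

def later_stage(variables: List[Variable], modified_content: str) -> List[Variable]:
    """Index every still-unindexed modifier by the unit that a later rule
    has identified as being modified. Head variables are left untouched."""
    return [
        dict(v, index=modified_content)
        if v["kind"] != "head" and v["index"] is None
        else v
        for v in variables
    ]

# "man in the park": the attachment of "in the park" is resolved at a later stage
pending = [{"kind": "head", "content": "man", "index": None},
           early_stage(["in", "the", "park"])]
resolved = later_stage(pending, "man")
```

The later stage returns new variables rather than altering the earlier ones, so the unindexed output of the initial stage remains available if a different attachment is subsequently chosen.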
In an embodiment, the input data comprises words in a natural language. This is processed by a lexical processor with reference to a lexicon containing lexical units and corresponding parts of speech data. The output from the lexical processor comprises lexical units which match the input words, together with the corresponding parts of speech data. The parts of speech data can be used directly as the unit category data. However, because the lexical processor performs no context analysis, there can be errors in the parts of speech assignment. Therefore, in an embodiment the output of the lexical processor is input to a parts of speech tagger which performs a statistical context analysis in order to more correctly assign parts of speech to the lexical units.
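The two steps of lexicon lookup followed by context-based tagging might be sketched as below. The lexicon entries and the bigram scores are invented for illustration, and the greedy choice of the best-scoring tag given the previous tag is a simplified stand-in for a statistical tagger, not the tagger of the embodiment.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon: each lexical unit with its possible parts of speech.
LEXICON: Dict[str, List[str]] = {
    "the": ["DET"],
    "can": ["NOUN", "VERB", "AUX"],
    "rusts": ["VERB", "NOUN"],
}

# Hypothetical scores for a tag given the previous tag (illustrative values).
BIGRAM: Dict[Tuple[str, str], float] = {
    ("DET", "NOUN"): 0.9,
    ("DET", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.8,
    ("NOUN", "NOUN"): 0.2,
}

def lookup(words: List[str]) -> List[Tuple[str, List[str]]]:
    """Lexical processing: attach all candidate tags, with no context analysis."""
    return [(w, LEXICON.get(w.lower(), ["UNK"])) for w in words]

def tag(words: List[str]) -> List[Tuple[str, str]]:
    """Greedy context analysis: keep the candidate that scores best
    given the tag chosen for the previous word."""
    previous = "START"
    tagged: List[Tuple[str, str]] = []
    for word, candidates in lookup(words):
        best = max(candidates, key=lambda t: BIGRAM.get((previous, t), 0.0))
        tagged.append((word, best))
        previous = best
    return tagged

tagged = tag(["the", "can", "rusts"])
```

Here the lexicon alone leaves "can" and "rusts" ambiguous, and the context scores select the noun reading of "can" after a determiner and the verb reading of "rusts" after a noun.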
The output data generated by the present invention can be used to control the operation of a system. The variables generated from the input data can be compared with variables generated from reference data starting from a variable indicated to be the head of the input data or reference data in accordance with relationships defining equivalence between the variables. The system can then be controlled in accordance with the result of the comparison. Such a system can for example be a data base retrieval system wherein the input data comprises a natural language query and the reference data comprises natural language keys associated with the data in the data base. When a match is found between the query and the key, this can be indicated to the operator and the data can be retrieved.
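A comparison of this kind might be sketched as follows. The sketch assumes one particular equivalence relationship, namely that a query matches a key when their heads have the same content and every modification relationship in the query is also present in the key; this rule, the tuple layout (kind, content, index) and the function names are assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

Variable = Tuple[str, str, Optional[str]]  # (kind, content, index)

def head_of(variables: List[Variable]) -> Optional[str]:
    """Return the content of the head variable, if there is one."""
    for kind, content, _ in variables:
        if kind == "head":
            return content
    return None

def modifiers_of(variables: List[Variable], content: str) -> Set[str]:
    """Contents of all modifier variables indexed by `content`."""
    return {c for kind, c, index in variables if kind == "mod" and index == content}

def _covered(query: List[Variable], key: List[Variable],
             content: str, seen: Set[str]) -> bool:
    if content in seen:              # guard against cyclic indexing
        return True
    seen.add(content)
    query_mods = modifiers_of(query, content)
    if not query_mods <= modifiers_of(key, content):
        return False                 # query has a modifier that the key lacks
    return all(_covered(query, key, m, seen) for m in query_mods)

def matches(query: List[Variable], key: List[Variable]) -> bool:
    """Compare starting from the head, then follow modifiers recursively."""
    query_head, key_head = head_of(query), head_of(key)
    if query_head is None or query_head != key_head:
        return False
    return _covered(query, key, query_head, set())

query = [("head", "car", None), ("mod", "red", "car")]
key = [("head", "car", None), ("mod", "red", "car"), ("mod", "big", "car")]
found = matches(query, key)
```

Under this assumed rule the query "red car" matches the key "big red car", whereas the reverse comparison fails because the key would then lack the modifier "big"; a retrieval system could use the boolean result to decide whether to present the associated data to the operator.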