A wide range of information processing systems for commercial and research purposes operate on linguistic data. Such systems, including NLP systems, typically take in information and apply linguistic processing techniques to it, such as dividing a text into word units, dividing it into sentence units, and performing grammatical analysis on it to determine the parts of speech (POS) of its words.
The input to such a process might be a buffer of text, a file of formatted text, or a set of data that results from some prior analysis. The input text might be in any of a number of languages or formats, and the appropriate steps to process the text can vary depending on the language and format.
Conventional systems use a number of strategies for processing this information, including hard-coded, loosely coupled, and very loosely coupled configurations. The hard-coded systems utilize hard-coded combinations of processing modules and algorithms to perform the required processing. Such systems can operate at high speed, but are typically cumbersome to maintain, extend, configure and update. Adding a new language, or a new analytical operation, can require extensive re-engineering and complex changes to the deployed system.
Very loosely coupled systems operate by dividing the various operations into independent modules that interact via files and/or databases. These systems are easier than hard-coded systems to extend and configure, but are slow, since the data must be read from and written to slow storage media at each step in the process.
Loosely coupled systems operate by sharing data in memory between operations, but use formats in memory that require all the operations to be written in a common programming language, or require extensive parsing operations. XML systems, for example, can exhibit both of these characteristics: if the XML is stored in memory as a character string, then each processing module must parse the string before processing any data. If it is stored as a complex data structure (such as a DOM), then all the modules must be written in a single runtime environment to access the data.
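The string-parsing overhead described above can be illustrated with a minimal sketch (the XML fragment and the module functions are hypothetical illustrations only): when annotations are shared as an XML string, each module must re-parse the entire string before it can read any data.

```python
import xml.etree.ElementTree as ET

# Hypothetical shared annotation data, held in memory as an XML string.
shared_xml = "<doc><token pos='DT'>The</token><token pos='NN'>cat</token></doc>"

def count_nouns(xml_text):
    # This module must parse the whole string before reading any annotation.
    root = ET.fromstring(xml_text)
    return sum(1 for t in root.iter("token") if t.get("pos") == "NN")

def list_tokens(xml_text):
    # A second module pays the same parsing cost again, from scratch.
    root = ET.fromstring(xml_text)
    return [t.text for t in root.iter("token")]

print(count_nouns(shared_xml))
print(list_tokens(shared_xml))
```

The alternative, sharing a parsed structure such as a DOM, avoids the repeated parsing but ties every module to the single runtime environment in which that structure lives.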
Conventional NLP approaches may also be hampered by character encoding schemes. In particular, many existing text processing systems use legacy character encodings to represent text. This restricts these systems to handling one or a small number of languages at a time. Few systems use Unicode, which permits the system to operate on text in any language.
Several existing systems, particularly in the area of Artificial Intelligence, allow software modules to evaluate their applicability to particular data or to “bid” on input data.
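The "bidding" pattern mentioned above can be sketched as follows; this is a simplified illustration under assumed module names, not a description of any particular existing system. Each module scores its own applicability to the input, and a dispatcher awards the data to the highest bidder.

```python
# Hypothetical sketch of modules that "bid" on input data.

class AsciiTokenizer:
    def bid(self, text):
        # Claims high confidence only for pure-ASCII input.
        return 0.9 if text.isascii() else 0.1

    def process(self, text):
        return text.split()

class CjkTokenizer:
    def bid(self, text):
        # Claims the input when it contains CJK codepoints.
        return 0.9 if any('\u4e00' <= ch <= '\u9fff' for ch in text) else 0.1

    def process(self, text):
        # Character-by-character segmentation as a crude stand-in.
        return list(text)

def dispatch(modules, text):
    # The highest-bidding module processes the data.
    winner = max(modules, key=lambda m: m.bid(text))
    return winner.process(text)

modules = [AsciiTokenizer(), CjkTokenizer()]
print(dispatch(modules, "the cat sat"))
print(dispatch(modules, "\u4f60\u597d"))
```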
A number of architectures have been proposed or implemented for NLP, including, for example, those described in the following papers, each of which is incorporated herein by reference as if set forth herein in its entirety:
Grishman, TIPSTER Text Architecture Design, Version 3.1, October 1998, which describes a pipeline architecture; and
Cunningham, Software Architecture for Language Engineering [“SALE”], Ph.D. Thesis, University of Sheffield, June 2000, describing, among other aspects, what the author refers to as a General Architecture for Text Engineering (“GATE”).
The TIPSTER, SALE and GATE disclosures relate to pipeline architectures, which stand in contrast to the blackboard architecture described below in connection with the present invention.
NLP and NLE are knowledge intensive endeavors, requiring the use of linguistic and extra-linguistic information such as lexica, language models, grammars, and script features. Substantially all non-trivial NLP systems or applications use these types of knowledge to one degree or another.
Accordingly, it would be desirable to provide a linguistic processing platform architecture that can support multiple knowledge sources in a consistent, extensible way, and which facilitates the addition of new or updated sources to an existing NLP system.
Few NLE problems are monolithic; instead, there are typically distinct tasks that support or feed into one another. For example, a part of speech (“POS”) tagger typically requires that sentence boundaries have been disambiguated, which often in turn requires that the original text be tokenized. (Examples of such aspects are set forth in commonly owned U.S. patent application Ser. No. 10/883,038 filed Jul. 1, 2004, entitled Method & System for Language Boundary Detection, incorporated by reference herein in its entirety.)
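The task dependencies described above can be sketched as follows; the functions are toy illustrations (a real POS tagger would draw on lexica and language models), included only to show how each task presupposes the output of the one before it.

```python
import re

def tokenize(text):
    # Split raw text into word and punctuation tokens.
    return re.findall(r"\w+|[.!?]", text)

def split_sentences(tokens):
    # Sentence boundary disambiguation presupposes tokenization.
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in ".!?":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

def pos_tag(sentence):
    # Toy tagger: POS tagging presupposes sentence boundaries.
    return [(tok, "PUNCT" if tok in ".!?" else "WORD") for tok in sentence]

sents = split_sentences(tokenize("Dogs bark. Cats meow."))
tagged = [pos_tag(s) for s in sents]
print(tagged)
```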
In addition, the tasks themselves may be quite complex; minimalism is neither required nor optimal in many cases. For example, tokenization is a relatively simple task for most languages (with the significant exception of ambiguous punctuation), but is quite difficult for Chinese, Japanese, Thai or similar languages.
Accordingly, it would be desirable to provide a linguistic processing platform architecture that can support decomposition without requiring it.
It would also be desirable to provide an architecture that enables and facilitates the addition of new linguistic tools to the platform. Thus, for example, not only should it be simple for designers and implementers to add new linguistic components into the framework of the linguistic processing system, but it should also be simple for the designer to provide new or updated components to existing users and customers, which can be “dropped into” existing applications.
It would also be desirable to provide such an architecture that enables designers to provide a substantially complete package of language processing utilities and applications, wherein features can be activated, on a per-feature basis, using an external configuration file.
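Per-feature activation through an external configuration file might be sketched as follows; the feature names and the JSON format are hypothetical choices for illustration, since the passage above does not specify a configuration format.

```python
import io
import json

# Simulate an external configuration file (an in-memory stream here);
# the feature names below are invented for illustration.
config_file = io.StringIO(json.dumps({
    "features": {"tokenizer": True, "pos_tagger": False}
}))

config = json.load(config_file)

# Only features switched on in the configuration are activated.
enabled = [name for name, on in config["features"].items() if on]
print(enabled)
```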
Still further, it would be desirable to be able to update data sources without requiring users to recompile their libraries.
In addition, it would be desirable to provide a core architecture that is multilingual, or, more particularly, human-language neutral, in that it does not rely on particular scripts or language features. Those skilled in this area of technology will appreciate that NLE components are typically assigned significant computational burdens, and ideally should not have to contend with the vagaries of character encoding and the like.
It would also be desirable to provide an architecture that provides input/output neutrality, in that the NLE components need not be concerned with how text is provided to the system: no assumptions are made about the input format, nor does the architecture concern itself with the format of output information.