Technical Field
This disclosure relates generally to development and testing of computer software components for Natural Language Processing (NLP)-based information systems.
Background of the Related Art
The Unstructured Information Management Architecture (UIMA) is a specification that standardizes a software system framework for performing complex content analytics on unstructured data. The main idea of UIMA is that a document is submitted to a pipeline composed of an ordered set of Annotators and Controllers. The Annotators are invoked sequentially or in parallel, each providing annotations on the content and recording them in the document along the way. Optionally, an Annotator can use the results of other Annotators that have been executed before it in the pipeline, and it adds its own output, if any, to the collective data set for further operations.
UIMA pipelines vary in size and complexity, and it is not uncommon to have a pipeline of substantial size that requires a large amount of time to initialize and run. For example, the Natural Language Processing (NLP)-based Question/Answering system IBM® Watson may contain up to 300 Annotators and take several minutes or more to initialize and run. Such computational requirements cause productivity problems for Annotator developers, who must restart the pipeline every time they make a change, no matter how small, to a particular Annotator. As a result, a developer typically works on one or more parts of the pipeline, spending time to create and configure an environment suitable for testing his or her particular change. Besides the time needed to maintain these partial pipelines, developers have had mixed results with the efficiency of these test environments. Moreover, developers still must complete their testing with a full pipeline to be confident of the final results. Current development and testing systems do not adequately satisfy these needs.
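By way of illustration only, the pipeline model described above (an ordered set of Annotators, each recording annotations in a shared document and optionally reading the results of earlier Annotators) can be sketched as follows. This is a simplified sketch and not the UIMA API: the names `Document`, `Annotation`, `token_annotator`, `capitalized_annotator`, and `run_pipeline` are hypothetical, and only sequential invocation is shown.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    """A typed span over the document text (hypothetical structure)."""
    kind: str
    begin: int
    end: int


@dataclass
class Document:
    """The text plus the collective set of annotations recorded so far."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, annotation: Annotation) -> str:
        return self.text[annotation.begin:annotation.end]


# An Annotator is modeled here as any callable that mutates the document.
Annotator = Callable[[Document], None]


def token_annotator(doc: Document) -> None:
    """Records one Token annotation per whitespace-delimited word."""
    for match in re.finditer(r"\S+", doc.text):
        doc.annotations.append(Annotation("Token", match.start(), match.end()))


def capitalized_annotator(doc: Document) -> None:
    """Uses the Token annotations produced earlier in the pipeline."""
    # Take a snapshot so that appending does not affect the iteration.
    tokens = [a for a in doc.annotations if a.kind == "Token"]
    for token in tokens:
        if doc.covered_text(token)[0].isupper():
            doc.annotations.append(
                Annotation("Capitalized", token.begin, token.end)
            )


def run_pipeline(doc: Document, annotators: List[Annotator]) -> Document:
    """Invokes each Annotator in order on the same document."""
    for annotate in annotators:
        annotate(doc)
    return doc


if __name__ == "__main__":
    result = run_pipeline(
        Document("Watson answers questions"),
        [token_annotator, capitalized_annotator],
    )
    for a in result.annotations:
        print(a.kind, repr(result.covered_text(a)))
```

In this sketch, a change to any single Annotator requires rerunning every Annotator that precedes it, because later Annotators depend on annotations already recorded in the document. In a pipeline of several hundred Annotators, this dependency is what makes each small change costly to test.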