Evaluation of linguistic or natural language processing ("NLP") applications, such as spell checkers and grammar checkers, plays an increasingly important role in both the academic and industrial natural language communities. Specifically, the growing language technology industry needs measurement tools to allow researchers, engineers, managers, and customers to track development, evaluate and assure quality, and assess suitability for a variety of applications. Currently, two tools are used for evaluating and testing NLP applications, namely, test suites and test corpora. Test suites can generally be described as focused data sets constructed by researchers or others for testing a specific aspect of an NLP application, while test corpora can generally be described as naturally occurring sets of text.
One specific approach for evaluating NLP applications is discussed in a paper entitled "TSNLP-Test Suites for Natural Language Processing" by Lehmann et al., published on Jul. 15, 1996, which paper is incorporated herein by reference in its entirety. The TSNLP approach is based on the assumption that, in order to yield informative and interpretable results, any test items used for an actual test or evaluation must be specific to the application and the user, since every NLP application (whether commercial or under development) exhibits specific features which make it unique and every user (or developer) of an NLP system has specific needs and requirements. The TSNLP approach is also guided by the need to provide test items that are easily reusable.
To achieve these two goals of specificity and reusability, the TSNLP paper suggests the abandonment of the traditional notion of test items as a monolithic set in favor of the notion of a database in which test items are stored together with a rich inventory of associated linguistic and non-linguistic annotations. The test item database thus serves as a virtual database that provides a means to extract relevant subsets of the test data suitable for some specific task. Using the explicit structure of the data and given TSNLP annotations, the database engine allows for the searching and retrieving of data from the virtual database, thereby creating a concrete database instance according to arbitrary linguistic and extra-linguistic constraints.
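The notion of extracting a task-specific subset from an annotated store of test items can be illustrated with a minimal sketch. The field names (`phenomenon`, `wellformed`, `domain`), the sample items, and the `select` helper below are assumptions chosen for illustration only; they do not reproduce the actual TSNLP schema or its database engine.

```python
from dataclasses import dataclass, field


@dataclass
class TestItem:
    """A test item stored together with its annotations (hypothetical layout)."""
    text: str
    phenomenon: str
    annotations: dict = field(default_factory=dict)


def select(items, **constraints):
    """Return the items that satisfy every given constraint.

    A constraint may refer to the phenomenon or to any annotation key.
    """
    result = []
    for item in items:
        merged = {"phenomenon": item.phenomenon, **item.annotations}
        if all(merged.get(key) == value for key, value in constraints.items()):
            result.append(item)
    return result


# Illustrative data only; not taken from the TSNLP test suites.
ITEMS = [
    TestItem("This are a test.", "subject_verb_agreement",
             {"wellformed": False, "domain": "general"}),
    TestItem("This is a test.", "subject_verb_agreement",
             {"wellformed": True, "domain": "general"}),
    TestItem("He bought an book.", "a_an",
             {"wellformed": False, "domain": "general"}),
]

# Extract a concrete subset: ill-formed subject-verb agreement items.
subset = select(ITEMS, phenomenon="subject_verb_agreement", wellformed=False)
```

In this sketch the full list plays the role of the virtual database, and each call to `select` yields one concrete instance tailored to a particular evaluation task.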
To provide control over the test data when performing an evaluation of an NLP application, the TSNLP paper emphasizes the value of using test suites in lieu of test corpora, since test suites provide the ability to focus on specific linguistic phenomena. This focus is achieved in particular by requiring that as many linguistic parameters as possible within the test suite be kept under control. For example, since vocabulary is a controllable linguistic parameter, the TSNLP approach requires that the vocabulary be restricted in size as well as domain. Additionally, the TSNLP approach attempts to control the interaction of phenomena by requiring that the test items be as small as possible.
The TSNLP paper also suggests the desirability of providing progressivity, that is, the principle of starting from the simplest test items and increasing their complexity. In the TSNLP approach, this aspect is addressed by requiring that each test item focus only on a single phenomenon that distinguishes it from all other test items. (For each phenomenon within a test item, the application under test should generate a phenomenon response; e.g., for each misspelled word within a sentence, a spell checker should generate a list of alternative word suggestions.) In this manner, test data users apply the test data in a progressive order, resulting in the special attribute presupposition in the phenomenon classification.
While the approach for evaluating NLP applications as taught in the TSNLP paper does work for its intended purpose, the above-noted requirements cause the TSNLP approach to suffer the disadvantage of not allowing for the efficient testing of real user sentences with multiple errors on a large scale. In addition, since the base TSNLP approach only provides for queries that tally failures, the TSNLP approach for evaluating NLP applications provides information which may not completely reflect the behavior of the NLP application. For example, a test suite comprising "This are a test." may, when utilized as an input to an NLP application, produce an actual result of "This is an test." Under the TSNLP approach, this would result in a flagged Subject-Verb failure without alerting the developer that the NLP application had a failed A/An correction and a bad rewrite. This inability to track uncommon patterns in the behavior of an NLP application on a more granular level renders the TSNLP approach for evaluating NLP applications susceptible to minor changes in the output of the underlying NLP application. Accordingly, the TSNLP approach still requires an undesirably large amount of resources and time to identify and fix individual symptom bugs in an NLP application.
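The difference between merely tallying a failure and recording what actually went wrong can be illustrated with a minimal sketch. The function names `tally_failure` and `token_differences` are hypothetical, and the token-level comparison shown is only one assumed way of exposing more granular behavior; it is not a method described in the TSNLP paper.

```python
import difflib


def tally_failure(expected, actual):
    """Tally-style check: report only whether the output differs from the expected text."""
    return expected != actual


def token_differences(expected, actual):
    """Granular check: list each token span where the output departs from the expected text."""
    exp_tokens = expected.split()
    act_tokens = actual.split()
    matcher = difflib.SequenceMatcher(a=exp_tokens, b=act_tokens, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, exp_tokens[i1:i2], act_tokens[j1:j2]))
    return differences


expected = "This is a test."   # desired correction of "This are a test."
actual = "This is an test."    # output of the application under test

failed = tally_failure(expected, actual)        # True: one failure is counted
details = token_differences(expected, actual)   # [('replace', ['a'], ['an'])]
```

The tally records a single failure against the test item, whereas the token-level comparison shows that the verb was in fact corrected and that the divergence lies in the article, which corresponds to the bad A/An rewrite noted above.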
These deficiencies are also found in another tool for tracking problems found when evaluating NLP applications, dubbed "RAID", which has been used internally within Microsoft. Specifically, RAID similarly requires that each test item focus only on a single phenomenon which distinguishes it from all other test items. This is required because the database scheme and associated simple querying method implemented in RAID fail to allow for the tracking of complex relationships between system bugs, which the user sees, and underlying product bugs. Accordingly, RAID likewise suffers the disadvantage of not allowing for the efficient testing of real user sentences with multiple errors on a large scale. Furthermore, the base implementation of RAID is also limited to queries that tally failures which, as discussed previously, renders this method of evaluating NLP applications highly susceptible to minor changes in the output of the underlying NLP application(s).