The exemplary embodiment relates to the field of pipeline system processing and finds particular application in a system and method for predicting and addressing errors in individual components of a pipeline system.
Pipeline processing has been a common technique in computing since its development in the 1970s. See, for example, D. M. Ritchie, “The evolution of the unix time-sharing system,” Communications of the ACM, 17:365-375, 1984. The idea behind the technique is that complex processing can be achieved by decomposing a process into a series of more basic components, each performing part of the process. In some cases, this can produce a more intricate output than would have been possible with a single method. It has been used, for example, in Natural Language Processing (NLP) applications, such as named entity recognition (Ritter, et al., “Named entity recognition in tweets: An experimental study,” Proc. 2011 Conf. on Empirical Methods in Natural Language Processing, pp. 1524-1534 (July 2011)), text summarization (Ly, et al., “Product review summarization from a deeper perspective,” Proc. 11th Annual Intern'l ACM/IEEE Joint Conf. on Digital libraries, JCDL '11, pp. 311-314 (2011)), and recognizing textual entailment (Finkel, et al., “Solving the problem of cascading errors: approximate bayesian inference for linguistic annotation pipelines,” Proc. 2006 Conf. on Empirical Methods in Natural Language Processing, EMNLP '06, pp. 618-626 (2006)). For example, comment or opinion summarization systems may make use of a pipeline-like architecture in which a first component filters out spam comments and a second component then categorizes the remaining comments into aspects. In identifying evaluative sentences, MacCartney, et al. propose a three-stage approach to textual inference: linguistic analysis (which is itself a pipeline), followed by graph alignment, and ending with determining an entailment (see, MacCartney, et al., “Learning to recognize features of valid textual entailments,” Proc. Main Conf. on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL '06, pp. 41-48 (2006)).
Software architectures are available for building NLP pipelines, such as GATE (see, Cunningham, et al., “Software infrastructure for natural language processing,” Proc. 5th Conf. on Applied Natural Language Processing, ANLC '97, pp. 237-244 (1997)). In the case of classification, running two binary classifiers in series can yield better results than a more complex multi-class classification approach (see, Lamb, A., Paul, M. J., Dredze, M., “Separating fact from fear: Tracking flu infections on Twitter,” Proc. NAACL-HLT, pp. 789-795, 2013).
One problem with a pipeline approach is that when errors occur, it is difficult to identify the root cause. This is because when data have been processed through a pipeline of components, only partial feedback may be available. That is, an input X passes through a series of components that ultimately produces an output Y. Each component in the processing pipeline performs some action on X, and any of the components may introduce an error. However, the user often has access only to the final output, so it is unclear which of the components was at fault when an error is observed in that output. While in some cases a user may be able to provide feedback with respect to at least some of the components, this may entail much more work on the user's part and may also be prone to inaccuracies if it is difficult for the user to identify the source of errors.
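By way of illustration only, the partial-feedback problem described above can be sketched in a few lines of Python. The two components (whitespace stripping and lowercasing), their deliberately faulty variants, and the names `run_pipeline` and `observed_error` are all hypothetical and chosen purely for the example; they do not correspond to any particular system. The sketch shows that a fault in either component produces the same single observation, namely that the final output Y is wrong.

```python
from functools import reduce
from typing import Callable, Sequence

# A component maps an intermediate value to the next intermediate value.
Component = Callable[[str], str]


def run_pipeline(components: Sequence[Component], x: str) -> str:
    """Pass input X through each component in order; return only the final output Y."""
    return reduce(lambda value, component: component(value), components, x)


# Hypothetical components, each with a correct and a deliberately faulty variant.
def strip_ok(text: str) -> str:
    return text.strip()


def strip_faulty(text: str) -> str:
    return text  # fault: surrounding whitespace is not removed


def lower_ok(text: str) -> str:
    return text.lower()


def lower_faulty(text: str) -> str:
    return text  # fault: case is not normalized


def observed_error(components: Sequence[Component], x: str, expected: str) -> bool:
    """The only feedback available: whether the final output differs from what was expected."""
    return run_pipeline(components, x) != expected


x, expected = "  Hello ", "hello"

# A fault in the first component and a fault in the second component
# yield the same observation (an error in Y), so the observation alone
# does not reveal which component was responsible.
feedback_first_faulty = observed_error([strip_faulty, lower_ok], x, expected)
feedback_second_faulty = observed_error([strip_ok, lower_faulty], x, expected)
feedback_no_fault = observed_error([strip_ok, lower_ok], x, expected)

print(feedback_first_faulty, feedback_second_faulty, feedback_no_fault)
```

In this sketch the intermediate values are never exposed, which mirrors the situation in which the user sees only the final output and can report only whether it is correct.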
There remains a need for a system and method for predicting the root cause(s) of errors in a pipeline, given the input and output data and information as to whether or not an error has occurred.