Information extraction (IE) technology has received increasing attention over the past several years. In contrast to information retrieval (IR) technology, which is concerned with the problem of automated document retrieval, IE technology is concerned with the problem of automatically extracting the information contained in documents. IE technology also differs fundamentally from the full-blown natural language understanding problem. The general natural language understanding problem is concerned with developing computer systems that have a “deep” understanding of a text. IE technology, by contrast, does not try to understand all of the information conveyed in a text; it is concerned simply with attaining a “partial understanding” of a text for the purpose of extracting specific information. IE technology can be applied in a range of situations (e.g., as an advanced technique for searching the web and e-mail, to assist speech recognition and language translation systems, to automatically extract information from bus schedules, to automatically extract information regarding a particular disease from medical documents, to grade student exam essays, etc.).
Theories and experiments in the field of text comprehension have often required mapping recall, summarization, talk-aloud, and question-answering protocol data into a semantic model of the implicit and explicit information in text clauses. This semantic model of the information in the text clauses has been referred to as the text-based microstructure. Typically, this initial coding procedure of mapping the protocol data into a text-based microstructure is performed by human coders. Inter-coder reliability measures are then used to establish the reliability of the coding procedure.
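As an illustration of the inter-coder reliability measures mentioned above, the following is a minimal sketch of Cohen's kappa, one commonly used chance-corrected agreement statistic for two coders. The text does not say which reliability measure is used, so the choice of kappa, the function name cohen_kappa, and the sample codings are assumptions made purely for illustration.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labeled the same sequence of items."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(codes_a)

    # Observed agreement: fraction of items both coders labeled identically.
    p_observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n

    # Chance agreement: computed from each coder's marginal label frequencies.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_chance == 1.0:
        # Both coders used one identical label throughout; agreement is total.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical codings: 1 means the coder judged a target proposition to be
# present in a protocol clause, 0 means absent. These values are invented.
coder_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
coder_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 3))  # prints 0.583
```

In this invented example the two coders agree on 8 of 10 clauses (0.80), but 0.52 agreement would be expected by chance from their label frequencies, which is why the chance-corrected value is noticeably lower than the raw agreement rate.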
This widely used coding methodology, however, has several problems. First, the coding procedures are typically not well documented. Second, the reliability of the procedures is often highly dependent upon human coders who, despite their best intentions, are prone to inconsistent coding behaviors (especially over very large coding tasks). Third, such coding procedures are typically not readily accessible to other researchers. Fourth, coding procedures are not standardized in any particular fashion across research labs located in different parts of the world.
An ideal solution to these problems would be to develop an automated approach to coding human protocol data. Although important progress in this area has been made, additional work is required. It should be emphasized that the task of coding human protocol data is not nearly as complex as the full-fledged natural language understanding problem. Consider a typical experiment in which a group of human subjects is asked to recall the same story from memory. Although the resulting protocol data will be extremely rich and varied, the text comprehension researcher is typically interested in detecting only a relatively small number of propositions. This dramatically simplifies the pattern recognition problem. Thus, there is a need for a new theoretical framework for mapping human protocol data into a text-based microstructure.