Many U.S. students graduate from high school without having acquired the reading skills needed to successfully participate in today's high-tech knowledge economy (Kirsch, Braun, Yamamoto & Sum, 2007). The No Child Left Behind legislation (NCLB, 2001) was designed to help states address this problem. The legislation requires educators to develop challenging academic content standards that clearly describe what students at different grade levels (GLs) are expected to know and be able to do in core academic areas like reading and math. The legislation also requires states to develop end-of-year accountability assessments that provide valid and reliable information about students' progress towards achieving those standards.
This application addresses an important component of the NCLB assessment development process: locating (or creating) reading passages that are closely aligned with the reading standards specified for students at successive GLs. The approach builds on previous research described in Sheehan, Kostin, Futagi, Hemat & Zuckerman (2006) and Sheehan, Kostin & Futagi (2007a, 2007b). That research culminated in the development of an automated text analysis system designed to help test developers locate appropriately targeted stimulus materials for use on different types of verbal reasoning assessments. Since the reading passages included on high-stakes assessments (e.g., those that factor into determining college admissions) are frequently adapted from previously published texts called sources, this system is called SourceFinder.
SourceFinder's existing text analysis modules account for differences in the source requirements specified for different types of assessments by evaluating candidate source documents multiple times. Results are then communicated to users via a set of acceptability ratings defined such that each individual rating reflects the acceptance criteria specified for a particular type of passage associated with a particular assessment program.
SourceFinder's original text analysis routines were designed to help test developers locate source texts for use on the Verbal Section of the Graduate Record Examination (GRE), an examination taken by students seeking admission to graduate school. Sheehan et al. (2007a, 2007b) compared SourceFinder's predictions of source acceptability to those provided by experienced human raters. The comparison was implemented on an independent cross-validation sample that included 1,000 candidate source texts that had each been rated by two experienced test developers. The analysis confirmed that SourceFinder's predictions of source acceptability behave much like the ratings provided by trained human raters. For example, while the human raters agreed with each other 63% of the time, the agreement between SourceFinder and a human rater ranged from 61% to 62%. These findings suggest that SourceFinder's automated text analysis routines have succeeded in capturing useful information about the characteristics of texts that affect test developers' ratings of source acceptability, at least for texts pitched at the advanced proficiency level targeted by the GRE.
[Note: Although the test developers' ratings were originally expressed on a five-point scale where 1=Definitely Reject, 2=Probably Reject, 3=Uncertain, 4=Probably Accept, and 5=Definitely Accept, Levels 1 and 2 were subsequently collapsed to form a single “Reject” category, and Levels 4 and 5 were subsequently collapsed to form a single “Accept” category. The evaluation was implemented on the resulting three-point scale.]
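By way of illustration only, the scale collapsing described in the note above, together with one simple way of computing rater agreement, may be sketched as follows. The function names and the sample ratings are hypothetical, and the use of exact agreement (the proportion of texts assigned the same collapsed category) is an assumption; the cited studies may have computed agreement differently.

```python
def collapse_rating(rating: int) -> str:
    """Map a five-point rating onto the three-point scale described in the note."""
    if rating in (1, 2):
        return "Reject"
    if rating == 3:
        return "Uncertain"
    if rating in (4, 5):
        return "Accept"
    raise ValueError(f"rating must be between 1 and 5, got {rating}")


def exact_agreement(ratings_a, ratings_b) -> float:
    """Proportion of texts on which two raters give the same collapsed rating."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    matches = sum(
        1
        for a, b in zip(ratings_a, ratings_b)
        if collapse_rating(a) == collapse_rating(b)
    )
    return matches / len(ratings_a)


# Made-up ratings for five texts (illustrative only, not data from the study).
rater_1 = [1, 2, 3, 4, 5]
rater_2 = [2, 2, 4, 5, 1]
agreement = exact_agreement(rater_1, rater_2)
```

In this sketch the same function could be applied to a pair of human raters or to a human rater paired with an automated system, which is the kind of comparison reported above.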
This application describes an automated text analysis module that provides text reading difficulty estimates expressed on a U.S. grade level (GL) scale. The capability is designed to help teachers and assessment developers locate (or create) texts appropriate for use on reading assessments targeted at students in grades 3 through 12. Before describing the many innovative aspects of this new capability, we provide a brief review of existing approaches for assessing text reading difficulty.
A Review of Existing Approaches for Assessing Text Reading Difficulty
Early attempts to assess text reading difficulty are reviewed in Klare (1984). Four popular approaches are described: the Flesch Reading Ease Index (Flesch, 1948), the Dale-Chall Readability Formula (Chall & Dale, 1948), the Fry Index (Fry, 1968), and the Flesch-Kincaid GL Score (Dubay, 2004, pp. 49-51). These four approaches, also called readability formulas, are alike in that, in each case, text difficulty is determined from just two independent variables: a single measure of syntactic difficulty and a single measure of semantic difficulty. In all four approaches average sentence length is taken as the single measure of syntactic difficulty. The approaches differ in terms of the specific features selected for use in measuring semantic difficulty. In three of the approaches, i.e., Flesch, Flesch-Kincaid, and Fry, semantic difficulty is assessed via average word length measured in syllables. In the Dale-Chall formula, semantic difficulty is assessed via the average frequency of words expected to be familiar to young readers. In the original Dale-Chall formula, word familiarity was assessed via a 1974 list of words found to be very familiar to fourth-grade students. In a revised version of the Dale-Chall formula published in 1995, semantic difficulty is assessed via an updated list of 3,000 words found to be very familiar to fourth grade students (Chall & Dale, 1995, pp. 16-29).
A number of additional readability formulas have been published. These include the Powers-Sumner-Kearl Readability Formula (Dubay, 2004, pp. 43-45), the Coleman-Liau Formula (Coleman & Liau, 1975), the Bormuth Formula (Dubay, 2004, pp. 43-45) and the Gunning FOG formula (Gunning, 1952, pp. 29-39). As in the formulas discussed above, these additional formulas capture just two aspects of text variation: syntactic complexity, measured via average sentence length, and semantic difficulty, measured via average word length and/or average word familiarity.
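To make the two-variable structure of these formulas concrete, the following minimal sketch computes the Flesch Reading Ease Index and the Flesch-Kincaid GL Score from average sentence length and average syllables per word. The coefficients are those commonly published for the two formulas and are not taken from this application. The sentence splitter and the vowel-group syllable counter are crude assumptions made for illustration; practical implementations typically use more careful tokenization and syllabification.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels (a heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_features(text: str):
    """Return (average sentence length in words, average syllables per word)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    avg_sentence_length = len(words) / len(sentences)
    avg_syllables = sum(count_syllables(w) for w in words) / len(words)
    return avg_sentence_length, avg_syllables


def flesch_reading_ease(text: str) -> float:
    """Higher scores indicate easier text."""
    asl, asw = text_features(text)
    return 206.835 - 1.015 * asl - 84.6 * asw


def flesch_kincaid_grade(text: str) -> float:
    """Score expressed on a U.S. grade level scale."""
    asl, asw = text_features(text)
    return 0.39 * asl + 11.8 * asw - 15.59


sample = "The cat sat on the mat. The dog ran."
ease = flesch_reading_ease(sample)
grade = flesch_kincaid_grade(sample)
```

Both functions depend on the same two inputs, which illustrates why any aspect of a text not reflected in sentence length or word length leaves the resulting scores unchanged.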
A characteristic shared by all of the formulas discussed above is that, in each case, only a limited amount of computing power is needed for feature extraction. In 1988, Stenner, Horabin, Smith and Smith proposed an updated text difficulty prediction system that was designed to take advantage of recent increases in computing power. This new system, termed the Lexile Framework, is now widely used in elementary and middle school classrooms throughout the United States. Like the early readability formulas discussed above, however, the Lexile Framework considers just two aspects of text variation: syntactic difficulty and semantic difficulty. Syntactic difficulty is assessed via log average sentence length, and semantic difficulty is assessed by first using a word frequency index to assign an individual frequency estimate to each word in the text, and then averaging over those estimates to obtain a single, text-level estimate of reading difficulty. The individual word frequency estimates employed in the calculations were developed from a large corpus of natural language text selected to represent the reading materials typically considered by students in their home and school-based reading.
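The two features described above may be sketched as follows. This is not the Lexile equation, which is not reproduced in this application; the sketch only shows how a log average sentence length and an averaged word frequency estimate could be computed. The frequency table, its values, and the default assigned to unlisted words are hypothetical.

```python
import math
import re

# Hypothetical word frequency index: one frequency estimate per word.
FREQUENCY_INDEX = {"the": 6.0, "cat": 3.0, "sat": 3.0, "on": 5.0, "mat": 1.0}
DEFAULT_ESTIMATE = 0.0  # assumed value for words missing from the index


def two_feature_profile(text: str):
    """Return (log average sentence length, average word frequency estimate)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # Syntactic feature: natural log of the mean number of words per sentence.
    log_avg_sentence_length = math.log(len(words) / len(sentences))
    # Semantic feature: mean of the per-word frequency estimates.
    avg_frequency = sum(
        FREQUENCY_INDEX.get(w, DEFAULT_ESTIMATE) for w in words
    ) / len(words)
    return log_avg_sentence_length, avg_frequency


syntactic, semantic = two_feature_profile("The cat sat on the mat.")
```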
Although the approaches discussed above have been frequently praised for being both helpful and easy to use, a number of limitations have also been noted. One important limitation is that, as is noted above, only two dimensions of text variation are considered: syntactic difficulty and semantic difficulty. Sawyer (1991) argues that this simple model of text difficulty is “misleading and overly simplistic” (p. 309). Similarly, Coupland (cited in Klare, 1984) notes that “the simplicity of . . . readability formulas . . . does not seem compatible with the extreme complexity of what is being assessed” (p. 15). Holland (1981) reports a similar conclusion: “While sentence length and word frequency do contribute to the difficulty of a document, a number of equally important variables elude and sometimes run at cross purposes to the formulas . . . ” (p. 15).
Perhaps the most worrisome criticisms have been voiced by researchers who have attempted to manipulate text difficulty by manipulating sentence length and word familiarity. For example, Beck, McKeown & Worthy (1995) reported that, contrary to expectation, texts that were revised to include shorter sentences and more familiar words tended to yield decreases in comprehension, not increases. Similar results are reported in Britton & Gulgoz (1991) and in Pearson & Hamm (2005).
Researchers have also argued that a key limitation of existing readability formulas is their inability to account for discourse level factors such as the amount of referential cohesion present in a text (Graesser, McNamara, Louwerse & Cai, 2004; McNamara, Ozuru, Graesser & Louwerse, 2006; Crossley, Dufty, McCarthy & McNamara, 2007). McNamara, et al. (2006) define referential cohesion as the extent to which sentences appearing later in a discourse refer back to sentences appearing earlier in the discourse. They note that a referentially cohesive text spells out what another text might leave implicit, thereby reducing the need for bridging inferences. For this reason, texts with higher levels of referential cohesion are expected to be easier to comprehend than texts with lower levels of referential cohesion.
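As a rough illustration of the definition above, the following sketch scores a text by the share of adjacent sentence pairs that repeat at least one content word. This is a simplified word-overlap proxy chosen for illustration; it is not one of the Coh-Metrix indices, it ignores pronouns and synonyms, and the stop list is a small hypothetical stand-in.

```python
import re

# Small illustrative stop list (an assumption; real systems use fuller lists).
STOP_WORDS = {
    "a", "an", "the", "is", "are", "was", "were",
    "it", "they", "of", "in", "on", "and", "to",
}


def content_words(sentence: str) -> set:
    """Lowercased words of a sentence with stop words removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in words if w not in STOP_WORDS}


def adjacent_overlap(text: str) -> float:
    """Share of adjacent sentence pairs that repeat at least one content word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(sentences) < 2:
        raise ValueError("need at least two sentences")
    pairs = list(zip(sentences, sentences[1:]))
    linked = sum(
        1 for prev, curr in pairs if content_words(prev) & content_words(curr)
    )
    return linked / len(pairs)


explicit = "Cells contain a nucleus. The nucleus stores DNA. DNA encodes proteins."
implicit = "Cells contain a nucleus. It stores genetic material. Proteins are encoded there."
explicit_score = adjacent_overlap(explicit)
implicit_score = adjacent_overlap(implicit)
```

The first sample repeats a noun across each sentence boundary, while the second leaves the connection to be inferred by the reader, so the proxy scores the first sample higher.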
Graesser, et al. (2004) describe an automated text analysis system designed to measure various types of text cohesion. This new system is called Coh-Metrix in order to emphasize the crucial role of cohesion variation in determining text difficulty. Coh-Metrix includes 40 different indices of text cohesion. McNamara, et al. (2006) examined the performance of these indices relative to the task of detecting intentional cohesion manipulations made by experts in text comprehension. The experts created two versions of each of 19 different texts: a low-cohesion version and a high-cohesion version. The performance of each index relative to the task of distinguishing between these two versions was then examined. Significant differences were observed for 28 of the 40 indices. Importantly, however, neither Graesser et al. (2004) nor McNamara et al. (2006) proposed a new readability formula. Rather, each group focused exclusively on the development and evaluation of alternative approaches for measuring text cohesion.
A subsequent analysis of Coh-Metrix features is reported in Crossley, et al. (2007). These researchers investigated whether Coh-Metrix indices of referential cohesion could yield improved estimates of text readability when considered in combination with the classical readability features of average sentence length and average word frequency. The analysis suggested that a strategy of adding a measure of referential cohesion to a model that already includes measures of average sentence length and average word frequency would, in fact, contribute to enhanced predictive accuracy.
A three-feature model is also presented in Sheehan, Kostin, Futagi & Sabatini (2006). In addition to measures of syntactic complexity and semantic difficulty, their model also includes a measure of the prevalence of linguistic structures that are known to be more characteristic of spontaneous spoken language than of printed language.
Innovative estimation techniques designed to accommodate even larger numbers of independent variables have also been considered. For example, Petersen and Ostendorf (2006) describe a support vector machine designed to classify texts as either appropriate or not appropriate for students reading at any of four different grade levels ranging from second to fifth grade. Their approach considers a total of 26 features, including 20 different measures of vocabulary usage, and six different measures of syntactic complexity.
The ability to consider large numbers of independent variables simultaneously is also a feature of the text analysis system described in Heilman, Collins-Thompson, Callan & Eskenazi (2007). These authors employed a Naïve Bayes approach to simultaneously evaluate a large number of lexical features (i.e., word unigrams) and a large number of grammatical features (i.e., frequencies of grammatically complex constructions). Similar findings are reported in Heilman, Collins-Thompson & Eskenazi (2008): models composed of word unigrams and frequencies of grammatical constructions proved effective at predicting human judgments of text GL.
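For illustration, a Naïve Bayes classifier over word unigrams may be sketched as shown below. This is a generic multinomial formulation with add-one smoothing, offered under the assumption that it conveys the general idea; it is not the cited authors' implementation, it omits the grammatical features, and the class name, training texts, and GL labels are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split a text into lowercased word unigrams."""
    return re.findall(r"[a-z']+", text.lower())


class UnigramNaiveBayes:
    """Multinomial Naive Bayes over word unigrams with add-one smoothing."""

    def fit(self, texts, grade_levels):
        self.word_counts = defaultdict(Counter)  # GL -> unigram counts
        self.doc_counts = Counter(grade_levels)  # GL -> number of texts
        for text, gl in zip(texts, grade_levels):
            self.word_counts[gl].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_gl, best_score = None, -math.inf
        for gl, n_docs in self.doc_counts.items():
            counts = self.word_counts[gl]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocabulary:  # skip words unseen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_gl, best_score = gl, score
        return best_gl


# Tiny made-up training set (illustrative only).
train_texts = [
    "the cat sat on the mat",
    "the dog ran to the park",
    "the senate ratified the controversial treaty",
    "economic policy influences inflation",
]
train_gls = [3, 3, 9, 9]
model = UnigramNaiveBayes().fit(train_texts, train_gls)
easy_prediction = model.predict("the dog sat")
hard_prediction = model.predict("the treaty influences policy")
```

Because every unigram contributes its own term to the score, a model of this kind can weigh a large number of lexical features at once, in contrast to the two-variable formulas reviewed earlier.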
The reading level assessment system described in Sheehan, Kostin & Futagi (2007c) also incorporates a large number of text features. That system is unique in that (a) a tree-based regression approach is used to model text difficulty, and (b) distinct models are provided for literary and expository texts.
Limitations of Existing Approaches
Each of the approaches reviewed above suffers from one or more of the following limitations:
    (1) The approach does not provide difficulty predictions expressed on a GL scale that is aligned with published state reading standards.
    (2) The approach does not account for the fact that many important linguistic features interact significantly with genre.
    (3) The approach does not account for the fact that many important linguistic features exhibit strong intercorrelations.
    (4) The approach considers just two dimensions of text variation: syntactic complexity and semantic difficulty.
    (5) The approach does not provide feedback for use in adapting text content so that resulting “adapted” texts are more appropriately configured for students reading at particular targeted reading GLs.
Additional information about each limitation is summarized below.
Limitation #1: The Specified GL Scale is not Aligned with Published State Reading Standards
Every modeling application requires an approach for ensuring that the predictions generated by the model are reported on an appropriate scale. Defining an appropriate prediction scale for a text difficulty modeling application is particularly challenging because (a) the “true” difficulty level of a passage is never directly observed, and (b) in some cases, there is a further requirement that the application yield text difficulty predictions that are reasonably well aligned with published state reading standards.
In many of the prediction models reviewed above, output predictions are reported on a U.S. GL scale. Four different techniques have been used to establish these scales: (a) cloze fill-in rates; (b) small-scale rating studies; (c) item difficulty studies; and (d) Web downloads. These four techniques are described below.
Validation information collected via a cloze fill-in approach is reported for a number of different models including the Bormuth readability formula (Dubay, 2004, pp. 43-45), the Dale-Chall readability formula (Chall and Dale, 1995, pp. 1-44, 55-66) and the model presented in Crossley, et al. (2007). A modified cloze fill-in approach is one of several approaches used to validate the Lexile Framework (Stenner, et al., 1988).
The basic cloze fill-in approach includes three steps: first, passages are administered with every fifth word deleted and examinees are asked to “fill-in” the missing words; second, the average probability of a correct fill-in is calculated for each passage; and third, a linking procedure is used to re-express the resulting probability estimates on a U.S. GL scale. Note that the validity of this approach rests on the assumption that passages with high fill-in probabilities are easier to comprehend, while passages with low fill-in probabilities are harder to comprehend. Shanahan, Kamil, and Tobin (1983) evaluated this assumption by comparing students' performances on cloze items administered under four different passage conditions:
    (a) intact passages;
    (b) scrambled passages (with sentences randomly reordered);
    (c) intermingled passages (with sentences from different passages interspersed); and
    (d) eclectic passages (collections of unrelated sentences).
After observing similar cloze fill-in rates under all four conditions, Shanahan et al. (1983) concluded that cloze fill-in rates do not provide useful information about “intersentential” comprehension, that is, comprehension that requires integrating information across sentence boundaries. This suggests that, while cloze fill-in rates may provide useful information about the difficulties experienced by readers when attempting to comprehend the individual sentences in a text, they do not provide useful information about the difficulties experienced by readers when attempting to infer connections between sentences. This finding was later replicated by Leys, Fielding, Herman & Pearson (1983). Kintsch and Yarbrough (1982) reported a related finding: cloze fill-in rates failed to distinguish between passages classified as requiring low or high levels of macroprocessing, that is, processing directed at developing a useful mental model of the information presented in a text.
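The first two steps of the basic cloze fill-in approach described above may be sketched as follows. The function names and sample data are hypothetical, exact-match scoring of responses is an assumption, and the third step (linking fill-in probabilities to a U.S. GL scale) is not shown because it depends on calibration data.

```python
import re


def make_cloze(passage: str, n: int = 5, blank: str = "_____"):
    """Delete every n-th word; return (cloze text, list of deleted words)."""
    words = passage.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(words[i])
        words[i] = blank
    return " ".join(words), answers


def normalize(word: str) -> str:
    """Lowercase a word and strip punctuation before comparison."""
    return re.sub(r"[^a-z0-9']", "", word.lower())


def fill_in_rate(answers, responses_by_examinee) -> float:
    """Average probability of a correct fill-in across examinees and blanks."""
    total = 0
    correct = 0
    for responses in responses_by_examinee:
        for answer, response in zip(answers, responses):
            total += 1
            if normalize(answer) == normalize(response):
                correct += 1
    if total == 0:
        raise ValueError("no responses to score")
    return correct / total


passage = (
    "The quick brown fox jumps over the lazy dog "
    "near the quiet river bank today"
)
cloze_text, answers = make_cloze(passage)
# Two made-up examinees (illustrative only).
responses = [["jumps", "near", "today"], ["leaps", "near", "now"]]
rate = fill_in_rate(answers, responses)
```

Note that each blank in this sketch can often be filled from the words in its own sentence, which is consistent with the finding that fill-in rates changed little when sentence order was disturbed.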
Responses to multiple-choice reading comprehension items have also been used to establish output scales for text difficulty modeling applications (e.g., Chall and Dale, 1995; Stenner, 1996). In this approach, the “true” difficulty level of a text is estimated from the average difficulty of its associated items and a linking technique is used to re-express the resulting difficulty predictions on a U.S. GL scale. A problem with this approach is that, while item difficulty is surely related to passage difficulty, several previous studies have suggested that difficulty estimates developed for multiple-choice reading comprehension items also incorporate variation due to non-passage factors such as distractor plausibility, where the term “distractor” refers to the incorrect options that are presented along with the correct option (Embretson & Wetzel, 1987; Freedle & Kostin, 1991; Gorin & Embretson, 2006; Sheehan, Kostin & Persky, 2006).
Small-scale rating studies have also been used to establish scales for use with automated text difficulty prediction systems. For example, Pitler & Nenkova (2008) created a five-point difficulty scale by asking a group of three college students to rate each of 30 different Wall Street Journal articles on a 1-5 scale. Such scales suffer from each of the following limitations: (a) sample sizes are typically quite small (e.g., just 30 in the Pitler example); (b) interpretation is problematic (e.g., What does a difficulty estimate of “5” mean?); (c) ratings are not generated in a high-stakes environment; and (d) resulting text difficulty classifications are not aligned with published state reading standards.
Researchers have also generated text difficulty scales from GL classifications provided by textbook publishers or Web content providers. The training data described in Heilman et al. (2007) and Heilman, et al. (2008) illustrate this approach. In each case, training texts were downloaded from Web pages classified as appropriate for readers at specified GLs. Of the 289 texts collected in this manner, approximately half were authored by students at the specified GL, and half were authored by teachers or writers. In each case, either the text itself, or a link to it, identified the text as appropriate for students at a particular GL. This approach offers two advantages: (a) it is capable of yielding large numbers of training documents, and (b) it provides text difficulty classifications that capture variation due to both inter- and intra-sentential comprehension. But certain limitations also apply: (a) difficulty classifications are not generated in a high-stakes environment, (b) classification procedures are not published (so that the specific factors considered during text classification are not known and users have no way of determining whether the resulting predictions are aligned with published state reading standards) and (c) there is no preset process for detecting and correcting misclassifications.
As the above summary suggests, the lack of a carefully developed, well-aligned set of training texts is a serious weakness of many existing approaches for predicting text reading difficulty.
Limitation #2: Estimation Methodologies are not Designed to Account for Interactions with Text Genre
Research conducted over the past 20 years suggests that many important predictors of text difficulty interact significantly with text genre. This research includes a host of studies documenting significant differences in the characteristics of informational and literary texts, and in the strategies adopted by readers during the process of attempting to make sense of these two types of texts. Differences have been reported in the frequency of “core” vocabulary words (Lee, 2001); in the way that cohesion relations are expressed (McCarthy, Graesser & McNamara, 2006); in the types of comprehension strategies utilized (Kucan & Beck, 1997); in the rate at which individual paragraphs are read (Zabrucky & Moore, 1999); in the types of inferences generated during reading (van den Broek, Everson, Virtue, Sung & Tzeng, 2002); and in the type of prior knowledge accessed during inference generation (Best, Floyd & McNamara, 2004).
Several explanations for these differences have been proposed. In one view, literary texts (e.g., fictional stories and memoirs) are said to require different processing strategies because they deal with more familiar concepts and ideas (Graesser, McNamara & Louwerse, 2003). For example, while many literary texts employ familiar story grammars that are known to even extremely young children, informational texts tend to employ less well known structures such as cause-effect, comparison-contrast, and problem-resolution.
Genre-specific processing differences have also been attributed to differences in the types of vocabularies employed. For example, Lee (2001) examined variation in the frequency of “core” vocabulary words within a corpus of informational and literary texts that included over one million words downloaded from the British National Corpus. Core vocabulary was defined in terms of a list of 2000 common words classified as appropriate for use in the dictionary definitions presented in the Longman Dictionary of Contemporary English. The analyses demonstrated that core vocabulary usage was higher in literary texts than in informational texts. For example, when literary texts such as fiction, poetry and drama were considered, the percent of total words classified as “core” vocabulary ranged from 81% to 84%. By contrast, when informational texts such as science and social studies texts were considered, the percent of total words classified as “core” vocabulary ranged from 66% to 71%. In interpreting these results Lee suggested that the creativity and imaginativeness typically associated with literary writing may be less closely tied to the type or level of vocabulary employed and more closely tied to the way that core words are used and combined. Note that this implies that an individual word detected in a literary text may not be indicative of the same level of processing challenge as that same word detected in an informational text.
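The core vocabulary statistic discussed above, i.e., the percent of total words in a text that appear on a core word list, may be sketched as follows. The word list shown is a tiny hypothetical stand-in for the 2000-word list, and the two sample sentences are made up for illustration.

```python
import re

# Hypothetical stand-in for a core vocabulary list (the real list has 2000 words).
CORE_WORDS = {
    "the", "a", "of", "in", "was", "is",
    "she", "door", "opened", "and", "walked",
}


def core_vocabulary_percent(text: str, core_words: set) -> float:
    """Percent of total words in the text that belong to the core list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text must contain at least one word")
    core_count = sum(1 for w in words if w in core_words)
    return 100.0 * core_count / len(words)


literary = "She opened the door and walked in."
informational = "Mitochondria synthesize adenosine triphosphate in the cell."
literary_pct = core_vocabulary_percent(literary, CORE_WORDS)
informational_pct = core_vocabulary_percent(informational, CORE_WORDS)
```

A statistic of this kind counts running words (tokens) rather than distinct words, so frequently repeated core words raise the percentage.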
Significant genre-related differences have also been reported in more recent corpus-based analyses. For example, McCarthy et al. (2006) reported higher levels of referential cohesion in expository texts as compared to narratives even though the two corpora studied were targeted at equivalent populations of readers, i.e., students in grades kindergarten through college. These results suggest that it may also be the case that a particular level of referential cohesion detected in an expository text may not necessarily be indicative of the same type of processing challenge as that same level detected in a narrative text.
Explanations of informational/literary processing differences have also been cast in terms of the processing distinctions emphasized in Kintsch's (1988) model of reading comprehension. That model, termed the Construction Integration Model, posits three separable, yet interacting processing levels. First, word recognition and decoding processes are used to translate the written code into meaningful language units called propositions. Next, interrelationships among the propositions are clarified. Depending on the characteristics of the text and the reader's goals, this processing could involve reader-generated bridging inferences designed to fill in gaps and establish coherence. Kintsch argues that this process culminates in the development of a network representation of the text called the textbase. While only text-based inferences are generated during the construction of the textbase, knowledge-based inferences may also be needed to completely satisfy a reader's goals. Consequently, a third level of processing is also frequently implemented. This third level involves reconciling the current text with relevant prior knowledge and experience to provide a more complete, more integrated model of the situation presented in the text, i.e., what Kintsch terms the situation model.
Best, et al. (2004) discuss differences in the type of prior knowledge accessed during situation model development for expository vs. narrative texts. They note that, for expository texts, situation model processing involves integrating the textbase with readers' prior knowledge of the subject matter, and since a given reader's prior knowledge may not always be sufficient, resulting situation models may fail to maintain the author's intended meaning. For narrative texts, by contrast, situation model processing typically involves generating inferences about the characters, settings, actions and events in the reader's mental representation of the story, an activity that is much less likely to be affected by deficiencies in required prior knowledge.
Although few would dispute the informational/literary distinctions summarized above, text difficulty models that account for these differences are rare. In particular, in all but one of the text difficulty prediction systems reviewed above, a single prediction equation is assumed to hold for both informational and literary texts. The one exception to this trend is the difficulty model described in Sheehan, Kostin & Futagi (2007c), which provides two distinct difficulty models: one optimized for informational texts and one optimized for literary texts.
Limitation #3: Estimation Procedures are not Designed to Account for the Strong Intercorrelations that may Exist among Important Text Features
The extreme complexity of the reading comprehension process suggests that large numbers of text features may be needed to adequately explain variation in text difficulty. In many popular difficulty modeling approaches, however, models are estimated from a mere handful of text features. For example, both the Flesch-Kincaid GL Score and the Lexile Framework rely on just two features. This surprising result may be due to the difficulty of accounting for the strong intercorrelations that are likely to exist among many related text features.
Biber (1986, 1988) and Biber, Conrad, Reppen, Byrd, Helt & Clark (2004) describe an approach for characterizing text variation when the available text features exhibit strong intercorrelations. In this approach, corpus-based multidimensional techniques are used to locate clusters of features that simultaneously exhibit high within-cluster correlation and low between-cluster correlation. Linear combinations defined in terms of the identified feature clusters are then employed for text characterization. Biber and his colleagues justify this approach by noting that (a) because many important aspects of text variation are not well captured by individual linguistic features, investigation of such characteristics requires a focus on “constellations of co-occurring linguistic features” as opposed to individual features, and (b) multidimensional techniques applied to large representative corpora may help researchers better understand and interpret those constellations by allowing patterns of linguistic co-occurrence to be analyzed in terms of underlying dimensions of variation that are identified quantitatively.
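The general idea of grouping correlated features and then scoring texts on linear combinations of each group may be sketched as follows. This is a deliberately simplified stand-in: it uses a greedy correlation threshold rather than the factor-analytic techniques employed in the cited work, and the feature names, feature values, and threshold are hypothetical. The sketch assumes no feature is constant across texts.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_features(features, threshold=0.8):
    """Greedily group features that correlate strongly with a cluster's first member."""
    clusters = []
    for name in features:
        for cluster in clusters:
            if pearson(features[name], features[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def zscores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def dimension_scores(features, cluster):
    """Per-text score: mean of the standardized feature values in one cluster."""
    standardized = [zscores(features[name]) for name in cluster]
    return [sum(column) / len(cluster) for column in zip(*standardized)]


# Made-up feature values for five texts (illustrative only).
features = {
    "avg_sentence_length": [8, 12, 15, 20, 25],
    "clauses_per_sentence": [1.0, 1.4, 1.8, 2.3, 2.9],
    "pronoun_rate": [9, 3, 8, 2, 7],
}
clusters = cluster_features(features)
syntax_scores = dimension_scores(features, clusters[0])
```

In this sketch the two sentence-structure features fall into one cluster and are summarized by a single score per text, while the weakly correlated pronoun feature forms its own cluster.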
In Biber et al. (2004), a corpus-based multidimensional approach is applied to the problem of selecting texts appropriate for use on the Test of English as a Foreign Language (TOEFL). System development involved first using a principal factor analysis to develop linear combinations of text features for use in text evaluation, and then using the resulting “dimension scores” to compare candidate texts to existing TOEFL passages. Sheehan, et al. (2007a, 2007b) employ a similar approach to define independent variables for use in modeling the acceptability status of candidate source texts selected for use on the GRE Verbal Section. Dimension scores developed in a multidimensional analysis of a large corpus have also been used to examine differences in the patterns of text variation observed in reading materials classified as exhibiting low or high Lexile Scores (Deane, Sheehan, Sabatini, Futagi and Kostin, 2006). And finally, Louwerse, McCarthy, McNamara and Graesser (2004) employ a similar approach to examine variation in a set of cohesion indices.
In considering the analyses summarized above it is important to note that, while each employed linear combinations of correlated text features to explore some aspect of text variation, none of these previous applications were designed to predict variation in text difficulty, and none provide text GL predictions that are reflective of the GL distinctions specified in published state reading standards.
Limitation #4: Only Two Dimensions of Text Variation are Considered: Syntactic Complexity and Semantic Difficulty
Early efforts to automatically assess text difficulty focused on two particular dimensions of text variation: syntactic complexity and semantic difficulty. While innovative approaches for measuring these two important dimensions continue to be published, attempts to measure additional dimensions are rare. Text processing models such as Kintsch's Construction Integration model (1988) suggest that text difficulty prediction models that measure additional dimensions of text variation, over and above the traditional readability dimensions of syntactic complexity and semantic difficulty, may provide more precise information about the aspects of text variation that account for students' observed comprehension difficulties.
Limitation #5: Feedback is not Designed to Help Users Develop High Quality Text Adaptations
Text adaptation is the process of adjusting text characteristics so that the resulting “adapted” text exhibits combinations of features that are characteristic of a particular targeted GL. Previous research has suggested that (a) adaptation strategies developed from overly simplistic models of text variation can result in adapted texts that are not characteristic of targeted GLs, and (b) such texts frequently fail to elicit the types of performances expected of students with known proficiency profiles (Beck, et al. 1995; Britton & Gulgoz, 1991; Pearson & Hamm, 2005). Text reading difficulty models that are more reflective of the patterns of text variation observed in naturally occurring texts may yield feedback that is more appropriate for use in text adaptation activities.
The research summarized above highlights the need for a text reading difficulty prediction module that (a) yields text reading difficulty predictions expressed on a U.S. GL scale that is reasonably well aligned with published state reading standards; (b) accounts for the fact that many important linguistic features interact significantly with text genre; (c) accounts for the fact that many important linguistic features exhibit strong intercorrelations; (d) addresses multiple aspects of text variation, including aspects that are not accounted for by the classic readability dimensions of syntactic complexity and semantic difficulty, and (e) provides feedback for use in creating high quality text adaptations. This application describes a new text reading difficulty prediction module designed to address these concerns.