The present invention relates to standardized test evaluation. More particularly, the present invention relates to a tree-based approach to proficiency scaling and diagnostic assessment of standardized test results.
The traditional outcome of an educational test is a set of test scores reflecting the numbers of correct and incorrect responses provided by each student. While such scores may provide reliable and stable information about students' standing relative to a group, they fall short of indicating the specific patterns of skill mastery underlying students' observed item responses. Such additional information may help students and teachers better understand the meaning of test scores and the kinds of learning which might help to improve those scores.
Procedures for translating observed test results into instructionally-relevant statements about students' underlying patterns of skill mastery may be designed to provide student-level diagnostic information or group-level diagnostic information. Student-level diagnoses characterize the individual strengths and weaknesses of individual students. Group-level diagnoses characterize the strengths and weaknesses expected for students scoring at specified points on a test's reported score scale. A collection of group-level diagnoses designed to span a test's reported score range is termed a proficiency scale.
Both group- and student-level diagnoses can provide useful feedback. The detailed information available from a student-level diagnosis can help human or computerized tutors design highly individualized instructional intervention. The cross-sectional view provided by a set of group-level diagnoses can be used to: (a) demonstrate that the skills tapped by a particular measurement instrument are in fact those deemed important to measure, and (b) suggest likely areas of improvement for individual students. Both types of diagnoses can also be used to inform course placement decisions.
Procedures for generating group-level and/or student-level diagnoses have been proposed by a number of researchers. Beaton and Allen proposed a procedure called Scale Anchoring which involved (a) identifying subsets of test items which provided superior discrimination at successive points on a test's reported score scale; and (b) asking subject-area experts to review the items and provide detailed descriptions of the specific cognitive skills that groups of students at or close to the selected score points would be expected to have mastered. (Beaton, A. E. and N. L. Allen, Interpreting scales through scale anchoring, Journal of Educational Statistics, vol. 17, pp. 191-204, 1992.) This procedure provides a small number of group-level diagnoses, but no student-level diagnoses. The estimated group-level diagnoses are specified in terms of the combinations of skills needed to solve items located at increasingly higher levels on a test's reported score scale.
Tatsuoka, Birenbaum, Lewis, and Sheehan outlined an approach which provides both student- and group-level diagnoses. (Tatsuoka, K. K., Architecture of knowledge structures and cognitive diagnosis, P. Nichols, S. Chipman and R. Brennan, Eds., Cognitively diagnostic assessment, Hillsdale, N.J.: Lawrence Erlbaum Associates, 1995. Tatsuoka, K., M. Birenbaum, C. Lewis, and K. Sheehan, Proficiency scaling based on conditional probability functions for attributes, ETS Research Report No. RR-93-50ONR, Princeton, N.J.: Educational Testing Service, 1993.) Student-level diagnoses are generated by first hypothesizing a large number of latent skill mastery states and then using a Mahalanobis distance test (i.e., the Rule Space procedure) to classify as many examinees as possible into one or another of the hypothesized states. The classified examinees' hypothesized skill mastery patterns (i.e., master/nonmaster status on each of k skills) are then summarized to provide group-level descriptions of the skill mastery status expected for students scoring at successive points on a test's reported score scale. For example, in an analysis of 180 mathematics items selected from the Scholastic Assessment Test (SAT 1), 94% of 6,000 examinees were classified into one of 2,850 hypothesized skill mastery states (Tatsuoka, 1995, p. 348).
Gitomer and Yamamoto generate student-level diagnoses using the Hybrid Model. (Gitomer, D. H. and K. Yamamoto, Performance modeling that integrates latent trait and latent class theory, Journal of Educational Measurement, vol. 28, pp. 173-189, 1991.) In this approach, likelihood-based inference techniques are used to classify as many examinees as possible into a small number of hypothesized skill mastery states. For example, in an analysis of 288 logic gate items, 30% of 255 examinees were classified into one of five hypothesized skill mastery states (Gitomer and Yamamoto at 183). For each of the remaining examinees, Gitomer et al. provided an Item Response Theory (IRT) ability estimate which indicated standing relative to other examinees but provided no additional information about skill mastery.
Mislevy, Gitomer, and Steinberg generate student-level diagnoses using a Bayesian inference network. (Mislevy, R. J., Probability-based inference in cognitive diagnosis, P. Nichols, S. Chipman, and R. Brennan, Eds., Cognitively diagnostic assessment, Hillsdale, N.J.: Lawrence Erlbaum Associates, 1995. Gitomer, D. H., L. S. Steinberg, and R. J. Mislevy, Diagnostic assessment of troubleshooting skill in an intelligent tutoring system, P. Nichols, S. Chipman, and R. Brennan, Eds., Cognitively diagnostic assessment, Hillsdale, N.J.: Lawrence Erlbaum Associates, 1995.) This approach differs from the approaches described previously in two important respects: (1) students' observed item responses are modeled conditional on a multivariate vector of latent student-level proficiencies, and (2) multiple sources of information are considered when diagnosing mastery status on each of the hypothesized proficiencies. For example, in an analysis of fifteen fraction subtraction problems, nine student-level variables were hypothesized and information about individual skill mastery probabilities was derived from two sources: population-level skill mastery base rates and examinees' observed item response vectors (Mislevy, 1995).
In each of the diagnostic approaches described above, it is assumed that the test under consideration is a broad-based proficiency test such as those that are typically used in educational settings. Lewis and Sheehan consider the problem of generating student-level diagnoses when the item response data is collected via a mastery test, that is, a test designed to provide accurate measurement at a single underlying proficiency level, such as a pass/fail point. (Lewis, C. and K. M. Sheehan, Using Bayesian decision theory to design a computerized mastery test, Applied Psychological Measurement, vol. 14, pp. 367-386, 1990. Sheehan, K. M. and C. Lewis, Computerized mastery testing with nonequivalent testlets, Applied Psychological Measurement, vol. 16, pp. 65-76, 1992.) In this approach, decisions regarding the mastery status of individual students are obtained by first specifying a loss function and then using Bayesian decision theory to define a decision rule that minimizes posterior expected loss.
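By way of illustration only, the general idea of a decision rule that minimizes posterior expected loss can be sketched as follows. The sketch is not the procedure of the cited references: it assumes a single fixed-length test, two mastery states with hypothetical per-item success probabilities, a binomial likelihood, and hypothetical loss values for the two kinds of classification error. All function and parameter names are introduced here for illustration.

```python
from math import comb


def posterior_mastery(correct, n_items, p_master, p_nonmaster, prior_master):
    """Posterior probability of mastery given the number of correct responses.

    Assumes (for illustration) a binomial likelihood in each mastery state.
    """
    def binom(p):
        return comb(n_items, correct) * p**correct * (1 - p) ** (n_items - correct)

    numerator = prior_master * binom(p_master)
    return numerator / (numerator + (1 - prior_master) * binom(p_nonmaster))


def decide(post_master, loss_false_pass, loss_false_fail):
    """Choose the action (pass or fail) with the smaller posterior expected loss."""
    # Passing is costly only if the examinee is in fact a nonmaster.
    expected_loss_pass = loss_false_pass * (1 - post_master)
    # Failing is costly only if the examinee is in fact a master.
    expected_loss_fail = loss_false_fail * post_master
    return "pass" if expected_loss_pass <= expected_loss_fail else "fail"


# Hypothetical values: 10 items, masters answer correctly with probability 0.8,
# nonmasters with probability 0.5, and a false pass is twice as costly as a false fail.
high = decide(posterior_mastery(9, 10, 0.8, 0.5, 0.5), 2.0, 1.0)
low = decide(posterior_mastery(3, 10, 0.8, 0.5, 0.5), 2.0, 1.0)
```

Under these assumed values, an examinee with nine correct responses is passed and an examinee with three correct responses is failed, because those actions carry the smaller posterior expected loss.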
The prior art methods are known to be computationally intensive and do not consider any observed data. Moreover, these approaches are form-dependent. That is, the set of knowledge states obtained excludes all states that might have been observed with a different form, but could not have been observed with the current form. Finally, the prior art methods cannot capture states involving significant interaction effects if those effects are not specified in advance.
Thus there is a need in the art for a less computationally intensive method designed to search for, and incorporate, all significant skill-mastery patterns that can be determined from the available item difficulty data. There is a further need in the art for a form-independent approach that provides all of the knowledge states which could have been observed, given the collection of forms considered in the analysis. There is a further need in the art for an approach that automatically incorporates all identified interaction states so that the success of the procedure is not critically dependent on detailed prior knowledge of the precise nature of the true underlying proficiency model.
The present invention fulfills these needs by providing methods for diagnostic assessment and proficiency scaling of test results for a plurality of tests, each test having at least one item and each item having at least one feature. The method of the invention uses as input a vector of item difficulty estimates for each of n items and a matrix of hypothesized skill classifications for each of the n items on each of k skills. A tree-based regression analysis based on the input vector and matrix is used to model ways in which required skills interact with different item features to produce differences in item difficulty. The tree-based analysis identifies combinations of skills required to solve each item.
A plurality of clusters is formed by grouping the items according to a predefined prediction rule based on skill classifications. Preferably, the plurality of clusters is formed by successively splitting the items, based on the identified skill classifications, into increasingly homogeneous subsets called nodes. For example, the clusters can be formed by selecting a locally optimal sequence of splits using a recursive partitioning algorithm to evaluate all possible splits of all possible skill classification variables at each stage of the analysis. In a preferred embodiment, a user can define the first split in the recursive analysis.
Ultimately, a plurality of terminal nodes is formed by grouping the items to minimize deviance among items within each terminal node and maximize deviance among items from different terminal nodes. At this point, a mean value of item difficulty can be determined for a given terminal node based on the items forming that node. The value of item difficulty is then predicted, for each item in the given terminal node, to be the corresponding mean value of item difficulty.
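By way of illustration only, the recursive partitioning described above can be sketched as follows. The sketch assumes binary (0/1) skill classifications, measures deviance as the sum of squared deviations from the node mean, and stops splitting when no split reduces deviance; these choices, the names, and the example data are assumptions made for illustration and are not taken from the specification.

```python
def deviance(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(items, difficulty, skills):
    """Evaluate every skill variable and return the split with the largest
    reduction in deviance as (gain, skill_index, without, with_), or None."""
    parent = deviance([difficulty[i] for i in items])
    best = None
    for j in range(len(skills[0])):
        without = [i for i in items if skills[i][j] == 0]
        with_ = [i for i in items if skills[i][j] == 1]
        if not without or not with_:
            continue  # this skill does not separate the items in this node
        gain = (parent
                - deviance([difficulty[i] for i in without])
                - deviance([difficulty[i] for i in with_]))
        if best is None or gain > best[0]:
            best = (gain, j, without, with_)
    return best


def grow(items, difficulty, skills, min_gain=1e-9):
    """Recursively split items into nodes; terminal nodes store the mean difficulty."""
    split = best_split(items, difficulty, skills)
    if split is None or split[0] <= min_gain:
        values = [difficulty[i] for i in items]
        return {"items": items, "mean": sum(values) / len(values)}
    _, j, without, with_ = split
    return {"skill": j,
            "without": grow(without, difficulty, skills, min_gain),
            "with": grow(with_, difficulty, skills, min_gain)}


def predict(tree, skill_row):
    """Predicted difficulty of an item: the mean of the terminal node it falls in."""
    while "skill" in tree:
        tree = tree["with"] if skill_row[tree["skill"]] else tree["without"]
    return tree["mean"]


# Hypothetical input: difficulty estimates for n = 6 items and an n x k
# matrix of skill classifications with k = 2.
difficulty = [-1.0, -0.8, 0.1, 0.3, 1.2, 1.4]
skills = [[0, 0], [0, 0], [1, 0], [1, 0], [1, 1], [1, 1]]
tree = grow(list(range(len(difficulty))), difficulty, skills)
```

In this sketch each split is chosen greedily, so the resulting sequence of splits is locally rather than globally optimal, consistent with the description above. A user-defined first split could be accommodated by partitioning the items on the chosen skill before calling `grow` on each part.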
A nonparametric smoothing technique is used to summarize student performance on the combinations of skills identified in the tree-based analysis. The smoothing technique results in cluster characteristic curves that provide a probability of responding correctly to items with specified skill requirements. This probability is expressed as a function of underlying test score.
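By way of illustration only, one possible nonparametric smoother is a Gaussian kernel-weighted average (a Nadaraya-Watson estimator). The specification does not name a particular smoothing technique, so the estimator, the bandwidth, and the example data below are assumptions made for illustration.

```python
from math import exp


def cluster_characteristic_curve(scores, cluster_props, grid, bandwidth):
    """Kernel-smoothed proportion correct on one item cluster, by test score.

    scores        -- test score of each examinee
    cluster_props -- proportion of the cluster's items each examinee answered correctly
    grid          -- test scores at which the curve is evaluated
    bandwidth     -- width of the Gaussian kernel, in test score units
    """
    curve = []
    for g in grid:
        # Examinees scoring near g receive the largest weights.
        weights = [exp(-0.5 * ((s - g) / bandwidth) ** 2) for s in scores]
        total = sum(weights)
        curve.append(sum(w * p for w, p in zip(weights, cluster_props)) / total)
    return curve


# Hypothetical data: seven examinees on a 200-800 score scale.
scores = [200, 300, 400, 500, 600, 700, 800]
cluster_props = [0.10, 0.20, 0.35, 0.50, 0.70, 0.85, 0.95]
curve = cluster_characteristic_curve(scores, cluster_props, [300, 500, 700], 100.0)
```

Each value of the resulting curve is a weighted average of observed proportions and therefore lies between 0 and 1, so it can be read as an estimated probability of responding correctly to items in the cluster at the given test score.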
Group-level proficiency profiles are determined from the cluster characteristic curves for groups of examinees at selected underlying test scores. Student-level diagnoses are determined by deriving an expected cluster score from each cluster characteristic curve and comparing a cluster score for each examinee to the expected cluster score.
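By way of illustration only, the comparison of observed and expected cluster scores can be sketched as follows. The sketch assumes that the expected cluster score is the curve's probability at the examinee's test score multiplied by the number of items in the cluster, and that a fixed tolerance separates expected from unexpected performance. The tolerance, the labels, and the example values are hypothetical.

```python
def diagnose(observed, n_items, expected_prob, tolerance=1.0):
    """Compare an examinee's observed cluster scores to the expected cluster scores.

    observed      -- number of items answered correctly in each cluster
    n_items       -- number of items in each cluster
    expected_prob -- cluster characteristic curve value at the examinee's test score
    tolerance     -- hypothetical threshold, in items, for flagging a difference
    """
    result = {}
    for cluster, n in n_items.items():
        expected = expected_prob[cluster] * n
        difference = observed[cluster] - expected
        if difference > tolerance:
            label = "above expectation"
        elif difference < -tolerance:
            label = "below expectation"
        else:
            label = "as expected"
        result[cluster] = (expected, label)
    return result


# Hypothetical examinee: two clusters, "A" with 10 items and "B" with 8 items.
report = diagnose(observed={"A": 9, "B": 1},
                  n_items={"A": 10, "B": 8},
                  expected_prob={"A": 0.9, "B": 0.5})
```

In this example the examinee performs as expected on cluster A and below expectation on cluster B, which would suggest the skills underlying cluster B as a likely area for improvement.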
In another preferred embodiment of a method according to the present invention, a vector of item difficulty estimates for each of n items is defined, along with a matrix of hypothesized skill classifications for each of the n items on each of k hypothesized skills. A tree-based regression technique is used to determine, based on the vector and matrix, the combinations of cognitive skills underlying performance at increasingly advanced levels on the test's underlying proficiency scale. Preferably, the combinations are determined by forming a plurality of terminal nodes by grouping the items to minimize deviance among items within each terminal node and maximize deviance among items from different terminal nodes. The combinations are validated using a classical least squares regression analysis. The set of all possible subsets of combinations of cognitive skills that could have been mastered by an individual examinee is generated and the k hypothesized skills are redefined to form a set of k′ redefined skills such that each of the k′ redefined skills represents one of the terminal nodes.
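By way of illustration only, the redefinition of skills and the generation of all possible subsets can be sketched as follows. The sketch assumes that each item has already been assigned to exactly one terminal node, numbered from 0 to k′ - 1; the function names and example assignment are hypothetical.

```python
from itertools import product


def redefine_skills(node_of_item, n_nodes):
    """Build an n x k' indicator matrix: entry (i, j) is 1 when item i
    belongs to terminal node j, so each redefined skill represents one node."""
    return [[1 if node == j else 0 for j in range(n_nodes)]
            for node in node_of_item]


def all_mastery_states(n_nodes):
    """Enumerate every subset of the k' redefined skills as a 0/1 tuple,
    where 1 marks a skill combination the examinee has mastered."""
    return list(product((0, 1), repeat=n_nodes))


# Hypothetical assignment of five items to k' = 3 terminal nodes.
node_of_item = [0, 0, 2, 1, 2]
redefined = redefine_skills(node_of_item, 3)
states = all_mastery_states(3)
```

The number of enumerated states is 2 raised to the power k′, so the enumeration is practical only when the tree yields a modest number of terminal nodes.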