This invention relates generally to construction of computerized adaptive tests, and in particular to a novel method of utilizing expert test development practices in the construction of adaptive tests.
Conventional multiple-choice tests, which are administered to large numbers of examinees simultaneously by using paper-and-pencil, have been commonly used for educational testing and measurement for many years. Such tests are typically given under standardized conditions, where every examinee takes the same or a parallel test form. This testing strategy represents vastly reduced unit costs over the tests administered individually by examiners that existed during the early part of this century.
However, there remains great interest in restoring some of the advantages of individualized testing. William Turnbull suggested investigations in this direction in 1951 and coined the phrase "tailored testing" to describe this possible paradigm (Lord, 1980, p. 151) (full citations for this and other references are given in the References section below). The construction of individualized tests became a realistic possibility with the advent of Item Response Theory (IRT) (Lord, 1952, 1980) as a psychometric foundation. Beginning in the 1960's, Lord (1970, 1971a) explored this application of IRT by investigating various item selection strategies borrowed from the bioassay field. Later work by Lord (1977, 1980) and Weiss (1976, 1978) laid the foundation for the application of adaptive testing as an alternative to conventional testing.
Adaptive tests are tests in which items are selected to be appropriate for the examinee; the test adapts to the examinee. All but a few proposed designs, for example, Lord's (1971b) flexilevel test, have assumed that items would be chosen and administered to examinees on a computer, thus the term computerized adaptive testing, or CAT. Adaptive testing using multiple-choice items has received increasing attention as a practical alternative to paper-and-pencil tests as the cost of modern computing technology has declined. The Department of Defense has seriously considered its introduction for the Armed Services Vocational Aptitude Battery (CAT-ASVAB) (Wainer, et al., 1990); large testing organizations have explored and implemented CAT, e.g., the implementation of adaptive testing by Educational Testing Service and the College Entrance Examination Board for the College Placement Tests (CPTs) program (College Board, 1990); and certification and licensure organizations are paying increased attention to adaptive testing as a viable alternative (Zara, 1990).
Conventional Test Construction
(1) Constraints on Intrinsic Item Properties
(2) Constraints That Focus on Item Features in Relation to All Other Candidate Items
(3) Constraints on Item Features in Relation to a Subset of All Other Candidate Items
(4) Constraints on the Statistical Properties of Items
Adaptive Test Construction
Binary Programming Model
A Model for Solving Large Problems
Conventional test construction--the construction of multiple-choice tests for paper-and-pencil administration--is time consuming and expensive. Aside from the costs of writing and editing items, items must be assembled into test forms. In typical contexts found in public and private testing organizations, a goal is to construct the most efficient test possible for some measurement purpose. This requires that item selection be subject to various rules that govern whether or not an item may be included in a test form. Such rules are frequently called test specifications and constitute a set of constraints on the selection of items.
These constraints can be considered as falling into four separate categories: (1) constraints that focus on some intrinsic property of an item, (2) constraints that focus on item features in relation to all other candidate items, (3) constraints that focus on item features in relation to a subset of all other candidate items, and (4) constraints on the statistical properties of items as derived from pretesting.
Tests built for a specific measurement purpose typically have explicit constraints on item content. For example, the test specifications for a test in mathematics may specify the number or percentage of items on arithmetic, algebra, and geometry. These specifications may be further elaborated by a specification that a certain percentage of arithmetic items involve operations with whole numbers, a certain percentage involve fractions, a certain percentage involve decimals. Likewise, a percentage might be specified for algebra items involving real numbers as opposed to symbolic representations of numbers, and so forth. It is not unusual for fairly extensive test specifications to identify numerous content categories and subcategories of items and their required percentages or numbers.
In addition to constraints explicitly addressing item content, constraints are typically given for other features intrinsic to an item that are not directly content related. For example, restrictions may be placed on the percentage of sentence completion items that contain one blank as opposed to two blanks, and two blanks as opposed to three blanks. These types of constraints treat the item type or the appearance of the item to the examinee. A second type of constraint not directly related to content may address the reference of the item to certain groups in the population at large, as when, for example, an item with a science content has an incidental reference to a minority or female scientist. Such constraints may also seek to minimize or remove the use of items that contain incidental references that might appear to favor social class or wealth, for example, items dealing with country clubs, golf, polo, etc. These types of constraints are frequently referred to as sensitivity constraints and test specifications frequently are designed to provide a balance of such references, or perhaps an exclusion of such references, in the interest of test fairness.
In addition to these more formal constraints on various features of items, there are frequently other less formal constraints that have developed as part of general good test construction practices for tests of this type. These constraints may seek to make sure that the location of the correct answer appears in random (or nearly random) locations throughout a test, may seek to encourage variety in items by restricting the contribution of items written by one item writer, and so forth.
It is evident that a test must not include an item that reveals the answer to another item. Wainer and Kiely (1987) describe this as cross-information. Kingsbury and Zara (1991) also describe this kind of constraint. In addition to giving direct information about the correct answer to another item, an item can overlap with other items in more subtle ways. Items may test the same or nearly the same point, but appear to be different, as with one item dealing with the sine of 90 degrees and another dealing with the sine of 450 degrees. If the point being tested is sufficiently similar, then one item is redundant and should not be included in the test because it provides no additional information about an examinee.
Items may also overlap with each other in features that are incidental to the purpose of the item. For example, two reading comprehension passages may both be about science and both may contain incidental references to female minority scientists. It is unlikely that test specialists would seek to include both passages in a general test of reading comprehension. We refer to items that give away answers to other items, items that test the same point as others, and items that have similar incidental features as exhibiting content overlap, which must be constrained by the test specifications.
Test specialists who construct verbal tests or test sections involving discrete verbal items, that is, items that are not associated with a reading passage, are concerned that test specifications control a second kind of overlap, here referred to as word overlap. The concern is that relatively uncommon words used in any of the incorrect answer choices should not appear more than once in a test or test section. To do so is to doubly disadvantage those examinees with more limited vocabularies in a manner that is extraneous to the purposes of the test. For example, an incorrect answer choice for a synonym item may be the word "hegira." Test specialists would not want the word "hegira" to then appear in, for example, an incorrect answer choice for a verbal analogy item to be included in the same test.
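The word overlap check described above can be illustrated with a short sketch. The item identifiers, the word lists, and the pre-compiled set of "uncommon" words are all hypothetical; how a testing program actually decides which words count as uncommon is not stated here.

```python
from collections import defaultdict

def find_word_overlap(items, uncommon_words):
    """Return {word: [item ids]} for uncommon words used by more than one item.

    `items` maps a hypothetical item id to the words appearing in its
    incorrect answer choices; `uncommon_words` is an assumed word list.
    """
    uses = defaultdict(list)
    for item_id, distractor_words in items.items():
        # Compare case-insensitively and count each word once per item.
        for word in set(w.lower() for w in distractor_words):
            if word in uncommon_words:
                uses[word].append(item_id)
    return {word: ids for word, ids in uses.items() if len(ids) > 1}

# Illustrative candidate items (not taken from any real pool).
candidate_items = {
    "synonym_07": ["hegira", "voyage", "retreat"],
    "analogy_12": ["Hegira", "anchor", "harbor"],
    "antonym_03": ["placid", "stormy", "voyage"],
}
overlap = find_word_overlap(candidate_items, {"hegira", "placid"})
print(overlap)
```

Any word reported by this check identifies a pair of items that should not both be selected for the same test or test section.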
Some items are related to each other through their relationship to common stimulus material. This occurs when a number of items are based on a common reading passage in a verbal test, or when a number of items are based on a common graph or table or figure in a mathematics test. If test specifications dictate the inclusion of the common stimulus material, then some set of items associated with that material is also included in the test. It may be that there are more items available in a set than need to be included in the test, in which case the test specifications dictate that some subset of the available items be included that best satisfy other constraints or test specifications.
Some items are related to each other not through common stimulus material, but rather through some other feature such as having common directions. For example, a verbal test might include synonyms and antonyms, and it might be confusing to examinees if such items were intermixed. Test specifications typically constrain item ordering so that items with the same directions appear together.
Whether items form groups based on common stimulus material or common directions or some other feature, we will describe these groups as item sets with the intended implication that items belonging to a set may not be intermixed with other items not belonging to the same set.
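As a minimal sketch of the "no intermixing" rule for item sets, the following check verifies that each set occupies one unbroken run of positions in a proposed item order. The item identifiers and set labels are hypothetical, and discrete items are assumed to carry no set label.

```python
def sets_stay_together(ordered_items, set_of):
    """Return True if every item set occupies one unbroken run of positions.

    `set_of` maps an item id to its set label; items absent from the
    mapping are treated as discrete items that belong to no set.
    """
    closed = set()    # sets whose run has already ended
    current = None    # label of the run currently in progress
    for item in ordered_items:
        label = set_of.get(item)
        if label != current:
            if current is not None:
                closed.add(current)
            if label in closed:
                return False   # a set resumed after being interrupted
            current = label
    return True

# Hypothetical labels: two reading passages and one discrete item.
set_of = {"p1a": "passage1", "p1b": "passage1", "p2a": "passage2"}
print(sets_stay_together(["p1a", "p1b", "d1", "p2a"], set_of))
print(sets_stay_together(["p1a", "d1", "p1b", "p2a"], set_of))
```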
Information about the statistical behavior of items may be available from the pretesting of items, that is, the administration of these items to examinees who are similar to the target group of examinees. Test specifications typically constrain the selection of items based on their statistical behavior in order to construct test forms that have the desired measurement properties. If the goal of the measurement is to create parallel editions of the same test, these desired measurement properties are usually specified in terms of the measurement properties of previous test editions. If the goal of the measurement is to create a new test for, say, the awarding of a scholarship or to assess basic skills, test specifications will constrain the selection of items to hard items or easy items respectively.
These constraints typically take the form of specifying some target aggregation of statistical properties, where the statistical properties may be based on conventional difficulty and discrimination or the counterpart characteristics of items found in IRT. If IRT item characteristics are employed, the target might be some combination of item characteristics, as for example, target test information functions. If conventional item statistics are used, the target aggregation is usually specified in terms of frequency distributions of item difficulties and discriminations.
Early Monte Carlo investigations of adaptive testing algorithms concentrated predominantly on the psychometric aspects of test construction (see, for example, Lord, 1970, 1971a, 1971b). Such investigations eventually led to IRT-based algorithms that were fast, efficient, and psychometrically sound. A review of the most frequently used algorithms is given in Wainer, et al. (1990, Chapter 5) and Lord (1980, Chapter 9). The fundamental philosophy underlying these algorithms of the prior art is as follows:
1) An initial item is chosen on some basis and administered to the examinee.
2) Based on the examinee's response to the first item, a second item is chosen and administered. Based on the examinee's response to the first two items, a third item is chosen and administered, etc. In typical paradigms, the examinee's responses to previous items are reflected in an estimate of proficiency that is updated after each new item response is made.
3) The selection of items continues, with the proficiency estimate updated after each item response, until some stopping criterion is met.
4) The examinee's final score is the proficiency estimate after all items are administered.
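The four steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than any of the cited algorithms: it assumes a two-parameter logistic (2PL) response model, a posterior-mean proficiency estimate computed on a grid with a standard normal prior, maximum-information item selection, and a fixed-length stopping rule. The function names and the `answer` callback are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """Probability of a correct response under the assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at proficiency theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_proficiency(responses, grid):
    """Posterior mean on a grid, with a standard normal prior (assumed)."""
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)
        for (a, b), correct in responses:
            p = p_correct(theta, a, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

def run_adaptive_test(pool, answer, max_items):
    """pool: list of (a, b) items; answer(item) -> True/False."""
    grid = [g / 10.0 for g in range(-40, 41)]
    theta = 0.0                      # step 1: initial basis for selection
    remaining = list(pool)
    responses = []
    while remaining and len(responses) < max_items:   # step 3: stopping rule
        # step 2: choose the most informative unused item at the estimate
        item = max(remaining, key=lambda it: item_information(theta, *it))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta = estimate_proficiency(responses, grid)  # update after each response
    return theta, [item for item, _ in responses]      # step 4: final score

pool = [(1.0, b / 2.0) for b in range(-4, 5)]
score, administered = run_adaptive_test(pool, lambda item: True, max_items=5)
```

Note that nothing in this loop looks at item content, overlap, or item sets; selection is driven by psychometric considerations alone.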
When practical implementation became a possibility, if not yet a reality, researchers began to address the incorporation of good test construction practices as well as psychometric considerations into the selection of items in adaptive testing.
One of the first to do so was Lord (1977) in his Broad Range Tailored Test of Verbal Ability. The item pool for this adaptive test consisted of five different types of discrete verbal items. For purposes of comparability or parallelism of adaptive tests, some mechanism is necessary to prevent, for example, one examinee's adaptive test from containing items of only one type and another examinee's test containing only items of a different type. To exert this control, the sequence of item types is specified in advance, for example, the first item administered must be of type A, the second through fifth items must be of type B, and so forth. In this maximum-likelihood-based adaptive test, Lord selects items for administration based on maximum item information for items of the appropriate prespecified type in the sequence at an examinee's estimated level of ability.
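The control mechanism described above, a prespecified sequence of item types combined with maximum-information selection within the required type, can be sketched as follows. The 2PL information function, the item records, and the type labels are assumptions made for illustration; they are not Lord's actual item pool or model.

```python
import math

def information(theta, a, b):
    """Item information under an assumed 2PL model."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def next_item(pool, type_sequence, position, theta, used_ids):
    """Pick the most informative unused item of the type required at `position`."""
    wanted = type_sequence[position]
    candidates = [it for it in pool
                  if it["type"] == wanted and it["id"] not in used_ids]
    if not candidates:
        return None   # the pool cannot supply the required type
    return max(candidates, key=lambda it: information(theta, it["a"], it["b"]))

# Hypothetical pool with two item types.
pool = [
    {"id": "A1", "type": "A", "a": 1.0, "b": -1.0},
    {"id": "A2", "type": "A", "a": 1.0, "b": 0.2},
    {"id": "B1", "type": "B", "a": 1.5, "b": 0.0},
    {"id": "B2", "type": "B", "a": 1.0, "b": 1.0},
]
sequence = ["A", "B", "B"]   # first item of type A, then two of type B
first = next_item(pool, sequence, 0, theta=0.0, used_ids=set())
second = next_item(pool, sequence, 1, theta=0.0, used_ids={first["id"]})
```

The type sequence guarantees comparable content across examinees, but only for the single feature that defines the types.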
In an attempt to control more item features, the approach of specifying the sequence of item types in advance can become much more elaborate, as in the CPTs (Ward, 1988) where the number of item types is as large as 10 or 15. In this context, items are classified as to type predominantly on the basis of intrinsic item features discussed previously. The same kind of control is used in the CAT-ASVAB (Segall, 1987). This type of content control has been called a constrained CAT (C-CAT) by Kingsbury and Zara (1989).
A major disadvantage of this approach of the prior art is that it assumes that item features of interest partition the available item pool into mutually exclusive subsets. Given the number of intrinsic item features that may be of interest to test specialists, the number of mutually exclusive partitions can be very large and the number of items in each partition can become quite small. For example, consider items that can be classified with respect to only 10 different item properties, each property having only two levels. The number of mutually exclusive partitions of such items is $2^{10} - 1$, or over 1000 partitions. Even with a large item pool, the number of items in each mutually exclusive partition can become quite small.
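The arithmetic behind this example is easy to reproduce. The sketch below uses the partition count as stated in the text and divides a few pool sizes, chosen only for illustration, by that count to show how thin each partition becomes on average.

```python
def partition_count(n_binary_properties):
    """Partition count as given in the text for two-level properties."""
    return 2 ** n_binary_properties - 1

partitions = partition_count(10)

# Illustrative pool sizes; the average assumes items spread evenly,
# which real pools will not do.
for pool_size in (300, 1000, 5000):
    print(pool_size, "items ->", round(pool_size / partitions, 2), "per partition")
```

Even the largest of these pools averages fewer than five items per partition.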
Nevertheless, such an approach would be possible except for the fact that no effort is made with this type of control to incorporate considerations of overlap or sets of items. These considerations could in theory be accomplished by further partitioning by overlap group and by set, but the number of partitions would then become enormous.
Wainer and Kiely (1987) and Wainer, et al. (1990) hypothesize that the use of testlets can overcome these problems. Wainer and Kiely define a testlet as a group of items related to a single content area that is developed as a unit and contains a fixed number of predetermined paths that an examinee may follow (1987, p. 190). They suggest that an adaptive test can be constructed from testlets by using the testlet rather than an item as the branching point. Because the number of paths through a fairly small pool of testlets is relatively small, they further suggest that test specialists could examine all possible paths. They hypothesize that this would enable test specialists to enforce constraints on intrinsic item features, overlap, and item sets in the same manner as is currently done with conventional tests.
Kingsbury and Zara (1991) investigated the measurement efficiency of the testlet approach to adaptive testing as compared to the C-CAT approach. Their results show that the testlet approach could require from 4 to 10 times the test length of the C-CAT approach to achieve the same level of precision. Aside from measurement concerns, the testlet approach rests on the idea that the pool of available items can be easily subdivided into mutually exclusive subsets (testlets), an assumption that is also a disadvantage of the C-CAT approach.
The testlet approach addresses overlap concerns within a testlet because the number of items in a testlet is small. It prevents overlap across testlets through the mechanism of a manual examination of the paths through the testlet pool. If the number of paths is large, this approach becomes difficult to implement.
A distinct advantage of the testlet approach over the C-CAT approach is the facility to impose constraints on the selection of sets of items related through common stimulus material or some other common feature. A single reading comprehension passage and its associated items could be defined as a testlet, for example, as long as the items to be chosen for that passage are fixed in advance as part of the testlet construction effort. The C-CAT approach cannot be easily modified to handle this type of constraint.
Unlike prior methods of adaptive testing, the present invention is based on a mathematical model formatted as a binary programming model. All of the test specifications discussed above can be conveniently expressed mathematically as linear constraints, in the tradition of linear programming. For example, a specification such as "select at least two but no more than five geometry items" takes the form

$$2 \le x \le 5$$

where $x$ is the number of selected items having the property "geometry." Conformance to a specified frequency distribution of item difficulties takes the form of upper and lower bounds on the number of selected items falling into each specified item difficulty range.
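As a small illustration of specifications expressed as bounds, the sketch below checks a candidate selection against the geometry constraint from the text and against a difficulty distribution. The item records, the difficulty ranges, and their bounds are hypothetical.

```python
def count_with_property(selected, has_property):
    """Number of selected items for which has_property(item) is true."""
    return sum(1 for item in selected if has_property(item))

def within_bounds(count, lower, upper):
    return lower <= count <= upper

# Hypothetical selection of five items.
selected = [
    {"id": 1, "content": "geometry", "difficulty": 0.35},
    {"id": 2, "content": "geometry", "difficulty": 0.55},
    {"id": 3, "content": "algebra", "difficulty": 0.62},
    {"id": 4, "content": "arithmetic", "difficulty": 0.80},
    {"id": 5, "content": "algebra", "difficulty": 0.30},
]

# "Select at least two but no more than five geometry items."
geometry_ok = within_bounds(
    count_with_property(selected, lambda it: it["content"] == "geometry"), 2, 5)

# Assumed difficulty ranges [low, high) with (lower, upper) bounds on counts.
difficulty_bounds = {(0.0, 0.4): (1, 2), (0.4, 0.7): (1, 3), (0.7, 1.0): (0, 2)}
difficulty_ok = all(
    within_bounds(
        count_with_property(
            selected, lambda it, lo=lo, hi=hi: lo <= it["difficulty"] < hi),
        min_n, max_n)
    for (lo, hi), (min_n, max_n) in difficulty_bounds.items())
```

Both kinds of specification reduce to the same operation: count the selected items with a property and compare the count with a lower and an upper bound.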
Similarly, conformance to a target test information function takes the form of upper and lower bounds on the sum of the individual item information functions at selected ability levels. This is based on the premise that it is adequate to consider the test information function at discrete ability levels. This is a reasonable assumption given that test information functions are typically relatively smooth and that ability levels can be chosen to be arbitrarily close to each other (van der Linden, 1987).
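A minimal sketch of this idea follows: test information is evaluated as the sum of item information functions at a handful of ability levels and compared with bounds at each level. The 2PL information formula, the item parameters, and the bounds are assumptions for illustration only.

```python
import math

def item_information(theta, a, b):
    """Item information under an assumed 2PL model."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def total_information(items, theta):
    """Test information at theta: the sum of the item information functions."""
    return sum(item_information(theta, a, b) for a, b in items)

def meets_target(items, bounds):
    """bounds maps an ability level to (lower, upper) limits on test information."""
    return all(lower <= total_information(items, theta) <= upper
               for theta, (lower, upper) in bounds.items())

# Hypothetical three-item test and target band at three ability levels.
items = [(1.0, -1.0), (1.2, 0.0), (1.0, 1.0)]
bounds = {-1.0: (0.5, 1.0), 0.0: (0.6, 1.0), 1.0: (0.5, 1.0)}
print({theta: round(total_information(items, theta), 3) for theta in bounds})
print(meets_target(items, bounds))
```

Because each bound is a linear function of which items are selected, constraints of this kind fit the same framework as the content constraints.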
A typical formulation of a binary programming model has the following mathematical form. Let $i = 1, \ldots, N$ index the items in the pool, and let $x_i$ denote the decision variable that determines whether item $i$ is included in ($x_i = 1$) or excluded from ($x_i = 0$) the test. Let $j = 1, \ldots, J$ index the item properties associated with the non-psychometric constraints, let $L_j$ and $U_j$ be the lower and upper bounds (which may be equal) respectively on the number of items in the test having each property, and let $a_{ij}$ be 1 if item $i$ has property $j$ and 0 if it does not. Then the model for a test of fixed length $n$ is specified as:

$$\text{optimize } z \qquad (1)$$

subject to

$$\sum_{i=1}^{N} x_i = n \qquad (2)$$

$$\sum_{i=1}^{N} a_{ij} x_i \ge L_j, \quad j = 1, \ldots, J \qquad (3)$$

$$\sum_{i=1}^{N} a_{ij} x_i \le U_j, \quad j = 1, \ldots, J \qquad (4)$$

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, N \qquad (5)$$

Note that equation (2) fixes the test length, while equations (3) and (4) express the non-psychometric constraints as lower and upper bounds on the number of items in the test with the specified properties.
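For a very small pool, a model of this kind can be solved by plain enumeration, which makes its structure easy to see. The sketch below is not one of the MILP algorithms used in practice: it simply tries every set of n items, keeps those that satisfy the property bounds, and assumes an objective that maximizes the sum of hypothetical per-item values.

```python
from itertools import combinations

def solve_by_enumeration(values, a, lower, upper, n):
    """Exhaustively solve a tiny binary test-assembly model.

    values[i] : assumed contribution of item i to the objective z (maximized)
    a[i][j]   : 1 if item i has property j, else 0
    lower[j], upper[j] : bounds on the number of selected items with property j
    n         : fixed test length
    Returns (chosen item indices, z), or (None, None) if infeasible.
    """
    n_items, n_props = len(values), len(lower)
    best_items, best_z = None, None
    # Choosing exactly n distinct items enforces the fixed test length
    # and the 0/1 nature of the decision variables.
    for chosen in combinations(range(n_items), n):
        counts = [sum(a[i][j] for i in chosen) for j in range(n_props)]
        if all(lower[j] <= counts[j] <= upper[j] for j in range(n_props)):
            z = sum(values[i] for i in chosen)
            if best_z is None or z > best_z:
                best_items, best_z = chosen, z
    return best_items, best_z

# Hypothetical pool: items 0-2 are geometry (property 0), items 3-5 algebra (property 1).
values = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
a = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
items, z = solve_by_enumeration(values, a, lower=[1, 2], upper=[2, 3], n=3)
```

Enumeration grows combinatorially with pool size, so it is useful only as a way to check the formulation on toy data.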
The objective function, z, can take on several possible forms (see van der Linden and Boekkooi-Timminga, 1989, table 3). It typically maximizes conformance to the psychometric constraints. Examples include maximizing absolute test information; minimizing the sum of the positive deviations from the target test information; or minimizing the largest positive deviation from the target. Models that minimize the maximum deviation from an absolute or relative target are referred to as "minimax" models. The objective function can also take the form of minimizing test length, as in Theunissen (1985), or minimizing other characteristics of the test, such as administration time, frequency of item administration, and so forth. Finally, z could be a dummy variable that is simply used to cast the problem into a linear programming framework. Boekkooi-Timminga (1989) provides a thorough discussion of several of these alternatives.
If the binary programming model expressed in equations (1) through (5) is feasible (that is, has an integer solution), then it can be solved using standard mixed integer linear programming (MILP) algorithms (see, for example, Nemhauser & Wolsey, 1988). Several such models have been proposed and investigated using these methods. Considerable attention has also been devoted to methods of speeding up the MILP procedure (see, for example, Adema, 1988, and Boekkooi-Timminga, 1989).
Binary programming models, together with various procedures and heuristics for solving them, have been successful in solving many test construction problems. However, it is not always the case that the model (1) through (5) has a feasible solution. This may occur because one or more of the constraints in equation (3) or (4) is difficult or impossible to satisfy, or simply because the item pool is not sufficiently rich to satisfy all of the constraints simultaneously. In general, the binary programming model is increasingly more likely to be infeasible when the number of constraints is large because of the complexity of the interaction of constraints.
Studies reported in the literature have generally dealt with relatively small problems, with pool sizes on the order of 1000 or less and numbers of constraints typically less than 50. By contrast, we typically encounter pool sizes from 300 to 5000 or more, and numbers of constraints from 50 to 300. Moreover, many if not most of these constraints are not mutually exclusive, so that it is not possible to use them to partition the pool into mutually independent subsets. We have found that problems of this size, with this degree of constraint interaction, greatly increase the likelihood that the model (1) through (5) will not have a feasible solution.
Heuristic procedures for solving the model often resolve the feasibility problem. For example, Adema (1988) derives a relaxed linear solution by removing equation (5). Decision variables with large and small reduced costs are then set to 0 and 1, respectively, or the first integer solution arbitrarily close to the relaxed solution is accepted. Various techniques for rounding the decision variables from the relaxed solution have also been investigated (van der Linden and Boekkooi-Timminga, 1989). Heuristics such as these were designed to reduce computer time, but in many cases they will also ensure a feasible (if not optimal) solution to the binary model if there is a feasible solution to the relaxed linear model.
It is therefore an object of the present invention to provide a method of constructing adaptive tests which implements the aforementioned test specification constraints in a binary programming model which provides for automated item selection.