Standardized testing is prevalent in the United States today. Such testing is often used for higher education entrance examinations and achievement testing at the primary and secondary school levels. The prevalence of standardized testing in the United States has been further bolstered by the No Child Left Behind Act of 2001, which emphasizes nationwide test-based assessment to measure students' abilities to ensure appropriate grade placement and quality of education. However, unlike measurements that are made in the physical world, such as length and weight, measuring students' skills, knowledge and attributes that cannot be directly observed is challenging. Instead of measuring a particular skill, knowledge or attribute directly, the student must be measured based on a set of observable responses that are indicators of the skill, knowledge or attribute.
For example, if an examiner wanted to measure extraversion, it is not obvious what tool, or questionnaire, would be effective. Even if the examiner had an appropriate questionnaire, changes in repeated measurements of an individual's extraversion could be due to changes in both the construct and the error of measurement. Classical Test Theory (CTT) and Item Response Theory (IRT) provide methods for developing instruments to measure constructs such as extraversion. In addition, CTT and IRT both provide methods for obtaining an examinee's score, such as a score on a constructed extraversion scale.
The typical focus of research in the field of assessment measurement and evaluation has been on methods of IRT. A goal of IRT is to optimally order examinees along a low-dimensional space (typically, a one-dimensional scale) based on the examinees' responses and the characteristics of the test items. The ordering of examinees is done via a set of latent variables presupposed to measure ability. The item responses are generally considered to be conditionally independent of each other.
The typical IRT application uses a test to estimate an examinee's set of abilities (such as verbal ability or mathematical ability) on a continuous scale. An examinee receives a scaled score (a latent trait scaled to some easily understood metric) and/or a percentile rank. The final score (an ordering of examinees along a latent dimension) is used as the standardized measure of competency for an area-specific ability.
Although achieving a partial ordering of examinees remains an important goal in some settings of educational measurement, the practicality of such methods is questionable in common testing applications. For each examinee, the process of acquiring the knowledge that each test purports to measure seems unlikely to occur via this same low dimensional approach of broadly defined general abilities. This is, at least in part, because such testing can only assess a student's abilities generally, but cannot adequately determine whether a student has mastered a particular ability or not.
Alternatively, estimation of an examinee's “score” is not the focus in some cases. For example, a teacher may be interested in estimating students' profiles. The profile for each student specifies a set of dichotomous skills, or attributes, that a student has or has not mastered. A profile of discrete attributes provides the teacher with information about the instructional needs of groups of students (unlike multidimensional IRT which provides a profile of scores). Cognitive Diagnosis Models (CDMs) can be used when the interest of a test is to estimate students' profiles, or attribute mastery patterns, instead of providing a general estimate of ability.
Many high stakes decisions, such as admission to a school, require that examinees be ordered along several one-dimensional scales. Dichotomous decisions (e.g., accepted or not) are made based on whether an applicant's scores are higher than a determined threshold along each of the one-dimensional scales. For example, tests such as the Graduate Record Examination (GRE) provide examinees with a score from 200 to 800 for their general mathematical ability, analytical ability and verbal ability. An applicant to a school may only be accepted if he or she scores above a certain threshold (e.g., 500) on all three scales. Low stakes tests within a classroom can be used to determine how students are doing on a set of skills, or attributes, and do not necessarily require a score for each student. CDMs break down general ability into its basic elements or fine-grained attributes that make up ability.
CDMs model the probability of a correct response as a function of the attributes an examinee has mastered. If an examinee has mastered all of the attributes required for each step, it is likely that the item will be answered correctly. CDMs are used to estimate an examinee's mastery for a set of attributes given the responses to the items in a test (i.e., CDMs can be used for classification). All examinees that have mastered the same set of attributes form a class and have the same expected value on a given item. Therefore, many CDMs are a special case of latent class models where each class is defined by mastery or non-mastery of a set of attributes. In addition, CDMs can provide information about the quality of each item.
Numerous cognitive diagnosis models have been developed to attempt to estimate examinee attributes. In cognitive diagnosis models, the atomic components of ability, the specific, finely grained skills (e.g., the ability to multiply fractions, factor polynomials, etc.) that together comprise the latent space of general ability, are referred to as “attributes.” Due to the high level of specificity in defining attributes, an examinee in a dichotomous model is regarded as either a master or non-master of each attribute. The space of all attributes relevant to an examination is represented by the set {α1, . . . , αK}. Given a test with items j=1, . . . , J, the attributes necessary for each item can be represented in a matrix of size J×K. This matrix is referred to as a Q-matrix having values Q={qjk}, where qjk=1 when attribute k is required by item j and qjk=0 when attribute k is not required by item j. The Q-matrix is assumed to be known and currently there are only a few methods that can verify whether the Q-matrix is supported by the data. Also, the Q-matrix implicitly assumes that expert judges can determine the strategy used for each item and that only that strategy is used.
Since the Q-matrix should be designed such that the attribute parameters of all examinees can be estimated, if a test were to be constructed, some Q-matrices are naturally better than others. For example, the following represents two Q matrices, Q1 and Q2, for a five item test testing three attributes.
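$$
Q_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix}
\qquad
Q_2 = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
$$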
Q1 corresponds to a test where each attribute is measured at least 2 times. For example, the first item and the fourth item require mastery of attribute 1. In addition, if all items are deterministic (i.e., the probability of a correct response is either 1 or 0), all examinees' attribute patterns could be perfectly identified. The second test, represented by Q2, also measures each attribute at least twice. However, attribute 1 and attribute 2 are confounded. Specifically, even if the probability of a correct response is 1 if all the required attributes are mastered and 0 otherwise, certain attribute patterns could not be identified. Accordingly, the test corresponding to Q1 would be preferred over Q2. Thus, the quality of a test depends not only on the items' ability to separate the examinees into classes, but also on the Q-matrix, so an index used to measure the value of an item, or a method of test construction, should incorporate the Q-matrix.
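The comparison of Q1 and Q2 can be checked by enumeration. The following is a minimal Python sketch, assuming fully deterministic items (a correct response if and only if every required attribute is mastered). The function names `ideal_response` and `distinguishable_classes` are illustrative and are not part of any model described herein.

```python
from itertools import product

# Q-matrices from the example above: rows are items, columns are attributes.
Q1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
Q2 = [[0, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]]


def ideal_response(alpha, q_matrix):
    """Deterministic response vector for attribute pattern alpha.

    Item j is answered correctly (1) only if every attribute it requires
    (q_jk = 1) is mastered (alpha_k = 1).
    """
    return tuple(
        int(all(a >= q for a, q in zip(alpha, row)))
        for row in q_matrix
    )


def distinguishable_classes(q_matrix):
    """Group all 2^K attribute patterns by the response vector they produce."""
    n_attributes = len(q_matrix[0])
    groups = {}
    for alpha in product([0, 1], repeat=n_attributes):
        groups.setdefault(ideal_response(alpha, q_matrix), []).append(alpha)
    return groups


groups_q1 = distinguishable_classes(Q1)
groups_q2 = distinguishable_classes(Q2)
print(len(groups_q1))  # 8: every attribute pattern yields a unique response vector
print(len(groups_q2))  # 4: patterns differing only in attributes 1 and 2 can collapse
```

Under this assumption, Q2 cannot separate, for example, an examinee who has mastered only attribute 1 from one who has mastered only attribute 2, because no item requires one of those attributes without the other.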
Cognitive diagnosis models can be sub-divided into two classifications: compensatory models and conjunctive models. Compensatory models allow for examinees who are non-masters of one or more attributes to compensate by being masters of other attributes. An exemplary compensatory model is the common factor model. High scores on some factors can compensate for low scores on other factors.
Numerous compensatory cognitive diagnosis models have been proposed including: (1) the Linear Logistic Test Model (LLTM) which models cognitive facets of each item, but does not provide information regarding the attribute mastery of each examinee; (2) the Multicomponent Latent Trait Model (MLTM) which determines the attribute features for each examinee, but does not provide information regarding items; (3) the Multiple Strategy MLTM which can be used to estimate examinee performance for items having multiple solution strategies; and (4) the General Latent Trait Model (GLTM) which estimates characteristics of the attribute space with respect to examinees and item difficulty.
Conjunctive models, on the other hand, do not allow for compensation when critical attributes are not mastered. Such models more naturally apply to cognitive diagnosis due to the cognitive structure defined in the Q-matrix and will be considered herein. Such conjunctive cognitive diagnosis models include: (1) the DINA (deterministic inputs, noisy “AND” gate) model which requires the mastery of all attributes by the examinee for a given examination item; (2) the NIDA (noisy inputs, deterministic “AND” gate) model which decreases the probability of answering an item for each attribute that is not mastered; (3) the Disjunctive Multiple Classification Latent Class Model (DMCLCM) which models the application of non-mastered attributes to incorrectly answered items; (4) the Partially Ordered Subset Models (POSET) which include a component relating the set of Q-matrix defined attributes to the items by a response model and a component relating the Q-matrix defined attributes to a partially ordered set of knowledge states; and (5) the Unified Model which combines the Q-matrix with terms intended to capture the influence of incorrectly specified Q-matrix entries.
Another aspect of cognitive diagnostic models is the item parameters. For the DINA model, items divide the population into two classes: (i) those who have all required attributes and (ii) those who do not. Let ξij be an indicator of whether examinee i has mastered all of the required attributes for item j. Specifically,
$$
\xi_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{q_{jk}},
$$
where αi is a (K×1) 0/1 vector such that the kth element for the ith examinee, αik, indicates mastery, or non-mastery, of the kth attribute.
Given ξij, only two parameters, sj and gj, are required to model the probability of a correct response. sj represents the probability that an examinee answers an item incorrectly when, in fact, the examinee has mastered all of the required attributes (a “slipping” parameter). Conversely, gj represents the probability that an examinee answers an item correctly when, in fact, the examinee has not mastered all of the required attributes (a “guessing” parameter):
$$
s_j = P(X_{ij} = 0 \mid \xi_{ij} = 1), \qquad g_j = P(X_{ij} = 1 \mid \xi_{ij} = 0)
$$
If the jth item's parameters and ξij are known, the probability of a correct response can be written as:
$$
P(X_{ij} = 1 \mid \xi_{ij}, s_j, g_j) = (1 - s_j)^{\xi_{ij}} \, g_j^{(1 - \xi_{ij})}
$$
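The DINA response probability can be illustrated with a short Python sketch. The function names and the numeric slip and guess values below are hypothetical and chosen only for illustration.

```python
def dina_xi(alpha, q_row):
    """Indicator xi_ij: 1 if every attribute required by the item is mastered.

    Computed as the product over k of alpha_ik ** q_jk (note 0 ** 0 == 1).
    """
    xi = 1
    for a, q in zip(alpha, q_row):
        xi *= a ** q
    return xi


def dina_p_correct(alpha, q_row, slip, guess):
    """P(X_ij = 1) = (1 - s_j) ** xi_ij * g_j ** (1 - xi_ij)."""
    xi = dina_xi(alpha, q_row)
    return (1 - slip) ** xi * guess ** (1 - xi)


q_row = (1, 1, 0)  # the item requires attributes 1 and 2
print(dina_p_correct((1, 1, 0), q_row, slip=0.1, guess=0.2))  # master: 1 - slip
print(dina_p_correct((1, 0, 1), q_row, slip=0.1, guess=0.2))  # non-master: guess
```

An examinee missing one required attribute and an examinee missing all required attributes receive the same probability, gj, since both have ξij=0.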
The guess and slip parameters indicate how much information an item provides. If the slip parameter is low, an examinee who has mastered all of the required attributes is likely to correctly answer the question. If the guess parameter is low, it is unlikely that an examinee missing at least one of the required attributes correctly responds to the item. Therefore, when sj and gj are low, a correct response implies, with near certainty, that the examinee has mastered all required attributes. As the values of sj and gj increase, the item provides less information, and attribute mastery is less certain. Therefore, a measure that indicates the value of an item should be largest when both sj and gj are 0 (i.e., the item is deterministic) and should decrease as the values of sj and gj increase.
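The effect of the slip and guess parameters on certainty can be made concrete with Bayes' rule. The sketch below is illustrative only: it assumes a prior probability that the examinee has mastered all required attributes (the `prior` argument), which is not a quantity defined above, and the function name is hypothetical.

```python
def posterior_mastery_given_correct(slip, guess, prior):
    """P(xi_ij = 1 | X_ij = 1) by Bayes' rule under the DINA model.

    `prior` is an assumed P(xi_ij = 1). The result is undefined when a
    correct response has probability zero, in which case this raises
    ZeroDivisionError.
    """
    numerator = (1 - slip) * prior
    denominator = numerator + guess * (1 - prior)
    return numerator / denominator


# With an assumed prior of 0.5, the posterior falls toward the prior
# as the slip and guess parameters grow.
for slip, guess in [(0.0, 0.0), (0.1, 0.1), (0.4, 0.4)]:
    print(slip, guess, posterior_mastery_given_correct(slip, guess, prior=0.5))
```

With sj=gj=0 a correct response settles mastery completely, while larger values leave the posterior closer to the prior, which is the behavior an item-value measure would be expected to reflect.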
One concern is that the DINA model partitions the population into only two equivalence classes per item. Such a model may thus be viewed as an oversimplification since missing one attribute is equivalent to missing all required attributes. In some situations, it might be realistic to expect that an examinee lacking only one of the required attributes has a higher probability of a correct response as compared to an examinee lacking all of the required attributes. A number of models consider such a possibility, such as the NIDA model and the RUM.
The NIDA model accounts for different contributions from each attribute by defining “slipping,” sk, and “guessing,” gk, parameters for each attribute, independent of the item. The probability of a correct response is the probability that all required attributes are correctly applied. Specifically, since all slipping and guessing parameters are at the attribute level instead of the item level, a new latent variable ηijk is defined at the attribute level, such that ηijk is 1 if attribute k was correctly applied by examinee i on item j and 0 otherwise. sk and gk can thus be defined in terms of ηijk given the Q-matrix and examinee's attribute mastery as:
$$
s_k = P(\eta_{ijk} = 0 \mid \alpha_{ik} = 1, q_{jk} = 1), \qquad g_k = P(\eta_{ijk} = 1 \mid \alpha_{ik} = 0, q_{jk} = 1)
$$
As such, the probability of a correct response is equal to the probability that all required attributes are correctly applied. The NIDA model defines the probability of a correct response as:
$$
P(X_{ij} = 1 \mid \alpha_i, s, g) = \prod_{k=1}^{K} \left[ (1 - s_k)^{\alpha_{ik}} \, g_k^{1 - \alpha_{ik}} \right]^{q_{jk}}
$$
where s={s1, . . . , sK} and g={g1, . . . , gK}.
In this model, no specific item parameters are used. Since the guessing and slipping parameters for the NIDA model are for each attribute, only the Q-matrix distinguishes differences among items. Any two items that require the same attributes (i.e., the entries in the Q-matrix are identical) contribute equally to the estimation of an examinee's attribute pattern. In constructing a test, the value of a particular item then depends upon the attribute parameters and the Q-matrix. For example, if one attribute had low sk and gk, an examinee must have that attribute to correctly answer any question that requires that attribute (i.e., there is a low probability of correctly guessing the answer when the attribute is absent and a low probability of slipping if the attribute is known). Thus, a single response can provide sufficient information about the attribute's mastery. In contrast, if an attribute has high slipping and guessing parameters, the attribute should be measured by more items to ensure adequate information regarding the attribute.
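The attribute-level NIDA probability described above can be sketched in Python as follows. The function name and the parameter values are hypothetical and serve only to illustrate the product form.

```python
def nida_p_correct(alpha, q_row, slip, guess):
    """P(X_ij = 1 | alpha_i, s, g) under the attribute-level NIDA model.

    alpha, slip and guess are length-K sequences; slip[k] and guess[k]
    belong to attribute k and are shared by every item. Attributes with
    q_jk = 0 contribute a factor of 1.
    """
    p = 1.0
    for a, q, s, g in zip(alpha, q_row, slip, guess):
        p *= ((1 - s) ** a * g ** (1 - a)) ** q
    return p


slip = (0.1, 0.2, 0.3)   # illustrative attribute-level slipping parameters
guess = (0.2, 0.3, 0.4)  # illustrative attribute-level guessing parameters

# Attribute 1 is mastered and attribute 2 is not; attribute 3 is not required.
print(nida_p_correct((1, 0, 1), (1, 1, 0), slip, guess))  # (1 - 0.1) * 0.3
```

Because the parameters are indexed only by attribute, any two items with identical Q-matrix rows yield identical probabilities for a given examinee, consistent with the observation that only the Q-matrix distinguishes items in this model.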
In an alternate NIDA model, the slipping and guessing parameters are estimated separately for each item. Accordingly, the probability of a correct response for the jth item is:
$$
P(X_{ij} = 1 \mid \alpha_i, s, g) = \prod_{k=1}^{K} \left[ (1 - s_{jk})^{\alpha_{ik}} \, g_{jk}^{1 - \alpha_{ik}} \right]^{q_{jk}}
$$
In this model, items with low guessing and slipping parameters across all attributes are more informative about examinees' attribute patterns. Items having low guessing and slipping parameters better discriminate between examinees since only those examinees with all of the required attributes are likely to correctly answer the question. Moreover, those items having low guessing and slipping parameters for particular attributes provide more information about that attribute than for attributes having higher guessing and slipping parameters.
The Reparameterized Unified Model (RUM) extends the NIDA model by incorporating a continuous latent variable θi to account for any attributes not otherwise specified in the Q-matrix. This model utilizes a parameterization that eliminates a source of unidentifiability present in the NIDA model. In particular, to solve the identifiability problem the model includes a parameter that defines the probability of getting an item correct given that all required attributes have been mastered (denoted by πj*). Using the parameters of the extended NIDA model:
$$
\pi_j^* = \prod_{k=1}^{K} (1 - s_{jk})^{q_{jk}}
$$
Also, a penalty for each attribute that is not mastered for the jth item, rjk*, is defined as:
$$
r_{jk}^* = \frac{g_{jk}}{1 - s_{jk}}
$$
RUM allows for the possibility that not all required attributes have been explicitly specified in the Q-matrix by incorporating a general ability measure, Pcj(θi). Specifically, using RUM, the probability of a correct response can be written as:
$$
P(X_{ij} = 1 \mid \alpha_i, \theta_i) = \pi_j^* \prod_{k=1}^{K} r_{jk}^{*\,(1 - \alpha_{ik})\, q_{jk}} \, P_{c_j}(\theta_i),
$$
where Pcj is the logistic Rasch Model item characteristic curve with difficulty parameter cj and θi is a general measure of the ith examinee's knowledge not otherwise specified by the Q-matrix.
For each attribute not mastered, P(Xij=1|αi,θi) is reduced by a factor of rjk*. Items having high πj*'s and low rjk*'s provide the most information about examinees' attribute patterns. In addition, the rjk*'s can provide some information about the Q-matrix. Specifically, if an rjk* is close to 1, the probability of a correct response is approximately the same for those examinees who have or have not mastered the kth attribute for item j (assuming all other attributes are held constant). Thus, it is likely that the kth attribute is not required for the jth item and qjk should be set to 0. As in the NIDA models, items with low attribute-level parameters (i.e., low rjk*'s) provide more information about examinee attribute mastery than items with high rjk*'s.
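The RUM quantities can be sketched in Python starting from item-level NIDA parameters. The function names and numeric values are hypothetical, and the Rasch curve is written with cj as a difficulty parameter, following the description above.

```python
import math


def rasch(theta, c):
    """Logistic Rasch item characteristic curve with difficulty c."""
    return 1.0 / (1.0 + math.exp(-(theta - c)))


def rum_item_parameters(q_row, slip_j, guess_j):
    """Return (pi_j*, [r_jk*]) from item-level slip and guess parameters."""
    pi_star = 1.0
    r_star = []
    for q, s, g in zip(q_row, slip_j, guess_j):
        pi_star *= (1 - s) ** q       # only required attributes contribute
        r_star.append(g / (1 - s))    # penalty for not mastering attribute k
    return pi_star, r_star


def rum_p_correct(alpha, theta, q_row, pi_star, r_star, c):
    """P(X_ij = 1 | alpha_i, theta_i) under the RUM."""
    p = pi_star
    for a, q, r in zip(alpha, q_row, r_star):
        p *= r ** ((1 - a) * q)       # penalty applies only if required and not mastered
    return p * rasch(theta, c)


q_row = (1, 1, 0)
pi_star, r_star = rum_item_parameters(
    q_row, slip_j=(0.1, 0.2, 0.3), guess_j=(0.3, 0.4, 0.5)
)
p_master = rum_p_correct((1, 1, 1), 0.5, q_row, pi_star, r_star, c=0.0)
p_missing_2 = rum_p_correct((1, 0, 1), 0.5, q_row, pi_star, r_star, c=0.0)
print(p_missing_2 / p_master)  # equals r_star[1], the penalty for attribute 2
```

The ratio of the two probabilities isolates a single rjk*, which is why an rjk* near 1 suggests that the corresponding Q-matrix entry may be unnecessary.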
The NIDA and RUM models assume a discrete latent space characterized by mastery or non-mastery of K attributes. However, some conjunctive models assume a latent space defined by K continuous attributes. For example, the MLTM, using the Rasch model, assumes that performance on a particular item requires K attributes where k={1, . . . , K}. Given an examinee's ability, the probability that the kth attribute is completed correctly equals the probability as defined by the Rasch model:
$$
P_{jk}(\theta_i) = \frac{e^{(\theta_i - b_{jk})}}{1 + e^{(\theta_i - b_{jk})}},
$$
where bjk is the difficulty parameter representing the difficulty of correctly applying the kth task for the jth item.
The model also assumes that, given θ, all tasks are independent, so the probability of correctly answering an item is:
$$
P(x_{ij} = 1 \mid \theta_i, b_{ij}) = (s_j - g_j) \prod_{k=1}^{K} \frac{e^{(\theta_{ik} - b_{jk})}}{1 + e^{(\theta_{ik} - b_{jk})}} + g_j,
$$
where gj is the probability an examinee guesses the correct response for item j and sj is the probability an examinee correctly applies the tasks for item j.
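The MLTM item response probability can be sketched in Python as a product of Rasch terms rescaled between gj and sj. The function names and numeric values are hypothetical and serve only as an illustration.

```python
import math


def rasch(theta, b):
    """Rasch probability that a task of difficulty b is completed correctly."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))


def mltm_p_correct(theta_i, b_j, s_j, g_j):
    """P(x_ij = 1) = (s_j - g_j) * prod_k Rasch(theta_ik, b_jk) + g_j.

    theta_i holds the examinee's K continuous abilities and b_j holds the
    K task difficulties for item j.
    """
    all_tasks_correct = 1.0
    for theta_ik, b_jk in zip(theta_i, b_j):
        all_tasks_correct *= rasch(theta_ik, b_jk)
    return (s_j - g_j) * all_tasks_correct + g_j


# Two tasks, each with a 0.5 chance of being completed correctly.
print(mltm_p_correct((0.0, 0.0), (0.0, 0.0), s_j=0.9, g_j=0.1))
```

In this form the probability approaches gj when the abilities are far below the task difficulties and approaches sj when they are far above, so the product term acts conjunctively across tasks.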
While CDMs can be useful in the analysis and interpretation of existing tests, specifying how to construct an adequate test using CDMs has been largely ignored.
What is needed is a method and system for developing tests incorporating an index for measuring how informative each item is for the classification of examinees.
A need exists for such a method and system in which indices are specific to each attribute for each item.
A further need exists for a method and system of developing a test in which the indices are used to select items for inclusion in the test based on the indices.
The present disclosure is directed to solving one or more of the above-listed problems.