For many years, standardized tests have been administered to examinees for various purposes, such as educational testing or the evaluation of particular skills. For instance, academic skills tests (e.g., SATs, LSATs, and GMATs) are typically administered to a large number of students. Results of these tests are used by colleges, universities, and other educational institutions as a factor in determining whether an examinee should be admitted to study at that educational institution. Other standardized testing is carried out to determine whether or not an individual has attained a specified level of knowledge, or mastery, of a given subject. Such testing is referred to as mastery testing (e.g., achievement tests offered to students in a variety of subjects, the results of which are used for college entrance decisions).
FIG. 1 depicts a sample question and sample directions that might be given on a standardized test. The stem 4, the stimulus 5, responses 6, and directions 7 for responding to the stem 4 are collectively referred to as an item. The stem 4 refers to a test question or statement to which an examinee (i.e., the individual to whom the standardized test is being administered) is to respond. The stimulus 5 is the text and/or graphical information (e.g., a map, scale, graph, or reading passage) to which a stem 4 may refer. Often the same stimulus 5 is used with more than one stem 4. Some items do not have a stimulus 5. Items having a common stimulus 5 are defined as a set.
Items sharing common directions 7 are defined as a group. Thus, questions 8-14 in FIG. 1 are part of the same group.
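By way of illustration only, the terms defined above (item, set, and group) might be modeled as a simple data structure. The following is a minimal sketch and not a description of any existing system. The class name `Item`, the helper functions, the placeholder stem text, the five response choices, and the assumption that questions 12 and 13 share a stimulus are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Item:
    """One test item: stem, responses, directions, and an optional stimulus."""
    number: int
    stem: str
    responses: Tuple[str, ...]
    directions: str
    stimulus: Optional[str] = None  # some items have no stimulus


def sets_by_stimulus(items: List[Item]) -> Dict[str, List[int]]:
    """Items having a common stimulus form a set."""
    sets = defaultdict(list)
    for item in items:
        if item.stimulus is not None:
            sets[item.stimulus].append(item.number)
    return dict(sets)


def groups_by_directions(items: List[Item]) -> Dict[str, List[int]]:
    """Items sharing common directions form a group."""
    groups = defaultdict(list)
    for item in items:
        groups[item.directions].append(item.number)
    return dict(groups)


# Hypothetical stand-ins for questions 8-14, which share the same directions.
ITEMS = [
    Item(
        number=n,
        stem=f"placeholder stem {n}",
        responses=("A", "B", "C", "D", "E"),
        directions="shared directions",
        stimulus="map" if n in (12, 13) else None,
    )
    for n in range(8, 15)
]
```

Under these assumptions, `groups_by_directions(ITEMS)` places all seven questions in a single group, while `sets_by_stimulus(ITEMS)` places only the two questions that refer to the same stimulus in a set.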
A typical standardized answer sheet for a multiple choice exam is shown in FIG. 2. The examinee is required to select one of the responses according to the directions provided with each item and fill in the appropriate circle on the answer sheet. For instance, the correct answer to the question 13 stated by stem 4 is choice (B) of the responses 6. Thus, the examinee's correct response to question 13 is to fill in the circle 8 corresponding to choice (B) as shown in FIG. 2.
Standardized tests with answer sheets as shown in FIG. 2 can be scored by automated scoring systems quickly, efficiently, and accurately. Since an examinee's response to each item is represented on an answer sheet simply as a filled-in circle, a computer can easily be programmed to scan the answer sheet and to determine the examinee's response to each item. Further, since there is one, and only one, correct response to each item, the correct responses can be stored in a computer database and the computer can be programmed to compare the examinee's response against the correct response for each item, determine the examinee's score for each item, and, after all items have been scored, determine the examinee's overall score for the test.
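The comparison described above can be sketched in a few lines. This is an illustrative sketch only: the function name `score_answer_sheet`, the one-point-per-item scoring, and all answer key entries other than choice (B) for question 13 are assumptions made for the example.

```python
from typing import Dict, Tuple


def score_answer_sheet(
    responses: Dict[int, str], answer_key: Dict[int, str]
) -> Tuple[Dict[int, int], int]:
    """Compare each scanned response to the stored correct response.

    Returns the per-item scores (1 for a match, 0 otherwise) and the
    overall test score, taken here to be the sum of the item scores.
    """
    item_scores = {
        question: 1 if responses.get(question) == correct else 0
        for question, correct in answer_key.items()
    }
    return item_scores, sum(item_scores.values())


# Only the key for question 13 (choice B) comes from the example above;
# the other entries are hypothetical.
ANSWER_KEY = {12: "D", 13: "B", 14: "A"}
SCANNED = {12: "D", 13: "B", 14: "C"}

ITEM_SCORES, TOTAL = score_answer_sheet(SCANNED, ANSWER_KEY)
```

An unanswered item is simply absent from the scanned responses and therefore scores zero under this sketch.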
In recent years, the demand for more sophisticated test items has forced test administrators to move away from standardized tests with strictly multiple choice responses and paper answer sheets. Architectural skills, for instance, cannot be examined adequately using a strictly multiple choice testing format. Test administrators have determined that examining such skills adequately requires standardized tests that challenge the examinee to draft a representative architectural drawing in response to a test question. Such a response might, for example, be developed on a computer-aided design (CAD) facility.
Such tests have frustrated the ability of computers to efficiently and accurately score examinees' responses. While an architectural drawing, for example, may contain some objective elements, its overall value as a response to a particular test question is measured to some degree subjectively. Thus, a computer can no longer simply scan in an examinee's responses and compare them to known responses in a database.
Initially, these tests were scored by human test evaluators who viewed the examinee's responses as a whole and scored the responses on a mostly subjective basis. This approach is both time-consuming and subjective. Thus, two examinees could submit exactly the same response to a particular item and still receive different scores depending on which test evaluator scored the response. A particular test evaluator might even assess different scores at different times for the same response.
Recently, computer systems have been developed that evaluate the examinee's responses more quickly, efficiently, and objectively. These systems use scoring engines programmed to identify certain features expected to be contained in a correct response. The various features are weighted according to their relative importance in the response. For example, one element of a model response to a particular item in an architectural aptitude test might be a vertical beam from four to six feet in length. The scoring engine for that item will determine whether the beam is in the examinee's response at all (one feature) and, if it is, whether it is vertical (a second feature) and whether it is between four and six feet in length (a third feature). If the beam is not in the response at all, the scoring engine might be programmed to give the examinee no credit at all for the response to that item. A feature such as this which is so critical to the response that the absence of the feature would be deemed a fatal error in the response is referred to as a fatal feature. If, for example, the beam is present and vertical, but is less than four feet long, the scoring engine might be programmed to give the examinee full credit for the existence of the beam, full credit for the fact that the beam is vertical, but no credit for the fact that the beam is less than four feet long. Since the length of the beam is deemed not to be critical to the response in this example, the examinee still receives partial credit for the response to the item. Such a feature is referred to as a non-fatal feature. Thus, the scoring engine determines the existence of all of the features expected in the response for a given item, assesses a score for each feature present, and then adds up the weighted feature scores to determine the item score. When all the items for a particular test for a given examinee have been scored accordingly, the system assesses an overall test score.
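The feature-based scoring described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular scoring engine: the `Feature` class, the dictionary representation of a response, and the weights of 0.5, 0.25, and 0.25 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Feature:
    name: str
    weight: float                   # relative importance (hypothetical values)
    fatal: bool                     # absence forfeits all credit for the item
    check: Callable[[dict], bool]   # True if the feature is present


def _beam(response: dict):
    return response.get("beam")


# The three features of the beam example.
BEAM_FEATURES = [
    Feature("beam present", 0.5, True,
            lambda r: _beam(r) is not None),
    Feature("beam vertical", 0.25, False,
            lambda r: _beam(r) is not None and _beam(r)["vertical"]),
    Feature("beam length 4-6 ft", 0.25, False,
            lambda r: _beam(r) is not None
            and 4.0 <= _beam(r)["length_ft"] <= 6.0),
]


def score_item(response: dict, features: List[Feature]) -> float:
    """Sum the weights of the features present; a missing fatal feature
    yields no credit for the item."""
    total = 0.0
    for feature in features:
        present = feature.check(response)
        if feature.fatal and not present:
            return 0.0
        if present:
            total += feature.weight
    return total
```

With these assumed weights, a response containing a vertical beam that is three and one half feet long earns partial credit (0.75), a response with no beam earns no credit, and a response with a vertical five-foot beam earns full credit (1.0).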
Separately, a human test evaluator can score an examinee's response(s) to a particular item, or to a group of items, or to a whole test. Once the computer has finished scoring the test, a test evaluator may then compare the computer-generated score to the score assessed by the test evaluator. If the test evaluator disagrees with the computer-generated score for a particular item, the test evaluator is forced to change the score for that test manually.
Thus, one problem with the current computer-based scoring systems is that these systems are batch systems and provide no mechanism for a test evaluator to change the computer-generated score online (i.e., to interact with the computer to change the score of a particular item as soon as the computer has scored that item rather than having to wait for the computer to score all the items of a test).
Additionally, a test evaluator might determine that the scoring rubric for an item is flawed and that the scoring engine that applies the flawed rubric needs to be changed. Scoring engines are currently changed in one of two ways, depending on the complexity of the change required. If the test evaluator wishes to change only the value of one or more criteria (e.g., the beam in the above architectural test example should be from five to six feet long instead of from four to six feet long), then the change can be effected by changing the criterion in a file called by the scoring engine. If, however, the change is more complex (e.g., the algorithm used for a complex calculation might be changed, a new feature might be added, or it might be determined that the material out of which the beam is made is more important than the length of the beam), then a change must be made to the scoring engine's computer program itself, which usually requires a computer programmer. Since, in general, test evaluators are not trained computer programmers, the test evaluator is forced to turn over the proposed changes to a computer programmer, wait until the programmer changes the scoring engine, and then score the item(s) of interest again. This process is time-consuming and labor-intensive, and requires coordination among several individuals.
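The simpler of the two kinds of change can be illustrated with a sketch in which the scoring code reads its criteria from a separate file. The JSON layout, the key `beam_length_ft`, and the function names are assumptions made for this example; the format of the criteria file actually used by existing scoring engines is not specified here. The file contents are shown as strings so that the sketch is self-contained.

```python
import json

# Hypothetical contents of the criteria file before and after the change.
CRITERIA_ORIGINAL = '{"beam_length_ft": {"min": 4.0, "max": 6.0}}'
CRITERIA_REVISED = '{"beam_length_ft": {"min": 5.0, "max": 6.0}}'


def load_criteria(text: str) -> dict:
    """Parse the criteria that the scoring engine reads at run time."""
    return json.loads(text)


def beam_length_ok(length_ft: float, criteria: dict) -> bool:
    """Check the beam length against whatever bounds the file supplies."""
    bounds = criteria["beam_length_ft"]
    return bounds["min"] <= length_ft <= bounds["max"]
```

A beam four and one half feet long satisfies the original criteria but not the revised criteria, and no change to `beam_length_ok` is needed. By contrast, adding a new feature such as the beam material would require new checking code, which is the kind of change that calls for a programmer.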
Thus, a need exists for a system and method for interactive scoring of standardized test responses. Such a system and method must enable the test evaluator to change, online, the score of a particular item or of multiple items in a test, the score of one or more features of a particular item, or the overall score of a test. Such a system must then use the test evaluator's score(s) to determine the overall test score for the examinee. Such a system must also enable the test evaluator to change the scoring engine online, and then use the changed scoring engine both to rescore the item currently being evaluated and for all subsequent scoring of that item in other tests or for other examinees.
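One way to picture the first of these requirements is a rule in which any score entered by the test evaluator takes precedence over the corresponding computer-generated score when the overall score is determined. The sketch below is purely illustrative and does not describe a specific system: the function name, the use of a simple sum for the overall score, and the sample scores are all assumptions.

```python
from typing import Dict


def overall_score(
    engine_scores: Dict[int, float], evaluator_scores: Dict[int, float]
) -> float:
    """Combine item scores, letting evaluator scores replace engine scores."""
    unknown = set(evaluator_scores) - set(engine_scores)
    if unknown:
        # An override must refer to an item that was actually scored.
        raise KeyError(f"no such item(s): {sorted(unknown)}")
    final = dict(engine_scores)
    final.update(evaluator_scores)  # evaluator's score wins where present
    return sum(final.values())


# Hypothetical scores: the evaluator raises item 2 from 0.5 to 1.0.
ENGINE = {1: 1.0, 2: 0.5, 3: 0.0}
OVERRIDES = {2: 1.0}
```

Under these assumptions the overall score rises from 1.5 to 2.0 once the evaluator's score for the second item is applied.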
Although there are various computer-based scoring systems in use, to the inventors' knowledge there is no software system designed specifically to encourage users to monitor and modify the scoring of disparate test items.