This invention relates to methods for evaluating standardized test administrations to a population of subjects to determine whether a test was properly administered to a particular sub-group, such as a class, for example. The members of such a sub-group will most often be defined by the test administrator or proctor for the sub-group, but may also be defined by any other element common to the subjects of the sub-group and affecting test administration. The sub-group may result from a single, group-administered standardized test session or cumulatively from many single- or group-administered test sessions of the same standardized test. While the focus of the specification that follows is most often on standardized test administrations to students in school for the purpose of evaluating the quality of the test administration procedures employed by the test administrator (teacher or proctor), the focus may also be, for example, on the conditions of the classroom, or the effects of certain extra-curricular activities or other school programs.
With the increasing use of standardized tests, particularly in primary and secondary education, it has become increasingly important to monitor the manner in which standardized tests are administered. In particular, if a standardized test is administered in a non-standard way, the results may not properly indicate the abilities of the individuals taking the test. For example, if the test administrator does not settle a class properly, rushes a class near the end of a test, improperly encourages guessing near the end of a test, improperly suggests answers, or in other ways helps individuals improperly, the validity of the standardized test is jeopardized.
This problem has been recognized by Gregory J. Cizek in a recently published book entitled, Cheating On Tests: How To Do It, Detect It, And Prevent It (Lawrence Erlbaum Associates, Mahwah, N.J., 1999), particularly in the discussion at pages 62-69. Cizek goes on to discuss several statistical methods for detecting cheating by individual students, not misadministration of a test for an entire class.
One statistical approach to the detection of misadministration of tests is that provided by the Wonderlic ATB Quarterly Report. Each school submits information identifying each applicant who is tested, his or her total test score (number of correct answers), the number of the last question attempted by the applicant, and the program of training the applicant has applied for. This information is then used to provide a comprehensive listing of tested students and an analysis of the potential for problems in the test administration.
The Wonderlic analysis of potential problems relies on two features of the applicants' test scores:
1. The distribution of total test scores among all applicants tested at the school for each specific program of training should be Gaussian if the tests are properly administered. Based on the number of applicants who obtain each possible test score, gaps in the distribution or unusually high concentrations of scores are taken as indications of misadministered tests.
2. The relationship between each applicant's total number of correct answers and the number of questions attempted (last question answered) is assessed. Generally, applicants with low test scores will attempt relatively fewer questions, while applicants with higher scores will attempt more questions. When low-scoring applicants attempt a high number of questions, it may be that the time limit for the test was not observed. When an applicant achieves a relatively high score but answers relatively few questions, the high accuracy rate is suspect, and may be an indication that the applicant received inappropriate help.
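By way of illustration only, the two checks described above might be sketched as follows. This is not the Wonderlic procedure itself, whose exact rules are not given here. The function names, the record layout, and every threshold (spike_share, low_score, high_score, few_attempts, many_attempts) are hypothetical and are chosen solely to make the reasoning concrete.

```python
from collections import Counter


def score_histogram_flags(scores, spike_share=0.25):
    """Check 1 (illustrative): look for gaps and concentrations in total scores.

    scores      -- list of total test scores (number of correct answers)
    spike_share -- hypothetical threshold: a score held by more than this
                   fraction of applicants is reported as a concentration
    Returns (gaps, spikes), both as sorted lists of score values.
    """
    counts = Counter(scores)
    lowest, highest = min(scores), max(scores)
    # A gap is a score between the observed extremes that no applicant obtained.
    gaps = [s for s in range(lowest, highest + 1) if counts[s] == 0]
    # A spike is a score obtained by an unusually large share of applicants.
    spikes = [s for s, n in sorted(counts.items()) if n / len(scores) > spike_share]
    return gaps, spikes


def attempt_flags(records, low_score, high_score, few_attempts, many_attempts):
    """Check 2 (illustrative): compare correct answers with questions attempted.

    records -- iterable of (applicant_id, correct, last_attempted) tuples
    The four cut-off arguments are hypothetical; the text does not state
    what values would be used in practice.
    Returns a dict mapping applicant_id to a short reason string.
    """
    flags = {}
    for applicant_id, correct, last_attempted in records:
        if correct <= low_score and last_attempted >= many_attempts:
            # Low score but many questions attempted.
            flags[applicant_id] = "time limit possibly not observed"
        elif correct >= high_score and last_attempted <= few_attempts:
            # High score achieved on few questions attempted.
            flags[applicant_id] = "possible inappropriate help"
    return flags


if __name__ == "__main__":
    print(score_histogram_flags([10, 10, 10, 10, 12, 13, 13, 15]))
    sample = [("a", 10, 48), ("b", 30, 31), ("c", 22, 35)]
    print(attempt_flags(sample, low_score=12, high_score=28,
                        few_attempts=32, many_attempts=45))
```

In this sketch a flagged score or applicant is only a prompt for closer review, consistent with the description above of these features as indications rather than proof of misadministration.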
In the Wonderlic methods, the evaluation by standard deviation is primarily limited to larger groups of subjects (50 or more) while the evaluation based on the number of questions attempted is limited to tests that are time restricted, where most subjects fail to complete the test. For evaluation among school classrooms, a method is required for use with smaller groups (30 or fewer), for tests that may be completed by all students, and that is sensitive to multiple methods of improper test administration.
In spite of the approaches discussed above, a need presently exists for an improved method and system for assessing whether or not a standardized test was administered properly.