Automated examination evaluation tools have been ineffective at accurately capturing the full range of construct evidence in text responses (i.e., information contained in the response that signals the examinee's ability level in the assessment construct of interest). Many automated examination evaluation systems analyze small groups of features or patterns recognized within the text of an examination response as a proxy for an examinee's overall performance on the examination. For example, automated examination evaluation software may search for and detect specific expected responses in the form of words or phrases and add points for each positive response, and subtract points for each negative (e.g., expected, but incorrect) response. Automated examination evaluation software may also search for and detect other proxy features, such as numbers of words, numbers of characters per word, numbers of words per sentence, or other reductive variables.
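For illustration only, the kind of reductive scoring described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular product: the function names `keyword_score` and `proxy_features`, the one-point weights, and the sample phrase lists are all hypothetical.

```python
import re


def keyword_score(text, expected, incorrect):
    """Add one point per expected phrase found, subtract one per incorrect phrase.

    The phrase lists and the +1/-1 weights are hypothetical illustrations.
    Matching is a case-insensitive substring test.
    """
    lowered = text.lower()
    score = 0
    for phrase in expected:
        if phrase.lower() in lowered:
            score += 1
    for phrase in incorrect:
        if phrase.lower() in lowered:
            score -= 1
    return score


def proxy_features(text):
    """Compute simple surface features used as proxies for performance."""
    # Words are runs of letters and apostrophes; sentences are split on . ! ?
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "num_words": n_words,
        "chars_per_word": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "words_per_sentence": n_words / len(sentences) if sentences else 0.0,
    }


if __name__ == "__main__":
    response = "Plants use sunlight to make glucose. The sun orbits the earth."
    print(keyword_score(response, ["sunlight", "glucose"], ["sun orbits the earth"]))
    print(proxy_features(response))
```

Neither function considers the meaning or organization of the response; each reduces the text to a count or an average, which is why such features serve only as proxies for the examinee's ability.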
These types of evaluation tools do not capture the full richness of evidence markers that a human grader would process, and thus provide an incomplete representation of the ability markers observed in high-performing responses. Such tools often predict scores with reasonable accuracy for a majority of responses, but perform noticeably poorly for students at specific (usually extreme) ability levels (e.g., the highest- or lowest-performing students). As a result, existing automated scoring tools are unreliable at higher score ranges, are inequitable to examinees, and are difficult to defend in terms of construct relevance. These types of evaluation systems are also easy to “game” by, for example, writing essays with many multi-syllable words. Moreover, many automated examination evaluation systems do not accommodate a large number of feature inputs and do not effectively combine features of substantially different characteristics.