Automated scoring models for scoring constructed responses of test takers (examinees) are known in the art and are conventionally trained using a set of training responses to a given test together with the associated human-assigned scores. The present inventors have observed a number of potential shortcomings in the conventional training approach, however. For example, the conventional model-training process assumes that the human-assigned scores are reliable. In practice, however, such an assumption is often too optimistic, as certain features of the training responses, such as response length, may have unduly influenced the human scorers' evaluations. Consequently, a scoring model trained using the conventional process may undesirably reflect such undue bias (e.g., the scoring model may assign unduly high weight to response length). In addition, the scoring model may be more susceptible to being “gamed.” For example, an examinee who knows that a scoring model places significant weight on response length may attempt to get a better score by lengthening his or her response without adding substance. Another shortcoming is that such a model may unfairly disadvantage certain populations. For example, if essay length is heavily weighted, an Arabic examinee who is not accustomed to writing from left to right might not generate responses that are as lengthy as those generated by examinees who are so accustomed. Moreover, a scoring model trained to predict human-assigned single test scores may not be the best predictor of more robust measures of writing ability and therefore may not be the most diagnostically useful indicator of performance. Thus, the present inventors have observed a need for an improved method for generating an automated scoring model.