It is well accepted that the assessment of constructed (open) response items, popularly known as subjective evaluation, provides a much more holistic and accurate assessment of a candidate's skills than selected response items (multiple-choice questions). The primary limitation of a selected response item is that it asks the candidate only to choose the right answer, which provides implicit hints and reveals the structure of the solution. With the recent interest in MOOCs (Massive Open Online Courses), scalable education/training, and automated recruitment assessment, interest in automating the assessment of constructed responses has increased manifold.
There are many examples of machine learning being used successfully for constructed response grading. However, the machine learning framework falls short in two respects. Firstly, it does not provide accurate assessment for a number of problems; for instance, automated assessment of free speech for spoken language skills largely remains an unsolved problem. Secondly, these automated approaches have come under criticism because test-takers can fake high-scoring responses; it has been shown that automatic essay grading algorithms can be tricked by inserting the right words in random order or by writing long essays. One of the key limitations of the current techniques is the inability to automatically derive, with high precision, the right set of features for assessing the response.
In some prior art techniques, the crowd or peers directly grade the responses from candidates against a rubric, and a combination of their grades mimics the grades given by experts. Firstly, these crowd-based approaches do not work for evaluating expert tasks, say a computer program or an advanced electronics question, which the crowd cannot grade with any precision. Secondly, though useful for low-stakes scenarios, these techniques remain suspect with regard to crowd reliability and drift in mid-stakes and high-stakes assessments.