Response generation systems, also known as dialog systems or conversational agents, are becoming increasingly common across a variety of systems and devices. Response generation systems include applications and computer systems designed to interpret natural language input messages and output natural language responses. However, these systems frequently output low-quality responses that are not relevant or appropriate to the conversation.
Although machine translation, which may also be referred to as automated language translation, is commonly evaluated using automatic metrics, there are currently no methods or metrics for automatically judging the quality of responses generated by human-machine conversational systems. Without a metric for assessing the quality of a machine-generated response, response generation systems cannot be automatically optimized to improve the quality of the machine-generated responses.
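For context, the automatic metrics used in machine translation evaluation typically compare a machine-generated sentence against a human-written reference by counting overlapping n-grams, in the style of BLEU. The minimal sketch below illustrates the idea with a single n-gram precision score; it is a simplified illustration, not the method of any particular system, and the example sentences are hypothetical.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n])
                          for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n])
                         for i in range(len(ref_tokens) - n + 1))
    if not cand_ngrams:
        return 0.0
    # Clip each n-gram's count by its count in the reference (BLEU-style).
    overlap = sum(min(count, ref_ngrams[ng])
                  for ng, count in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

# Hypothetical example: score a generated response against a reference.
reference = "the package will arrive on monday"
candidate = "your package will arrive on monday"
score = ngram_precision(candidate, reference, 1)  # 5 of 6 unigrams match
```

A metric of this kind works well when a single reference answer exists, which is part of why it has not transferred directly to open-ended conversational responses, where many different replies can be equally appropriate.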
To improve response quality, a human user is instead required to manually review and assess the quality of each machine-generated response and manually adjust the response generation system in an attempt to improve it. However, manual human evaluation may be prohibitively expensive, and its results may be inconsistent across evaluators. Furthermore, manual assessment and tuning cannot scale with production-scale response generation systems, which may have hundreds or thousands of parameters to adjust for optimization. This manual review process is also time consuming, tedious, and inefficient.
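The scaling problem described above disappears once an automatic quality metric exists: parameter adjustment can then be driven by a search procedure rather than by manual review. The sketch below shows this with a simple exhaustive search over candidate values of one parameter; the `generate` and `response_quality` callables are hypothetical stand-ins for a response generation system and a quality metric, not part of any system described here.

```python
def tune_parameter(generate, response_quality, validation_set, candidates):
    """Pick the parameter value whose responses score best on average.

    generate(message, value) -> response   (hypothetical system under test)
    response_quality(response, reference) -> score  (hypothetical metric)
    validation_set: list of (message, reference_response) pairs
    candidates: parameter values to try
    """
    best_value, best_score = None, float("-inf")
    for value in candidates:
        scores = [response_quality(generate(msg, value), ref)
                  for msg, ref in validation_set]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_value, best_score = value, avg
    return best_value, best_score
```

With hundreds or thousands of parameters, the exhaustive loop would be replaced by a gradient-based or other large-scale optimizer, but the principle is the same: an automatic metric is the prerequisite that makes any such optimization possible.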