In a survey coding process, a large number of responses are collected in response to open-ended questions asked during a survey conducted in a particular domain. Since the questions are generally open-ended, the responses received are also open-ended in nature. The responses, provided by respondents, tend to be extempore and given in a free style, and hence do not conform to standard rules of language. A respondent may be any person responding to the open-ended questions. The responses to the open-ended questions may vary widely from one respondent to another. The primary reason for this variation is the difference in language usage and writing style among the persons responding. These responses (i.e., answers to the open-ended questions) are received in different formats such as hand-written papers, scanned copies, images, videos, and the like, and may be normalized into electronic text data for analysis. The electronic text data thus obtained is considered to be unstructured in nature. To make sense of such unstructured electronic text data, survey coding is performed.
In the survey coding, a set of predefined tags, labels, or codes may be provided for tagging the electronic text data (hereinafter, the electronic text data is referred to as ‘text snippets’). The tagging is performed to classify these text snippets into a form that computer software can process for analysis. Human coders are used for tagging the text snippets, which makes the survey coding subjective. The subjectivity arises from the varying levels of domain knowledge, language skills, and experience of the human coders, as well as from inherent ambiguity in a text snippet or ambiguity due to a large number of tags. For example, two different human coders may tag the same text snippet differently. It has also been observed that, in some cases, even the same human coder assigns different tags to the same text snippet at different times, depending on training, domain understanding, and external factors such as pressure to complete the survey coding process under a tight schedule. Thus, maintaining uniformity while assigning the tags to the text snippets becomes a challenging task in the survey coding.
Another solution present in the art for automating the survey coding process is based on text classification using supervised machine learning techniques. However, in such supervised machine learning techniques, the availability of labeled training data before the survey coding process starts is a concern. In many cases, such labeled training data is not readily available and has to be created manually. Further, the cost and effort required for creating such labeled training data outweigh the benefits of using supervised learning techniques. Moreover, the system may have to refer to domain-specific labeled training data each time, and update the supervised learning model, while assigning the tags to different sets of electronic text data. The dependency on such labeled training data results in an increase in the computing time of the system during the survey coding. Thus, the requirement of such labeled training data in the supervised learning approach is another challenge in the survey coding process.