A lot of important information is captured as text data in the Software Development Life Cycle (SDLC). Software defect management is a vital part of the maintenance and evolution phases of the SDLC. During the testing phase, as well as during real-life usage of software, many defects associated with various aspects of the software are reported. Classifying these defects according to a suitable defect classification scheme, such as Orthogonal Defect Classification (ODC), IEEE Standard 1044, and the like, helps to streamline the defect management process and yields multiple benefits, such as identifying patterns in the defect reports, faster root cause analysis, and so on.
The textual description in a software defect (e.g., software bug) report is very important for understanding the defect and for subsequently classifying it under a given classification scheme. Automatic identification of the defect type from the textual defect description can significantly reduce defect analysis time and improve the overall defect management process. This has been recognized in the software repository mining research community, and multiple solutions have been proposed over the past decade.
Standard data-driven approaches to software defect type classification, such as supervised machine learning, need a significant amount of labeled training data to build a predictive model. Existing approaches for software defect text categorization are based on supervised or semi-supervised machine learning. In the supervised learning approach, a significant amount of labeled training data is needed for each class in order to train the classifier model. This labeled training data consists of a large number of defects that have been manually annotated and validated for defect type according to the applicable classification scheme, typically by humans with domain knowledge and expertise. Generating it is therefore effort-intensive and expensive, and makes inefficient use of the available expertise and resources. The research community is aware of this challenge and has proposed active learning and semi-supervised learning for software defect classification, which aim to reduce the amount of labeled training data required and, in turn, the human annotation effort. Even though these approaches improve upon the basic supervised learning approach, they still need a non-trivial amount of human effort to produce the labeled training data necessary for software defect classification. Additionally, these and other conventional techniques also use features derived from source code, obtained by pre-processing the code that fixes the bug.
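To make the labeling burden of the conventional supervised approach concrete, the following is a minimal sketch of a bag-of-words Naive Bayes classifier over defect descriptions. It is an illustration only, not any particular published method: the function names (`tokenize`, `train`, `predict`), the example defect reports, and their labels are hypothetical, and the three class names are merely borrowed from the ODC defect-type vocabulary.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the description and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_reports):
    """Count class and per-class word frequencies from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_reports:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest Laplace-smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen during training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-labeled training data: every row needed a human annotator.
TRAINING = [
    ("race condition when two threads update the cache", "Timing/Serialization"),
    ("deadlock while acquiring lock during concurrent writes", "Timing/Serialization"),
    ("variable initialized with wrong default value", "Assignment"),
    ("counter assigned incorrect initial value on startup", "Assignment"),
    ("missing null check before dereferencing pointer", "Checking"),
    ("input validation missing for negative numbers", "Checking"),
]

model = train(TRAINING)
print(predict(model, "deadlock between threads during concurrent update"))
```

Even this toy model can only predict classes for which annotated examples were supplied, and a usable classifier would need far more than two examples per class, which is the annotation cost described above.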