The invention generally relates to duplicate bug report detection, and more particularly, to a method and system for duplicate bug report detection including detection of dissimilar duplicate bug reports.
Generally, defect reporting, also referred to as bug reporting, is an integral part of the software development, testing and maintenance process. Typically, bugs are reported to an issue tracking system, where each report is analyzed by a person who has knowledge of the system, the project and the developers. This person performs activities such as a quality check to ensure that the report contains all the useful and required information, duplicate bug detection, routing the report to the appropriate expert for correction, and editing various project-specific metadata and properties associated with the report (such as current status, assigned developer, severity level and expected time to closure). It has been observed that a bug report submitted by a tester or end user is often a duplicate. A bug report is said to be a duplicate of an existing bug report if the two describe the same issue or problem and therefore require the same solution. Studies show that duplicate bug reports can account for as much as 25-30% of all reports.
Duplicate bug reports can be classified into two types. The first type is the similar duplicate bug report, which describes the same problem as an existing report using similar vocabulary. The second type is the dissimilar duplicate bug report, which describes a different problem but shares the same underlying cause. Current technology in the area of duplicate bug report detection uses Natural Language Processing and Information Retrieval techniques to identify bug reports with similar vocabulary. Techniques such as synonym replacement and semantic matching using WordNet also exist to detect certain duplicate bug reports that use different vocabulary.
However, the existing techniques can only detect duplicate bug reports with similar text and cannot detect dissimilar duplicate bug reports, as these do not share common words. Synonym replacement techniques perform reasonably well only when two bug reports describe the same problem using different words, and fail entirely in the case of dissimilar duplicate bug reports. This is because, although the underlying cause of the two reports may be the same, they describe separate problems, so their vocabularies are completely different. There is no system in which both types of duplicates can be detected at once in a real-time scenario.
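By way of illustration only, the limitation described above can be sketched with a simple bag-of-words cosine similarity, which stands in here for vocabulary-based matching in general and is not any particular existing system. The three bug report texts, and the function names `tokenize` and `cosine_similarity`, are hypothetical and chosen solely for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(report_a, report_b):
    """Cosine similarity between the word-count vectors of two reports."""
    counts_a = Counter(tokenize(report_a))
    counts_b = Counter(tokenize(report_b))
    # Only words present in both reports contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reports, assumed to share one underlying cause.
existing = "Application crashes when saving a large file"
similar_duplicate = "Crashes when saving a large file in the application"
dissimilar_duplicate = "Export button stays greyed out after disk quota warning"

sim_similar = cosine_similarity(existing, similar_duplicate)
sim_dissimilar = cosine_similarity(existing, dissimilar_duplicate)

print(f"similar duplicate:    {sim_similar:.2f}")
print(f"dissimilar duplicate: {sim_dissimilar:.2f}")
```

In this sketch the similar duplicate scores high because it reuses the words of the existing report, while the dissimilar duplicate shares no words with it and scores zero, so any threshold on textual similarity alone would miss it.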
Hence, there is a need for a method and system for detection of duplicate bug reports. Further, there is also a need for a method and system that can be used in an online scenario for detection of all types of duplicates.