Data classification is the process of sorting and categorizing data into various distinct classes or categories and is performed for various business or other objectives. For example, data classification is at the root of any machine learning implementation. The data has to be scrambled, structured, and mined to determine the hidden patterns. The knowledge of historical data and their categorization enables prediction of the category of any new data (e.g., problem statement, logged tickets, etc.).
Various algorithms such as LSI (Latent Semantic Indexing), SVM (Support Vector Machines) may be employed to perform data classification. However, these algorithms typically work on a large dataset (popularly called ‘Big Data’) where the number of data or records is huge, of the order of 1 million. In many scenarios, the trained data set will be very small as low as about 400 records to as high as about 50,000 records. Even with the best case scenario of having 50,000 records would not imply that the data has covered a maximum number of possible categories (e.g., a large number of possible issues that the customer encounters). As will be appreciated, any classification algorithm performs well only if it has been trained for as many data types with as much variance as possible. The performance of above mentioned algorithms has therefore been empirically proven to be non-satisfactory while working on a small dataset comprising of small number of data or records.
Further, a dataset is classified into various classes based on some parameters and any new data is categorized into one of those classes for further processing. The boundaries delineating these classes may be precise or imprecise. For the small dataset, the boundaries between the classes are typically imprecise. This is because the machine learning algorithm doesn't get to see enough variance of the data when the training dataset is small. In other words, the data in small dataset is not enough to create clear and precise boundaries. These imprecise boundaries lead to incorrect prediction, thereby reducing the model efficacy and classification accuracy. For example, because of these overlapping class boundaries, an incident ticket or a problem statement may be classified in multiple classes despite being closer to being in a particular class.