As organizations generate and retain large amounts of data, it becomes increasingly important for those organizations to effectively classify this data. Data may be classified using information about the data, such as the names of data files, the names of the users that created or modified the data, or the source or location of the data. However, information about the data may not be useful to all users because such information lacks context. For example, when viewing a document data file, information about a document does not provide the document's purpose or its subject matter. Only someone who has previously interacted with the document can provide this contextual information. Therefore, in order to give context to users who encounter a document for the first time, many organizations will use the document's content as a way to classify it. For example, documents pertaining to a certain subject or category may be located in the same folder, and the folder may be labeled with the name of the category. Classifying data by content makes it easier for a user to identify the data's general subject matter or purpose without having to review the data himself.
It can be extremely tedious to determine the content of data, since it requires that someone or something review the data and provide a category for classification. In situations where the data is a document file, someone must read the document to identify the appropriate classification for that document. In order to increase efficiency, organizations are developing and applying various “classifier algorithms” in order to automate the data classification process. Applying a classifier algorithm to classify data is known as “statistical classification,” since the algorithm is typically based upon a statistical analysis of the data for classification. Statistical classification may apply one or more known classifier algorithms, such as Naïve Bayes, Maximum Entropy, Support Vector Machines, Decision Tree, Neural Network (Multi-layer Perceptron), k-Nearest Neighbors, Gaussian Mixture Model, Gaussian, etc. A programmer may write a computer program to statistically classify one or more sets of data files into common categories based on information within the files, i.e. their content. The computer program may also “study” a set of documents that have been previously classified by category in order to locate other documents that share the classified documents' category. In either case, the computer program will typically require “training” or “machine learning” in order to learn what content matches which category.
An exemplary training session may begin with an initial set of data or document files, and the computer classifier algorithm program may be given a list of terms or patterns to search for in the initial data set. The computer program may also be given a list of categories that are appropriate for those terms or patterns. The computer program thereby “learns” how to classify data by looking for keywords or patterns, then matching those keywords or phrases to a pre-determined category. Thus, when the training session is complete, the computer can run the program on other sets of data during one or more classification sessions. During these later classification sessions, the computer program will apply what it learned from the training session and will search for certain terms or patterns in a data file, and once a sufficient amount of these terms or patterns are found in the file, the computer program will assign a category, or classify, the file.
Once the training session has ended, there is often no way to add additional terms, patterns or categories to the classifier algorithm program. The accuracy of the classifier algorithm program is often dependent upon the initial training session. The more terms, patterns and matching categories that are provided to the computer during the training session, the more accurate the classifier algorithm results. Therefore, it is important that during the training session, the computer receives every possible term and pattern, as well as every possible category, to ensure that the particular classifier algorithm may be used in future classification sessions. However, it is difficult to predict the content of future data. As a result, terms and patterns and/or categories missing from the initial training session can handicap the subsequent classification sessions, resulting in inaccurate data classifications. What is therefore needed is a way to implement a classifier algorithm that accepts new terms, patterns and categories, even after a training session. In other words, what is needed is a classifier algorithm that can “evolve” for future sessions, or that can build upon the knowledge gained from previous sessions.
The concept of algorithms that evolve, i.e., “evolutionary algorithms,” is often applied in situations where solutions are mere optimizations rather than concrete and consistent results, or where there may not be exact solutions for a problem. Evolutionary algorithms are commonly used in search engine technology, and include Hill Climbing, Simulated Annealing, Genetic Algorithms and other mathematical optimization techniques. These algorithms apply concepts inspired by evolutionary biology, namely, recombination, selection, mutation and reproduction. In general, an evolutionary algorithm works with an initial set of data, called an “initial population.” The initial population may be a randomly selected set of solutions for a problem. For example, if the problem is the fastest route to a particular destination on a map, the initial population may be a four or five routes selected from an infinite number of pathways. From this initial population, two or three may be selected as the shortest routes. Two of these routes may be then combined to result in a new offspring route population, i.e., recombination. Alternatively or additionally, a change to one of the routes, or “mutation,” may be introduced, resulting in a different offspring population. The offspring populations of recombination or mutation may then be used to render another set, or population of routes. Each new population creation session can be thought of as a successive generation. Depending upon the way the routes are viewed, the “best” route or routes may ultimately emerge from each subsequent generation.
As previously mentioned, evolutionary algorithms are effective when trying to discover optimized solutions for a problem with changing variables. For example, evolutionary algorithms are typically used in Internet search engine technology because the population of search results continues to grow, and each search is likely to yield different and often better results each time the search is performed. Evolutionary algorithms are also useful in robotics, engineering design, cryptography and other multi-variable problems. Because of its inherent ability to solve these types of problems, it would be beneficial to apply an evolutionary algorithm to the field of data classification.
However, simply applying an evolutionary algorithm does not guarantee optimal or easily understood results. These types of algorithms are typically left to solve problems independently by applying automated encoded mathematical operations. Similar with other encoded classifier algorithms, these operations and computer programs are not readable by humans because they resemble computer program code or utilize mathematical notations that are incomprehensible to most laypeople. As such, even if someone wanted to view, edit or optimize the algorithm, he would be unable to do so easily without having the requisite knowledge. What is therefore needed is a classifier algorithm that enables any user to edit, add or remove terms, patterns and categories used for classification.
In general, what is needed is a more improved and easily implemented classifier algorithm and method. This improved method should be able to classify data, such as document files, without regard to origin or subject. What is further needed is a way to effectively monitor the method of classification in order to improve the method. What is further needed is a way to easily modify the method of classification, if necessary.