The problem of understanding free form data content is motivated, in part, by a need to search and analyze problem or trouble tickets (PT or TT), a task which cannot be performed effectively on the original free form textual data. An example of an application that requires both a better understanding of the free form data content and a more effective representation of the data for computers is the process followed by Call Center personnel to resolve customer information technology (IT) problems.
A Call Center offers support for customers to help them solve the problems they experience with commercial products. In the case of IT Operations Call Centers, these products can include, for example, hardware and software. A problem ticket is a record of a problem that a customer is experiencing, of the subsequent telephone call and electronic mail (e-mail) exchanges with the technical support personnel on that issue, and of any other information that the technical support personnel consider relevant to describing or solving the issue. Thus, when a member of the technical support personnel needs to solve a problem, he or she can first check whether the problem has already been reported for another customer. If it has, he or she can read how the problem was fixed and avoid spending time solving a problem that other people have already solved.
The information of interest (for example, the problem description and resolution) for level one and level two personnel in a Call Center is recorded by using specific PT management tools, typically in free form text. Thus, most of the useful PT data is not explicitly structured. It is highly noisy (for example, it contains inconsistent formatting) and very heterogeneous in content (for example, natural language, system generated data, domain specific terms, etc.), making it difficult to effectively apply common data mining techniques to analyze and search the raw data.
Existing approaches for automatically searching for a particular topic in a PT collection of free form documents retrieve an overwhelmingly large number of irrelevant tickets, leaving the technical assistance personnel with the tedious work of manually searching for relevant data buried in the free form text.
Existing approaches to discovering text features are primarily based on manual construction from extensive experience with the data. The drawback of manually producing features is that an expert needs to read and understand a large volume of data to create a set of relevant features. Some existing approaches have focused on discovering new textual features based on term relationships and on additional resources other than the training data, such as dictionaries. These efforts mainly focus on word and/or phrase tagging, such as part-of-speech tagging and name tagging. However, none of these complex text features has been used in labeling units of text, that is, to recognize the information type of a particular unit of text rather than only of particular words and/or phrases.
Existing approaches include, for example, U.S. Pat. No. 6,829,734 entitled “Method for discovering problem resolutions in a free form computer helpdesk data set,” which includes a method and structure for discovering problem resolution in a helpdesk data set of problem tickets based on using an enumerated set of phrases that have been identified as indicating diagnosis, instruction, or corrective action.
Another existing approach includes U.S. Pat. No. 6,892,193 entitled “Method and apparatus for inducing classifiers for multimedia based on unified representation of features reflecting disparate modalities,” which includes a method to perform categorization (classification) of multimedia items.
Also, U.S. Pat. No. 7,106,903 is an existing approach entitled “Dynamic partial function in measurement of similarity of objects,” which includes a method of measuring similarity of a first object represented by a first set of feature values to a second object represented by a second set of feature values. U.S. Patent Application No. 2003/0154181, entitled “Document clustering with cluster refinement and model selection capabilities,” includes a document partitioning (flat clustering) method that clusters documents with high accuracy and accurately estimates the number of clusters in the document corpus.
Another existing approach includes U.S. Patent Application No. 2003/0167163, entitled “Inferring hierarchical descriptions of a set of documents,” which includes a method for automatically determining groups of words or phrases that are descriptive names of a small set of documents. Also, U.S. Patent Application No. 2003/0226100, entitled “Systems and methods for authoritativeness grading, estimation and sorting of documents in large heterogeneous document collections,” includes a method for determining the authoritativeness of a document based on textual, non-topical cues.
U.S. Patent Application No. 2006/0026203, entitled “Method and system for discovering knowledge from text documents,” includes a method for discovering knowledge from text documents. Also, U.S. Patent Application No. 2006/0143175, entitled “System and method for automatically classifying text,” includes a method for automatically classifying text into categories.
Also, U.S. Patent Application No. 2006/0179016, entitled “Preparing data for machine learning,” includes a method for feature selection. Additionally, U.S. Patent Application No. 2006/0222239, entitled “Systems and methods for detecting text,” includes employing a boosted classifier and a transductive classifier to provide accurate and efficient text detection systems and/or methods. U.S. Patent Application No. 2006/0255124, entitled “Method and system for discovering significant subsets in collection of documents,” includes a method of discovering a significant subset in a collection of documents, the method including identifying a set of documents from a plurality of documents based on a likelihood that documents in the set of documents.
One disadvantage of the existing approaches is that their solutions for text classification do not directly address automatic feature generation. In addition, the existing approaches that do address feature generation base it on traditional keyword features and, in most cases, exclusively use documents that have a known structure.
It would thus be desirable to overcome the limitations in previous free form data feature discovery approaches.