1. Field of the Invention
The present invention generally relates to a conversion and storage technique for representing occurrences of dictionary terms in a document corpus including unstructured text documents. This representation in random access memory (RAM) would typically be used in data mining of the document corpus. Specifically, the invention provides a small sparse matrix representation that is considerably smaller than the conventional sparse matrix and dense matrix representations.
2. Description of the Related Art
Free form computer helpdesk data sets consist primarily of short text descriptions, composed by the helpdesk operator for the purpose of summarizing what problem a user had and what was done by the helpdesk operator to solve that problem. A typical text document (known as a problem ticket) from this data set consists of a series of exchanges between an end user and an expert helpdesk advisor, for example:
1836853 User calling in with WORD BASIC error when opening files in word. Had user delete NORMAL.DOT and had her reenter Word. she was fine at that point. 00:04:17 ducar May 2″07″05″656PM
Problem tickets may have only a single symptom and resolution pair as in the above example, or they may span multiple questions, symptoms, answers, attempted fixes, and resolutions—all pertaining to the same basic issue. Problem tickets are opened when the user makes the first call to the helpdesk and closed when all user problems documented in the first call are finally resolved in some way. Helpdesk operators enter problem tickets directly into the database. Spelling, grammar and punctuation are inconsistent. The style is terse and the vocabulary very specialized. Note also that each keyword is used only once or very few times.
One potential benefit to be gained from helpdesk data sets is to “mine” them to discover general categories of problems. Once a meaningful “problem categorization” has been discovered, individual categories can be studied to find automated solutions to address future user problems in this area. A typical example of such a solution would be an entry in a “Frequently Asked Questions” section of a customer support web site.
Unfortunately, it is very difficult to categorize large amounts of unstructured text information by hand. Automated methods, such as text clustering, provide a means for computers to quickly create categorizations from unstructured text with little or no intervention from the human expert. Typically, text clustering algorithms work by categorizing documents based on term occurrence, where a “term” is a word or phrase contained in a “dictionary” of commonly occurring and meaningful words and phrases.
To work effectively on large data sets in a short time, such algorithms need a representation of the text corpus that can reside in computer memory (RAM). This representation, in effect a matrix of documents by dictionary terms, must indicate for each document in the text corpus the number of times each dictionary term occurs in that document. The size in memory of this matrix is typically the limiting factor which determines how large a text corpus may be categorized in a given computer hardware configuration.
To illustrate the various storage representations of a document corpus, FIG. 1 shows a document corpus having three documents. FIG. 2 shows one possible dictionary developed from this same document corpus. FIG. 3 shows a basic representation of the document corpus using a dense matrix. In effect, one axis of the matrix contains an ordered listing of all dictionary terms and the second axis contains an ordered listing of the documents in the corpus. The matrix is then filled with the number of occurrences of each dictionary term in each document, and each document can be considered to represent a vector in dictionary space.
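The dense integer matrix described above can be sketched as follows. This is a minimal illustration only: the corpus, the dictionary, the function name build_dense_counts, and the whitespace tokenization are all assumptions made for the example and are not the contents of FIG. 1, FIG. 2 or FIG. 3.

```python
def build_dense_counts(documents, dictionary):
    """Return one row per document and one column per dictionary term.

    Each cell holds the number of times the term occurs in the document.
    Tokenization by lowercasing and splitting on whitespace is an
    assumption of this sketch, not something the source specifies.
    """
    column_of = {term: col for col, term in enumerate(dictionary)}
    matrix = []
    for text in documents:
        row = [0] * len(dictionary)
        for word in text.lower().split():
            col = column_of.get(word)
            if col is not None:  # words outside the dictionary are ignored
                row[col] += 1
        matrix.append(row)
    return matrix


# Hypothetical three-document corpus and four-term dictionary.
documents = [
    "printer error on printer",
    "password reset",
    "printer password error",
]
dictionary = ["error", "password", "printer", "reset"]

dense_counts = build_dense_counts(documents, dictionary)
for row in dense_counts:
    print(row)
```

Each printed row is one document expressed as a vector in dictionary space. Even in this small example several cells are zero, which is the property the sparse representations exploit.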
FIG. 4 shows the floating point format of the same dense matrix (e.g., the preferred format for data mining algorithms). This floating point format represents the same information as the integer format, but each document is now “normalized” into a unit vector, thereby eliminating the effect of document length. As can be easily seen, each floating point number is the integer value multiplied by the reciprocal of the square root of the sum of the squared integer values in its row, the well known process of normalizing a vector, where the vector is considered as a matrix row. If the document corpus contains a large number of short documents and the dictionary contains a large number of terms, then it is easy to see that the dense matrix representation would be filled mostly with zeroes, since any row representing a document would contain only a few of the many dictionary terms. The matrix in FIG. 4 takes 48 bytes in RAM, assuming a short integer takes two bytes and a floating point number takes four bytes.
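The row normalization described above can be sketched as follows. The count matrix and the name normalize_rows are hypothetical and chosen for illustration; they are not taken from FIG. 3 or FIG. 4. The treatment of an all-zero row is likewise an assumption, since the source does not discuss that case.

```python
import math


def normalize_rows(counts):
    """Scale each row to unit Euclidean length.

    Every value is multiplied by the reciprocal of the square root of
    the sum of the squared values in its row. A row with no dictionary
    terms is left as zeros (an assumption of this sketch).
    """
    normalized = []
    for row in counts:
        norm = math.sqrt(sum(value * value for value in row))
        scale = 1.0 / norm if norm > 0 else 0.0
        normalized.append([value * scale for value in row])
    return normalized


# Hypothetical integer counts: three documents by four dictionary terms.
counts = [
    [1, 0, 2, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]

unit_rows = normalize_rows(counts)
for row in unit_rows:
    print([round(value, 4) for value in row])
```

For this hypothetical three by four example, the dense floating point matrix holds twelve values, which at four bytes each is 48 bytes regardless of how many of the values are zero.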
In order to conserve space, a sparse matrix representation is usually employed. Such a representation indicates the position and value of each non-zero matrix element. A floating point example is shown in FIG. 5. This matrix is developed by first assigning a unique integer for each dictionary term and then filling in the unique integer corresponding to document words (see FIG. 2). Associated with each document word is a normalization factor, calculated identically to that explained for the dense matrix. The matrix in FIG. 5 would occupy 36 bytes in RAM.
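A conventional sparse representation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the layout of FIG. 5: the counts are hypothetical, the names to_sparse_rows and sparse_bytes are invented for the example, and the byte estimate counts only a two-byte term identifier and a four-byte floating point value per non-zero element, ignoring any per-document bookkeeping.

```python
import math


def to_sparse_rows(counts):
    """Keep only non-zero elements as (term_id, normalized_value) pairs.

    term_id is the column index of the dictionary term; the value is the
    count divided by the Euclidean length of the document's row.
    """
    sparse = []
    for row in counts:
        norm = math.sqrt(sum(value * value for value in row))
        entries = [(term_id, value / norm)
                   for term_id, value in enumerate(row) if value != 0]
        sparse.append(entries)
    return sparse


def sparse_bytes(sparse, id_bytes=2, value_bytes=4):
    """Estimate storage for the non-zero elements only (assumed sizes)."""
    nonzero = sum(len(entries) for entries in sparse)
    return nonzero * (id_bytes + value_bytes)


# Hypothetical integer counts: three documents by four dictionary terms.
counts = [
    [1, 0, 2, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]

sparse_rows = to_sparse_rows(counts)
estimated_size = sparse_bytes(sparse_rows)
print(sparse_rows)
print(estimated_size, "bytes for non-zero elements")
```

The saving over the dense form grows with the proportion of zero elements, which is large when documents are short and the dictionary is large. Note that every stored element still carries its own floating point value, which is the overhead discussed next.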
Although the sparse matrix representation of a document corpus provides an improvement in memory requirement over that of the dense matrix, it carries an overhead cost of storing the normalization factor for each term and of keeping track of each separate document. Thus, there remains a need to present document corpus occurrence data in a format that further reduces the RAM requirement.