Information retrieval (IR) techniques have attracted much attention during the past decades, since people are frustrated by being drowned in huge amounts of data while still being unable to obtain useful information. The Vector Space Model (VSM), which is used to represent text documents and define the similarity among them, is the foundation of many information retrieval techniques. Bag of Words (BOW) is the earliest approach used to represent a document as a bag of words under the VSM. In the BOW representation, a document is encoded as a feature vector, with each element in the vector indicating the presence or absence of a word in the document, weighted by Term Frequency and Inverse Document Frequency (TFIDF) indexing. However, the major limitation of BOW is that it only retains the frequency of the words in the document and loses the sequence information.
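The loss of sequence information can be illustrated with a minimal sketch. The function name `tfidf_vectors`, the whitespace tokenizer, the unsmoothed weight tf * log(N / df), and the toy documents are assumptions made for illustration only; practical systems use more elaborate tokenization and smoothed weighting.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one TF-IDF vector per document.

    Illustrative assumption: weight = tf * log(N / df), with no smoothing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # One vector dimension per distinct word in the collection.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # a missing word has a count of 0
        vectors.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, vectors


# Hypothetical toy collection: the first two documents differ only in word order.
docs = ["dog bites man", "man bites dog", "cat sleeps"]
vocab, vecs = tfidf_vectors(docs)
same_vector = vecs[0] == vecs[1]
print(same_vector)
```

Under these assumptions, "dog bites man" and "man bites dog" receive identical vectors, since only word counts enter the representation.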
In the past decade, attempts have been made to incorporate word-order knowledge into the vector space representation. The N-gram statistical language model is a well-known example. In the N-gram representation, the entries of the document vector are strings of n consecutive words extracted from the collection. These entries are effective approximations: they not only retain word-order information but are also language independent. However, the high dimensionality of the resulting feature vectors makes many powerful information retrieval techniques, e.g., Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA), infeasible for large datasets.
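A minimal sketch of word-level N-gram extraction, assuming simple whitespace tokenization, is given here for illustration. The helper name `word_ngrams` and the toy collection are hypothetical. It shows both that word order is retained and that the number of distinct entries grows with n.

```python
def word_ngrams(text, n):
    """Return the strings of n consecutive words in text (whitespace tokens)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical toy collection.
docs = ["dog bites man", "man bites dog", "dog bites dog"]

# Distinct entries (vector dimensions) for n = 1 and n = 2.
unigram_vocab = {g for doc in docs for g in word_ngrams(doc, 1)}
bigram_vocab = {g for doc in docs for g in word_ngrams(doc, 2)}

# Documents that differ only in word order now yield different entries.
first = word_ngrams(docs[0], 2)
second = word_ngrams(docs[1], 2)

print(len(unigram_vocab), len(bigram_vocab))
print(first, second)
```

Even in this tiny collection the bigram vocabulary is larger than the unigram vocabulary. On realistic collections the number of distinct N-grams can grow far faster, which is the dimensionality problem noted above.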
During the past few years, IR researchers have proposed a variety of effective VSM-based representation approaches for text documents. Such a representation facilitates the classification of a document into an associated class. However, since the volume of available text data is increasing very fast, researchers often ask, “Are the further improvements likely to require a broad range of techniques in addition to the IR area?” Consequently, one may seek a new model for text document representation based on new techniques.
Being able to classify documents in an effective and efficient manner is becoming more important with the increasing number of available documents on networked computer systems. A model for representing the content of a document would assist in classifying the document.