1. Technical Field
The present teaching relates to methods, systems and programming for data processing. Particularly, the present teaching is directed to methods, systems, and programming for digital data characterization and systems incorporating the same.
2. Discussion of Technical Background
The advancement of the Internet has made it possible to make a tremendous amount of information accessible to users located anywhere in the world. With the explosion of information, new issues have arisen. First, given all the information available, efficiently and effectively identifying data of interest poses a serious challenge. Much effort has been put into organizing the vast amount of information to facilitate the search for information in a more systematic manner. Along that line, different techniques have been developed to classify content into meaningful categories in order to facilitate subsequent searches or queries. Imposing organization and structure on content has made it possible to achieve more meaningful searches and has promoted more targeted commercial activities.
In addition to categorizing content, efforts have been made to seek effective representation of data so that processing related to searches and/or queries can be made more efficient in order to identify what a user is asking for. For example, in the context of textual data, traditional information retrieval (IR) systems rely on matching specific keywords in a query to those in the documents to find the most relevant documents in a collection. This is shown in FIG. 1(a) (Prior Art), where an input document 110 is analyzed by a keyword extractor 120 that produces a keyword-based representation of the input document 110. There are a number of well-known retrieval models associated with keyword-based approaches, including vector space models, probabilistic models, and language models. Language-model-based IR approaches include the use of, e.g., unigrams, bi-grams, N-grams, or topics. Although such language-model-based approaches have attracted much attention in the IR field, they have various limitations. In practice, use of a language model that is more complex than a simple unigram-based model is often constrained by computational complexity. Another drawback associated with a traditional keyword-based approach relates to synonymy and polysemy of keywords: different words may express the same concept, and a single word may carry several meanings, so literal keyword matching can miss relevant documents or retrieve irrelevant ones.
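By way of illustration only, the synonymy drawback of keyword matching can be sketched in Python as follows. The sketch assumes a simple unigram term-count representation compared by cosine similarity; the function names `keyword_vector` and `cosine` and the example texts are hypothetical and are not taken from any particular IR system.

```python
import math
from collections import Counter

def keyword_vector(text):
    """Unigram bag-of-words: lowercase, split on whitespace, count terms."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical query and documents.
query = keyword_vector("car repair")
doc_same_words = keyword_vector("car repair manual")
doc_synonyms = keyword_vector("automobile maintenance guide")

# The document sharing literal keywords scores high.
score_match = cosine(query, doc_same_words)
# The document on the same topic but with different words shares no terms.
score_synonym = cosine(query, doc_synonyms)
```

In this sketch the second document shares no literal term with the query, so its score is zero even though it concerns the same subject, which is the kind of miss attributed above to synonymy.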
In an attempt to mitigate these drawbacks of keyword-based approaches, data representation and search based on the semantics of an input document have been developed. In semantic-based systems, the focus has shifted from keywords to the meaning of a document. This is depicted in FIG. 1(b) (Prior Art), where an input document 160 is analyzed first by a feature extractor 170 that produces a feature vector. The feature vector is then forwarded from the feature extractor 170 to a semantic estimator 180, which analyzes the feature vector and determines the semantics of the input document. The semantic estimator 180 produces a semantic-based representation of the input document 160. Such a semantic-based representation can be stored and used in future searches. In implementing the semantic estimator 180, natural language processing techniques have been employed to understand the meaning of each term in queries and documents.
Such techniques sometimes use taxonomies or ontological resources in order to achieve more accurate results. The enormous effort involved in such systems prompted development of automated methods that can learn the meaning of terms or documents from a document collection. For example, a so-called autoencoder (known in the art) has been developed for learning and subsequently extracting semantics of a given document. Such an autoencoder may be deployed to implement the semantic estimator 180. In this case, an autoencoder takes the feature vector shown in FIG. 1(b) as an input and then identifies the most relevant features that represent the semantics of the input document 160.
An autoencoder uses an artificial neural network for learning an efficient coding. By learning a compressed representation for a set of data, an autoencoder provides a means for dimensionality reduction and feature extraction. The concept of the autoencoder was originally used for image compression and decompression. Recently, it has been adopted for and applied to textual information to learn the semantic features in a text collection. The compact semantic codes output from an autoencoder can be used both to represent the underlying textual information and to identify similar documents. Because the input dimensionality of the autoencoder must be limited to make training tractable, only a small subset of the corpus vocabulary can contribute to the semantic codes. As a result, the semantic codes output from an autoencoder may not adequately capture the semantics of an input document. In addition, document collections in many retrieval applications are often updated more frequently than the autoencoder can practically be retrained, given the computational cost of training. These limitations raise the question of whether the resulting condensed semantic code provides a sufficiently accurate representation of the information in the original feature space.
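For illustration, a minimal autoencoder of the general kind described above may be sketched as follows. This is a sketch under stated assumptions and not an implementation of any particular prior art system: it assumes a single hidden layer with a tanh encoder and a linear decoder, trained by plain gradient descent on mean squared reconstruction error, and it uses synthetic data in place of real document feature vectors. The names `train_autoencoder` and `encode` and all parameter values are hypothetical.

```python
import numpy as np

def train_autoencoder(X, code_dim, epochs=1000, lr=0.2, seed=0):
    """Train a one-hidden-layer autoencoder on rows of X.

    X has shape (n_documents, n_features); n_features plays the role of
    the limited input vocabulary discussed above.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, code_dim))
    W_dec = rng.normal(scale=0.1, size=(code_dim, d))
    losses = []
    for _ in range(epochs):
        code = np.tanh(X @ W_enc)      # compact code, shape (n, code_dim)
        recon = code @ W_dec           # reconstruction, shape (n, d)
        err = recon - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        grad_recon = 2.0 * err / err.size
        grad_W_dec = code.T @ grad_recon
        grad_code = grad_recon @ W_dec.T
        grad_pre = grad_code * (1.0 - code ** 2)   # derivative of tanh
        grad_W_enc = X.T @ grad_pre
        W_enc -= lr * grad_W_enc
        W_dec -= lr * grad_W_dec
    return W_enc, W_dec, losses

def encode(X, W_enc):
    """Map feature vectors to their compact codes."""
    return np.tanh(X @ W_enc)

# Synthetic stand-in for document feature vectors: 50 documents over an
# 8-term vocabulary, generated from 2 underlying factors.
rng = np.random.default_rng(1)
X = rng.random((50, 2)) @ rng.random((2, 8)) / 2.0

W_enc, W_dec, losses = train_autoencoder(X, code_dim=2)
codes = encode(X, W_enc)   # one 2-dimensional code per document
```

The input width of `W_enc` is fixed at training time, which reflects the constraint noted above: terms outside the chosen input vocabulary cannot contribute to the codes without retraining.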
Another existing automated technique, called Trainable Semantic Vectors (TSV), learns the meaning of each term extracted from a document collection with regard to a predefined set of categories or topics, and creates a semantic vector for each document. The generated semantic vectors can then be used to find similar documents. However, TSV is a supervised learning technique, which requires pre-categorized documents in order to train the TSV properly and obtain a semantic representation model for each term.
Another automated method, called Latent Semantic Indexing (LSI), identifies latent semantic structures in a text collection using an unsupervised statistical learning technique that can be based on Singular Value Decomposition (SVD). Major developments along the same line include probabilistic Latent Semantic Indexing (pLSI) and Latent Dirichlet Allocation (LDA). Those types of approaches create a latent semantic space to represent both queries and documents, and use the latent semantic representation to identify relevant documents. The computational cost of these approaches prohibits the use of a higher dimensionality in the semantic space and, hence, limits their ability to learn effectively from a data collection.
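As an illustrative sketch only, an SVD-based latent semantic space of the kind described above may be constructed as follows. The sketch assumes raw term counts, a truncated SVD computed with NumPy, and projection of both documents and queries onto the top `k` left singular vectors; the toy vocabulary, the documents, and the helper names `term_vector`, `to_latent`, and `cosine` are hypothetical.

```python
import numpy as np

# Hypothetical toy vocabulary and document collection.
terms = ["car", "automobile", "engine", "banana", "fruit"]
docs = ["car engine", "automobile engine", "banana fruit", "fruit"]

def term_vector(text):
    """Count occurrences of each vocabulary term in the text."""
    counts = np.zeros(len(terms))
    for tok in text.split():
        if tok in terms:
            counts[terms.index(tok)] += 1.0
    return counts

# Term-document matrix: one row per term, one column per document.
A = np.column_stack([term_vector(d) for d in docs])

# Truncated SVD: keep the k largest singular directions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]

def to_latent(vec):
    """Project a term-count vector (document or query) into the latent space."""
    return U_k.T @ vec

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

query = term_vector("car")
doc_auto = A[:, 1]    # "automobile engine": shares no term with the query
doc_fruit = A[:, 2]   # "banana fruit": unrelated topic

raw_sim = cosine(query, doc_auto)
latent_sim = cosine(to_latent(query), to_latent(doc_auto))
latent_unrelated = cosine(to_latent(query), to_latent(doc_fruit))
```

In this toy collection, "car" and "automobile" both co-occur with "engine", so the query and the second document fall along the same latent direction even though they share no literal term. The decomposition is computed over the whole term-document matrix, which indicates why the cost grows with collection size and with the retained dimensionality `k`.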
The above-mentioned prior art solutions all have limitations in practice. Therefore, there is a need to develop an approach that addresses those limitations and provides improvements.