This invention relates generally to the field of information mining, and more particularly pertains to an automated intelligent information mining technique.
With the explosive growth of available information sources, it has become increasingly necessary for users to utilize information mining techniques to find, extract, filter, and evaluate desired information. Automating such techniques for text documents can be difficult because the documents lack an inherently defined structure and the information they contain is highly dimensional. Generally, text documents lack a defined structure because text can arrive in different forms from various sources, such as a database, e-mail, the Internet, or a telephone. Also, text documents coming from various sources can be highly dimensional (i.e., can contain various types of information beyond alphanumeric data, such as spatial and temporal data), which can cause disorder in the information mining process.
Current information mining techniques, such as hierarchical keyword searches, statistical and probabilistic techniques, and summarization using linguistic processing, clustering, and indexing, dominate the unstructured text processing arena. The most prominent and successful of these techniques require huge databases of domain-specific keywords, comprehensive domain-specific thesauruses, and computationally intensive processing. There has been a trend toward developing information mining techniques that are domain independent, that are adaptive in nature, and that exploit contextual information present in text documents to improve processing speed. Current techniques that exploit the contextual information present in a text use self-organizing maps (SOMs).
SOMs are among the most popular artificial neural network algorithms and belong to the category of competitive learning networks. SOMs are generally based on unsupervised learning (training without a teacher), and they provide a topology-preserving mapping that retains the contextual information of an unstructured document by mapping high-dimensional data (the unstructured document) onto a two-dimensional map (the structured document). The map consists of map units, or neurons, which usually form a two-dimensional grid; hence the mapping is from a high-dimensional space onto a plane. Thus, SOMs serve as a tool for forming clusters when analyzing high-dimensional data. Word category maps are SOMs that have been organized according to word similarities, measured by the similarity between the short contexts of the words. Contextually interrelated words tend to fall into the same or neighboring map nodes. Nodes may thus be viewed as word categories.
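The mapping described above, from high-dimensional vectors onto a two-dimensional grid of map units, can be illustrated with a minimal sketch of a conventional SOM training loop. This is a generic illustration and not the technique of the present invention. The function names `best_matching_unit` and `train_som`, the 4 x 4 grid, the Gaussian neighborhood, and the linearly decaying learning rate and neighborhood width are all assumptions chosen for brevity.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return (row, col) of the map unit whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rows x cols SOM on a list of equal-length vectors (assumed schedule)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Each map unit starts with a random weight vector in [0, 1]^dim.
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)      # neighborhood shrinks, floor at 0.5
        for x in data:
            br, bc = best_matching_unit(weights, x)  # competitive step
            for r in range(rows):
                for c in range(cols):
                    # Gaussian neighborhood on the grid: units near the winner
                    # move more, which is what preserves topology.
                    g = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * g * (x[i] - w[i])
    return weights

# Usage: four 4-dimensional vectors forming two loose groups.
data = [[0.0, 0.0, 0.1, 1.0], [0.1, 0.0, 0.0, 0.9],
        [1.0, 0.9, 1.0, 0.0], [0.9, 1.0, 1.0, 0.1]]
som = train_som(data)
for x in data:
    print(x, "->", best_matching_unit(som, x))
```

Each input vector is assigned to the grid position of its best matching unit, so similar inputs tend to land on the same or neighboring units, which is the clustering behavior described above.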
Currently, SOMs that include contextual information are formed from average contexts, using conditional probability for a word that occurs more than once. This methodology may not be globally applicable across varying text documents, and it can lose short contextual information. Also, current techniques using SOMs have low learning rates (i.e., they learn slowly), which can be a major bottleneck in achieving higher processing speeds.
Therefore, there is a need in the art for an intelligent information mining technique that is domain independent and adaptive in nature, that can exploit contextual information present in text documents, and that has an improved learning rate without losing short contextual information.
The present invention provides an automated intelligent information mining technique for various types of information mining applications, such as data and text mining applications, identification of a signal from a stream of signals, pattern recognition applications, and/or natural language processing applications. The technique includes receiving product-related queries and respective product-related information from various text sources, extracting key phrases from the received product-related information and queries, and transforming each of the extracted key phrases into a unique numerical representation. The technique further includes generating two or more layers of contextual relation maps by mapping the transformed product and query key phrases, respectively, to two-dimensional maps using a self-organizing map and a combination of a Hessian matrix and a Perturbation (function approximation of neighborhood) technique to speed up the learning process. Further, the technique includes forming word clusters and constructing corresponding key phrase frequency histograms for each of the generated contextual relation maps. The technique then includes generating two or more two-dimensional structured document maps of the product and queries, respectively, from the constructed key phrase frequency histograms and the generated contextual relation maps using the self-organizing map and the combination of the Hessian matrix and the Perturbation technique. Answers to queries are then selected by masking the generated product-related and query-related two-dimensional structured document maps.
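A few of the steps summarized above (key phrase extraction, transformation into a unique numerical representation, frequency histograms, and answer selection) can be sketched in heavily simplified form. This sketch omits the self-organizing map layers and the Hessian matrix and Perturbation speed-up entirely. Every name in it is hypothetical, the stop-word list and the base-37 encoding are assumptions, and "masking" is interpreted here, purely for illustration, as a histogram overlap between a query and each product text.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
# Assumed stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "of", "for", "to", "my",
              "how", "do", "i", "and", "in", "on"}

def extract_key_phrases(text):
    """Hypothetical extractor: lowercase, keep a-z and 0-9, drop stop words."""
    cleaned = "".join(ch if ch in ALPHABET else " " for ch in text.lower())
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def encode_phrase(phrase):
    """Map a phrase to an integer by reading its characters as base-37 digits
    1..36 (no zero digit), so distinct phrases over ALPHABET get distinct codes."""
    code = 0
    for ch in phrase:
        code = code * 37 + ALPHABET.index(ch) + 1
    return code

def phrase_histogram(text):
    """Frequency histogram over the numeric codes of the extracted key phrases."""
    return Counter(encode_phrase(p) for p in extract_key_phrases(text))

def mask_score(query_hist, doc_hist):
    """Assumed masking rule: sum of the smaller count for every shared code."""
    return sum(min(count, doc_hist[code]) for code, count in query_hist.items())

def select_answer(query, documents):
    """Return the document whose histogram overlaps most with the query's."""
    q = phrase_histogram(query)
    return max(documents, key=lambda d: mask_score(q, phrase_histogram(d)))

# Usage with made-up product texts.
docs = ["The printer cartridge is replaced by opening the front cover.",
        "The router is reset by holding the power button."]
print(select_answer("How do I reset my router?", docs))
```

In the technique as summarized, the comparison is made between two-dimensional structured document maps rather than raw histograms; the overlap score above only stands in for that step.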
Other aspects of the invention will be apparent on reading the following detailed description of the invention and viewing the drawings that form a part thereof.