With the explosive growth of available information sources, it has become increasingly necessary for users to apply information mining techniques to find, extract, filter, and evaluate desired information. Human translation is generally laborious, expensive, and error-prone, and is not a feasible approach for extracting desired information.
Automating information mining for text documents can be difficult because such documents are in a human-readable and human-understandable format that lacks an inherently defined structure, so they appear as meaningless data to information mining techniques. Text can also arrive in different forms from various sources, such as a database, e-mail, the Internet, or a telephone. In addition, text documents coming from various sources can be high dimensional in nature, containing the syntactic and semantic (contextual) structure of words and phrases as well as temporal and spatial information, which can cause disorder in the information mining process.
Current information mining techniques such as hierarchical keyword searches, statistical and probabilistic techniques, and summarization using linguistic processing, clustering, and indexing dominate the unstructured text processing arena. The most prominent and successful of the current information mining techniques require huge databases including domain specific keywords, comprehensive domain specific thesauruses, computationally intensive processing techniques, laborious human interface and human expertise.
There has been a trend toward information mining techniques that are domain independent, adaptive in nature, and able to exploit the contextual information present in text documents to improve processing speed. Current information mining techniques use self-organizing maps (SOMs) to exploit the contextual information present in the text. SOMs are among the most popular artificial neural network algorithms and belong to the category of competitive learning networks. SOMs are generally based on unsupervised learning (training without a teacher), and they provide a topology-preserving mapping that retains the contextual information of an unstructured document by mapping high dimensional data (the unstructured document) onto a two dimensional map (the structured document). The elements of the map, called map units or neurons, usually form a two dimensional grid, hence the mapping is from a high dimensional space onto a plane. Thus, SOMs serve as a tool for forming clusters when analyzing high dimensional data. Word category maps are SOMs that have been organized according to word similarities, measured by the similarity between the short contexts of the words. Contextually interrelated words tend to fall into the same or neighboring map nodes, so nodes may be viewed as word categories.
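The competitive, unsupervised mapping described above can be illustrated with a minimal sketch. It assumes a small rectangular grid, a Gaussian neighborhood, and linearly decaying learning rate and radius; the function names (`train_som`, `best_matching_unit`), the parameter values, and the toy data are hypothetical choices for illustration and are not taken from any particular disclosed algorithm.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the map unit whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(x, w))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM on data (a list of equal-length float lists)."""
    rng = random.Random(seed)          # fixed seed: results depend on initialization
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0     # assumed initial neighborhood radius
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)             # presentation order also affects the result
        for i in order:
            x = data[i]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                   # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)   # shrinking neighborhood
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                    w = weights[r][c]
                    for k in range(dim):
                        # Pull the unit toward the input, scaled by grid proximity.
                        w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


# Toy data: two well-separated groups in a 3-dimensional input space.
data = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [1.0, 1.0, 1.0], [0.9, 1.0, 0.9]]
som = train_som(data, rows=3, cols=3)
unit_a = best_matching_unit(som, data[0])
unit_b = best_matching_unit(som, data[2])
```

In this sketch, each input is assigned to its best matching unit, and units that are close on the grid are updated together, which is what lets similar inputs settle into the same or neighboring map nodes.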
Pending U.S. patent application Ser. No. 09/825,577, dated May 10, 2002, entitled “INDEXING OF KNOWLEDGE BASE IN MULTILAYER SELF-ORGANIZING MAPS WITH HESSIAN AND PERTURBATION INDUCED FAST LEARNING” discloses such an information mining technique using SOMs that is domain independent and adaptive in nature, that can exploit contextual information present in the text documents, and that can have an improved learning rate without losing short contextual information. One drawback of this technique is that the histogram formed from the clusters depends heavily on the clusters and is specific and sensitive to the cluster boundaries. Elements in or near a boundary may suffer from this rigidity, which might adversely affect the accuracy of the information mining.
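The boundary sensitivity noted above can be illustrated with a small sketch. It assumes, purely for illustration, that the histogram counts how many elements fall into each cluster under a hard nearest-centroid assignment; the referenced application may construct its histogram differently, and the centroids, points, and function names here are hypothetical.

```python
def nearest_cluster(x, centroids):
    """Index of the centroid closest to x (hard assignment)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))


def cluster_histogram(points, centroids):
    """Count how many points are assigned to each cluster."""
    counts = [0] * len(centroids)
    for p in points:
        counts[nearest_cluster(p, centroids)] += 1
    return counts


# Two hypothetical cluster centers; the boundary between them lies at x = 0.5.
centroids = [[0.0, 0.0], [1.0, 0.0]]
points = [[0.1, 0.0], [0.9, 0.1], [0.49, 0.0]]  # the last point sits near the boundary

before = cluster_histogram(points, centroids)
points[2] = [0.51, 0.0]                          # a very small shift across the boundary
after = cluster_histogram(points, centroids)
```

Under these assumptions, a shift of 0.02 in one coordinate moves a whole count from one histogram bin to the other, which is the kind of rigidity that can affect elements in or near a cluster boundary.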
The SOM based algorithm disclosed in the above-mentioned pending application uses heuristic procedures, so termination is not based on optimizing any model of the process or its data. The final weight vectors produced by the algorithm usually depend on the input sequence, and different initial conditions yield different results. It is recommended that several parameters of the self-organizing algorithm, such as the learning rate, the size of the update neighborhood, and the strategy for altering these parameters during learning, be altered from one data set to another to yield useful results. There is a need for an improved adaptive algorithm responsive to changing scenarios and external inputs. There is a further need for uniformity in neighborhood size. There is yet a further need for an algorithm that preserves neighborhood relationships of the input space even though bordering neurons have fewer neighbors than others.
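The non-uniform neighborhood size at the map border can be made concrete with a short sketch. It assumes a rectangular grid and a square update neighborhood of radius 1 (the unit's eight surrounding cells); the grid size, radius, and function name are hypothetical choices for illustration.

```python
def neighbor_count(r, c, rows, cols, radius=1):
    """Number of grid units within `radius` of unit (r, c), excluding the unit itself."""
    count = 0
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            if (dr, dc) == (0, 0):
                continue
            # Units outside the grid do not exist, so border units lose neighbors.
            if 0 <= r + dr < rows and 0 <= c + dc < cols:
                count += 1
    return count


rows, cols = 4, 4
counts = [[neighbor_count(r, c, rows, cols) for c in range(cols)]
          for r in range(rows)]

corner = counts[0][0]    # unit at a corner of the map
edge = counts[0][1]      # unit on an edge but not at a corner
interior = counts[1][1]  # unit away from the border
```

On such a grid, corner and edge units take part in fewer neighborhood updates than interior units, so the effective neighborhood size is not uniform across the map.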