a) Field of the Invention
The invention relates to a system and a method for automated language and text analysis by the formation of a search and/or classification catalog, in which data records are stored by means of a linguistic databank and speech and/or text data are classified and/or sorted on the basis of those data records (keywords and/or search terms). The invention relates in particular to a computer program product for carrying out this method.
b) Description of the Related Art
In recent years, the importance of large databanks, in particular databanks linked in a decentralized form, for example by networks such as the worldwide backbone network Internet, has increased exponentially. More and more information, goods and/or services are being offered via such databanks or networks. This is evident from the omnipresence of the Internet nowadays. The availability and sheer amount of such data have meant, for example, that Internet tools for searching for and finding relevant documents and/or for classifying documents that have been found are now of enormous importance. Tools such as these for decentralized databank structures, or for databanks in general, are known. In this context, the expression “search engines” is frequently used in the Internet, examples being the known Google™ and AltaVista™ engines, or structured, presorted link tables such as Yahoo™.
The problems involved in searching for and/or cataloging text documents in one or more databanks include the following: (1) indexing or cataloging of the content of the documents to be processed (content synthesis), and (2) processing of a search request against the indexed and/or catalogued documents (content retrieval). The data to be indexed and/or catalogued normally comprise unstructured documents, such as texts, descriptions and links. In more complex databanks, the documents may also include multimedia data, such as images, voice/audio data, video data, etc. In the Internet, this may, for example, be data which can be downloaded from a website by means of links.
U.S. Pat. No. 6,714,939 discloses a method and a system of this type for the conversion of plain text or text documents to structured data. The system according to the prior art can be used in particular to check for and/or to find data in a databank.
Neural networks are known in the prior art and are used, for example, to solve optimization tasks, for pattern recognition, for artificial intelligence tasks, etc. Corresponding to biological nerve networks, a neural network comprises a large number of network nodes, so-called neurons, which are connected to one another via weighted links (synapses). The neurons are organized and interconnected in network layers. The individual neurons are activated as a function of their input signals and produce a corresponding output signal. A neuron is activated by the summation over its input signals, each of which is scaled by an individual weighting factor. Neural networks such as these have a learning capability in that the weighting factors are varied systematically as a function of predetermined exemplary input and output values until the neural network produces a desired response within a defined, predictable error range, such as the prediction of output values for future input values. Neural networks therefore have adaptive capabilities for learning and storing knowledge, and associated capabilities for comparing new information with stored knowledge. The neurons (network nodes) can assume a rest state or an energized state. Each neuron has a plurality of inputs and one, and only one, output, which is connected to the inputs of other neurons in the next network layer or, in the case of an output node, represents a corresponding output value. A neuron changes to the energized state when a sufficient number of its inputs are energized, that is to say when the summation over the inputs reaches a specific threshold value of that neuron. The knowledge is stored by adaptation in the weightings of the inputs of a neuron and in the threshold value of that neuron.
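Purely by way of illustration, the threshold behavior of a single neuron described above can be sketched as follows; the function name and the numeric values are illustrative assumptions and not part of the invention:

```python
def neuron_output(inputs, weights, threshold):
    """Return 1 (energized state) if the weighted summation over the
    input signals reaches the neuron's threshold value, otherwise
    return 0 (rest state)."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0

# Two energized inputs are enough to reach the threshold of 1.0:
print(neuron_output([1.0, 1.0, 0.0], [0.6, 0.5, 0.9], 1.0))  # -> 1
```

In a layered network, this output value would in turn be fed to the inputs of the neurons in the next network layer.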
The weightings of a neural network are trained by means of a learning process (see, for example, G. Cybenko, “Approximation by superpositions of a sigmoidal function”, Math. Control, Sig. Syst., 2, 1989, pp. 303-314; M. T. Hagan, M. B. Menhaj, “Training feedforward networks with the Marquardt algorithm”, IEEE Transactions on Neural Networks, Vol. 5, No. 6, pp. 989-993, November 1994; K. Hornik, M. Stinchcombe, H. White, “Multilayer feedforward networks are universal approximators”, Neural Networks, 2, 1989, pp. 359-366, etc.).
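The learning principle, namely the systematic variation of the weighting factors against predetermined exemplary input and output values, can be sketched for a single neuron with the classical perceptron rule. This is an illustrative simplification only; the literature cited above treats the training of general feedforward networks:

```python
def train_perceptron(samples, lr=0.1, epochs=50):
    """Vary the weightings systematically until the neuron reproduces
    the predetermined exemplary (input, output) pairs.
    samples: list of (inputs, target) pairs with target 0 or 1."""
    n = len(samples[0][0])
    weights = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            out = 1 if sum(x * w for x, w in zip(inputs, weights)) + bias >= 0 else 0
            error = target - out
            # Adapt each weighting in proportion to the output error:
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

# Exemplary input/output values for the logical AND function:
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(data)
```

After training, the neuron produces the desired response for all four exemplary patterns.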
In contrast to supervised learning, no desired output pattern is predetermined for the neural network during the learning process of unsupervised learning neural nets. In this case, the neural network itself attempts to achieve a representation of the input data that is as sensible as possible. So-called topological feature maps (TFMs), such as Kohonen maps, are known in the prior art, for example. In the case of topological feature maps, the network attempts to distribute the input data as sensibly as possible over a predetermined number of classes; it is therefore used as a classifier. Classifiers attempt to subdivide a feature space, that is to say a set of input data, as sensibly as possible into a total of N sub-groups. In most cases, the number of sub-groups or classes is defined in advance. The word “sensible” admits a large number of interpretations. By way of example, one normal interpretation for a classifier would be: “form the classes such that the sum of the distances between the feature vectors and the class center points of the classes with which they are associated is as small as possible.” A criterion is thus introduced which is intended to be either minimized or maximized. The object of the classification algorithm is to carry out the classification process for this criterion and the given input data in the shortest possible time.
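The quoted minimization criterion corresponds to the objective of classical clustering algorithms such as k-means, which alternately assigns each feature vector to its nearest class center and recomputes the centers. The following is a minimal sketch with illustrative names and data, not a definitive implementation:

```python
import math

def kmeans(vectors, centers, iterations=10):
    """Iteratively reduce the sum of the distances between the feature
    vectors and the center points of the classes with which they are
    associated (iterations is assumed to be >= 1)."""
    for _ in range(iterations):
        classes = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(v, centers[i]))
            classes[nearest].append(v)
        # Recompute each class center as the mean of its class:
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else ctr
                   for cl, ctr in zip(classes, centers)]
    return centers, classes

# Two obvious groups of two-dimensional feature vectors:
vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, classes = kmeans(vectors, centers=[(0.0, 0.0), (1.0, 1.0)])
```

With the data above, the algorithm converges to the two group centers (0.0, 0.5) and (10.0, 10.5).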
Topological feature maps such as Kohonen maps allow a multi-dimensional feature space to be mapped into one with fewer dimensions, while retaining the most important characteristics. They differ from other classes of neural network in that no explicit or implicit output pattern is predetermined for an input pattern in the learning phase. During the learning phase, topological feature maps adapt themselves to the characteristics of the feature space being used. The link between a classical classifier and a self-organizing neural network or topological feature map (TFM) is that the output pattern of a topological feature map generally comprises a single energized neuron; the input pattern is associated with the same class as the energized output neuron. In the case of topological feature maps in which a plurality of neurons in the output layer can be energized, the neuron with the highest energization level is generally simply assessed as indicating the class associated with the input pattern. The continuous model of a classifier, in which a feature is associated with specific grades of a class, is thus changed to a discrete model.
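The winner-takes-all assessment described above, in which the input pattern is assigned to the class of the most strongly energized output neuron, can be sketched as follows; the weight vectors are illustrative assumptions:

```python
import math

def best_matching_unit(pattern, neuron_weights):
    """Return the index of the output neuron whose weight vector lies
    closest to the input pattern (Euclidean distance); in a topological
    feature map this neuron determines the class of the pattern."""
    return min(range(len(neuron_weights)),
               key=lambda i: math.dist(pattern, neuron_weights[i]))

# Two output neurons with illustrative weight vectors:
weights = [(0.0, 0.0), (1.0, 1.0)]
print(best_matching_unit((0.9, 0.8), weights))  # -> 1 (second neuron)
```

During the learning phase of a Kohonen map, the weight vector of this best-matching neuron (and of its topological neighbors) would additionally be pulled toward the input pattern, which is how the map adapts itself to the feature space.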
The use of Kohonen maps, inter alia, is known from the prior art. For example, the document XP002302269 by Farkas J., “Using Kohonen Maps to Determine Document Similarity”, discloses such a use. In this case, an area-specific vocabulary (“keywords”) is first of all set up for a given problem, and a problem-specific thesaurus is then constructed from it (in accordance with ISO 2788). However, this prior art has the disadvantage that the only terms which can be extracted from the documents to be classified are those which likewise occur in the constructed thesaurus. For this reason in particular, this system does not allow the problem solution to be automated. The vectors, of predetermined magnitude, which finally flow into a Kohonen network are then formed from the extracts mentioned. In this case, the classical Euclidean metric is used as a similarity measure. In another system from the prior art (Iritano S. and M. Ruffolo: “Managing the knowledge contained in electronic documents: a clustering method for text mining”, XP010558781), words are taken from the documents to be analyzed and are reduced to stem forms (lexicon analysis), and the frequencies of the various stem words in each document are determined. Predetermined words that are not of interest can be excluded in this case. The stem words (referred to as synonyms in the publication) are indexed for the search, and a specific clustering algorithm is finally applied which uses the overlap of words in the various documents as a similarity measure. If a restriction to English documents is applied, it is also possible to determine the sense on the basis of the Princeton University WordNet. One of the disadvantages of this prior art is that the method produces only abstract clusters which do not allow any sense to be determined without situation-specific human work; that is to say, even this system from the prior art does not allow effective automation of the method.
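An overlap-based similarity measure of the kind used by the clustering method just described can be sketched roughly as follows. This is a Jaccard-style formulation given purely for illustration; the exact measure used in the cited publication may differ:

```python
def stem_overlap_similarity(stems_a, stems_b):
    """Similarity of two documents, each given as a set of stem words,
    measured by the relative overlap of the two sets."""
    if not stems_a or not stems_b:
        return 0.0
    return len(stems_a & stems_b) / len(stems_a | stems_b)

# Illustrative stem-word sets for two documents:
doc1 = {"search", "engine", "index"}
doc2 = {"search", "index", "catalog"}
print(stem_overlap_similarity(doc1, doc2))  # -> 0.5
```

A clustering algorithm can then group documents whose pairwise overlap exceeds a chosen limit; as noted above, the resulting clusters remain abstract and carry no sense of their own.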
Furthermore, the restriction to the Princeton University WordNet as a knowledge base results in a constraint which, for example, does not allow a universal taxonomy or use with more than one language.
A further prior-art document, WO 03/052627 A1 by Semper Peter Paul et al., “Information Resource Taxonomy”, describes a method which determines the frequency with which words occur in documents and forms clusters in accordance with the “TACT specification” (PCT/AU01/00198). For noise-reduction purposes, phrases which occur frequently are determined in an initial phase and are then deleted if their frequency exceeds a specific limit. However, the patent specification relates essentially to a method for the automatic generation of cluster hierarchies, that is to say of hierarchically structured clusters of documents. The term “resource taxonomy” used in this patent specification relates to the arrangement of the document clusters (comparable to a hierarchical directory structure for sensible storage of the documents). In WO 03/052627 A1, “taxonomy” thus refers to a cluster structure of directories (directory structure). In contrast, in the present patent specification according to the invention, “taxonomy” refers to the content classification of words and terms. Finally, U.S. Pat. No. 6,711,585 B1, “System and Method for Implementing a Knowledge Management System”, from the inventors Copperman Max et al., discloses a method similar to WO 03/052627 A1, with the construction of a cluster hierarchy and the association of documents with, and their checking against, a specific cluster. The individual documents are in this case formally structured as “knowledge containers” (comprising meta data, taxonomy tags, marked content, original content and links). This prior art has the disadvantage, inter alia, that the cluster-formation process relates to individual documents, so that effective global recording of the terms that occur in the documents is impossible. Subject breakdown by further processing is thus precluded or greatly restricted. In particular, this prevents appropriate automation of the method.