The invention relates to a string pattern conceptualization method and to a program product for string pattern conceptualization.
Searching in string patterns such as text or biological sequence data is a commercially prosperous area. However, methods that are used successfully in, for instance, an Internet environment cannot readily be transferred to enterprise environments. Additionally, content-oriented issues are becoming more and more interesting. Content-oriented methods are semantic-based and less dependent on Internet-specific properties. Compared to link analyses and the like, these methods are far more complex and typically language dependent.
In most of today's computer systems, the text representation does not reflect the real chunks of which the text is composed. In particular, a sequence of words that always co-occurs is usually not represented as a chunk, but as a set of separate words. Knowing the real chunks in a text (i.e., long and very long substrings), however, is desirable for several reasons. It allows for a more compact representation of the text and for a better understanding of the text content, since the beginnings and endings of frequently encountered chunks are important spots in the text. In particular, elements occurring adjacent to chunks are frequently related to each other, which would, for example, allow for an automatic detection of taxonomies.
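As a rough illustration of what a "chunk" means here, the following minimal sketch counts word sequences of a fixed length that recur in a text. The function name `repeated_word_sequences`, the fixed length `n`, and the whitespace tokenization are assumptions made for this example only; they are not taken from the method described in this document.

```python
from collections import Counter


def repeated_word_sequences(text: str, n: int, min_count: int = 2) -> dict:
    """Return every sequence of n consecutive words that occurs at least
    min_count times, mapped to its number of occurrences.

    Hypothetical helper: tokenization is plain whitespace splitting on
    lower-cased text, which is an assumption of this sketch.
    """
    words = text.lower().split()
    # Slide a window of n words over the text and count each window.
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}


sample = "the suffix tree grows and the suffix tree shrinks"
chunks = repeated_word_sequences(sample, 3)
```

In this sample, the only three-word sequence that recurs is "the suffix tree". A fixed window length is a simplification: the chunks of interest in the text above may be of any length, which is where the complexity problem arises.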
For reasons of complexity, however, prior art algorithms have problems in finding the maximum substrings even in short texts since the potential number of substrings explodes with the size of the texts.
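The growth in the number of candidate substrings can be made concrete with a short sketch. A string of length n has n(n+1)/2 non-empty contiguous substrings when counted by position, so merely enumerating them is already quadratic in the text size. The helper names below are hypothetical and serve only to illustrate this count.

```python
def substring_count(n: int) -> int:
    """Number of non-empty contiguous substrings (counted by position)
    of a string of length n: n choices of start, then remaining lengths."""
    return n * (n + 1) // 2


def distinct_substrings(text: str) -> set:
    """Brute-force enumeration of all distinct substrings.

    Only feasible for very short inputs; shown for illustration."""
    return {text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)}


small = distinct_substrings("abab")      # 7 distinct substrings
large = substring_count(1_000_000)       # candidates in a 1 MByte text
```

For a text of one million characters, `substring_count` yields 500,000,500,000 candidate substrings, which indicates why exhaustive enumeration is not practical.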
A main task in content-oriented analyses usually is an adequate conceptualization (i.e., acquiring the concepts which are handled in a text as precisely as possible). It is known in the art of conceptualization to find a concept of a text in several steps, such as linguistic analysis, noun group determination, statistical relevance determination, etc. When processing text for search or other tasks such as conceptualization, categorization, or clustering, the first step usually is to identify a basic set of terms that higher-level components should operate on. This process tries to identify meaningful parts of the overall text, often using immediate context, that may be considered as “concepts” or at least “concept candidates.”
In most cases, concepts are represented as noun groups in a language. In order to find noun groups in a text, a syntactic analysis is needed, which is language dependent and computationally expensive.
In most cases and across languages, noun groups are formed by consecutive elements of the text. In English, a noun group is usually a sequence of adjectives followed by a sequence of nouns; in German, it is a sequence of adjectives followed by a single (but potentially compound) noun. Not all noun groups should truly be considered “concepts”; they are only “candidates.” Usually, some part of the noun group constitutes the concept (i.e., a class of objects) and the rest serves to identify a particular object or instance of the concept. Therefore, identifying noun groups is not enough to get to a concept level. Some type of contextual analysis is needed. Besides requiring enormous computing power, such analyses are often language dependent.
However, even in applications such as genome analysis, where only a few letters are used as an “alphabet” to represent the essential components, time- and space-consuming scaling problems appear.
In the paper of S. Kurtz and C. Schleiermacher, “REPuter: fast computation of maximal repeats in complete genomes”, Bioinformatics Applications Notes, Oxford University Press, vol. 15, no. 5, 1999, p. 426-427, a software tool is described that computes exact repeats and palindromes in entire genomes. DNA (deoxyribonucleic acid) is a long polymer made from repeating units called nucleotides, wherein the DNA double helix is held together by hydrogen bonds between four bases attached to the two strands. The four bases found in DNA are adenine (abbreviated A), cytosine (abbreviated C), guanine (abbreviated G) and thymine (abbreviated T). These four bases are attached to the sugar/phosphate in the strands to form the complete nucleotide. Although genomes in DNA can be represented by an alphabet of only four characters (i.e., the capital letters A, C, G, T), the analysis reveals inherent scaling problems. For instance, 160 MByte of storage space are needed to analyze a genome of 11 MByte. For the handling of 63 characters (26 capital letters, 26 lower case letters, 10 digits (0-9) and 1 whitespace), or even 256 characters for an 8-bit character set such as extended ASCII, the suffix tree in the memory grows dramatically.
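To make the notion of an exact repeat over the four-letter DNA alphabet concrete, the following minimal sketch finds the longest substring that occurs at least twice by sorting all suffixes and comparing neighbours. This is a deliberately naive illustration and is not the algorithm used by REPuter, which relies on suffix trees; the function name `longest_repeated_substring` is hypothetical. The sketch materializes every suffix and therefore needs memory quadratic in the input length, so it is suitable only for short sequences.

```python
def longest_repeated_substring(text: str) -> str:
    """Return the longest substring occurring at least twice in text
    (occurrences may overlap), or "" if no character repeats.

    Naive approach: after sorting, suffixes that share a long prefix
    are adjacent, so only neighbouring pairs need to be compared.
    """
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best


repeat = longest_repeated_substring("ACGTACGTTT")
```

For the sample sequence, the longest repeat is "ACGT". As a point of reference, the figures quoted above (160 MByte for an 11 MByte genome) correspond to roughly 14.5 bytes of working memory per input character, even with a dedicated suffix tree implementation.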