The present invention relates to a string pattern analysis method and to a program product for string pattern analysis.
Providing the ability to search string patterns such as texts or biological sequence data is a commercially prosperous area. However, methods that are used successfully in, for instance, an internet environment cannot simply be transferred to enterprise environments. Additionally, content-oriented issues are becoming more and more interesting. Content-oriented methods are semantics-based and less dependent on internet-specific properties. Compared to link analyses and the like, these methods are far more complex and typically language dependent.
In most of today's computer systems, the text representation does not reflect the real chunks of which the text is composed. In particular, a sequence of words that always co-occurs is usually represented not as a chunk but as a set of distinct words. Knowing the real chunks in a text, i.e. long and very long substrings, however, is desirable for several reasons. It allows for a more compact representation of the text and for a better understanding of the text content, since the beginnings and endings of frequent chunks are important spots in the text. In particular, elements occurring adjacent to chunks are frequently related to each other, which, for example, would allow for an automatic detection of taxonomies. For reasons of complexity, however, prior art algorithms have problems finding the maximum substrings even in short texts, since the potential number of substrings explodes with the size of the text.
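The growth in the number of candidate substrings can be made concrete with a minimal sketch. It assumes character-level substrings and brute-force enumeration; the function names are hypothetical and are not taken from any prior art algorithm. A text of length n has n(n+1)/2 substring occurrences, so the candidate space grows quadratically before any frequency counting has even started.

```python
def count_substring_occurrences(text):
    """Number of (start, end) pairs, i.e. n * (n + 1) / 2 for length n."""
    n = len(text)
    return n * (n + 1) // 2


def distinct_substrings(text):
    """Enumerate every distinct non-empty substring by brute force.

    Illustrative only: time and memory are quadratic (or worse) in len(text).
    """
    n = len(text)
    return {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}


if __name__ == "__main__":
    for sample in ("abab", "abababab", "the cat sat on the mat"):
        print(len(sample),
              count_substring_occurrences(sample),
              len(distinct_substrings(sample)))
```

Even for a text of only 1,000 characters this amounts to 500,500 substring occurrences, which is why naive enumeration does not scale.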
A main task in content-oriented analyses is usually an adequate conceptualization, i.e. acquiring, as precisely as possible, the concepts which are handled in a text. Conceptualization may require finding a concept of a text in several steps (such as linguistic analysis, noun group determination, statistical relevance determination, etc.). When processing text for search or other tasks such as conceptualization (categorization) or clustering, the first step usually is to identify a basic set of terms on which higher-level components should operate. This process tries to identify, often using immediate context, meaningful parts of the overall text that can be considered “concepts” or at least “concept candidates.”
In most cases concepts are represented as noun groups in a language. In order to find noun groups in text, a syntactic analysis, which is language dependent and computationally expensive, may be required.
In most cases and across languages, noun groups are formed by consecutive elements of the text. In English, for example, noun groups are usually presented as a sequence of adjectives followed by a sequence of nouns. In German, noun groups are usually presented as a sequence of adjectives followed by a single (but potentially compound) noun. Noun groups should not all be treated as “concepts” outright, but only as “candidates.” Usually, some part of the noun group constitutes the concept (i.e. a class of objects) and the rest of the associated words function to identify a particular object or instance of the concept. Therefore, identifying noun groups is not enough to get to a concept level. Some type of contextual analysis is needed. Besides requiring enormous computing power, such analyses are often language dependent.
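The English pattern described above (a sequence of adjectives followed by a sequence of nouns) can be sketched as follows. This is a minimal illustration only. It assumes the tokens have already been part-of-speech tagged by some language-dependent component, and the tag names "ADJ" and "NOUN" as well as the function name are hypothetical. The output is a list of concept candidates, not concepts.

```python
def noun_group_candidates(tagged_tokens):
    """Collect runs matching ADJ* NOUN+ from a list of (word, tag) pairs.

    Assumption: tags are supplied by an external tagger; only "ADJ" and
    "NOUN" are interpreted, every other tag ends the current group.
    """
    groups = []
    current = []
    has_noun = False
    for word, tag in tagged_tokens:
        if tag == "ADJ" and not has_noun:
            # Leading adjectives extend a group that has no noun yet.
            current.append(word)
        elif tag == "NOUN":
            current.append(word)
            has_noun = True
        else:
            # Any other token (or an adjective after a noun) closes the group.
            if has_noun:
                groups.append(" ".join(current))
            current = [word] if tag == "ADJ" else []
            has_noun = False
    if has_noun:
        groups.append(" ".join(current))
    return groups


if __name__ == "__main__":
    sentence = [("the", "DET"), ("fast", "ADJ"), ("suffix", "NOUN"),
                ("tree", "NOUN"), ("uses", "VERB"), ("large", "ADJ"),
                ("memory", "NOUN")]
    print(noun_group_candidates(sentence))
```

The sketch relies entirely on the tagger, which is exactly the language-dependent and computationally expensive step the text refers to.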
However, even in applications such as genome analysis, where only a few letters are used as an “alphabet” to represent the essential components, time- and space-consuming scaling problems appear.
The paper of Stefan Kurtz & Chris Schleiermacher, REPuter: Fast Computation of Maximal Repeats in Complete Genomes, 15 BIOINFORMATICS 426 (1999), describes a software tool that computes exact repeats and palindromes in entire genomes. Deoxyribonucleic acid (DNA) is a long polymer made from repeating units called nucleotides, wherein the DNA double helix is held together by hydrogen bonds between four bases attached to the two strands. The four bases found in DNA are adenine (A), cytosine (C), guanine (G), and thymine (T). These four bases are attached to the sugar/phosphate in the strands to form the complete nucleotide. Although genomes in DNA can be represented by an alphabet of only four characters, i.e. the capital letters A, C, G, T, the analysis reveals inherent scaling problems. For instance, 160 MB of storage space is needed to analyze 11 MB of genome data. When handling 63 characters (26 capital letters, 26 lower case letters, 10 digits (0-9), and 1 whitespace), or even the 256 characters of 8-bit extended ASCII, the suffix tree in memory grows dramatically.
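The figures above can be turned into a rough back-of-the-envelope calculation. The sketch below computes the storage overhead per input byte from the 160 MB and 11 MB figures, and then shows how the per-node cost of one naive suffix tree layout scales with the alphabet size. The array-of-child-pointers layout and the 8-byte pointer size are assumptions made purely for illustration; they do not describe how REPuter stores its tree.

```python
def overhead_per_input_byte(index_mb, input_mb):
    """Storage needed for the analysis, per byte of input sequence."""
    return index_mb / input_mb


def child_array_bytes(alphabet_size, pointer_bytes=8):
    """Per-node cost of a naive layout with one child pointer per symbol.

    Assumption: 8-byte pointers and a fixed-size child array per node.
    """
    return alphabet_size * pointer_bytes


if __name__ == "__main__":
    print(round(overhead_per_input_byte(160, 11), 1))
    for size in (4, 63, 256):
        print(size, child_array_bytes(size))
```

Under these assumptions a node costs 32 bytes of child pointers for the DNA alphabet, 504 bytes for 63 characters, and 2,048 bytes for 256 characters, which illustrates why a larger alphabet inflates the tree.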