A text is a sequence of symbols or items. Each symbol or item is called a character. The set of all possible characters for a domain is called the alphabet. One example domain is texts that are news stories represented in computer memory, with the ASCII character set as the alphabet. Another example domain is texts that are genetic sequences represented on a computer hard disk, with the alphabet consisting of characters A, C, G, and T.
A pattern specifies a set of character sequences. The set may be as simple as a single sequence, e.g., {GACACT}. Alternatively, the set of sequences may be more complex, e.g., “any sequence that begins with GAC, then has an intermediate sequence with length a multiple of three, then ends with ACT.”
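As a minimal sketch, the second example pattern above can be written as a regular expression. The alphabet A, C, G, T is taken from the genetic-sequence domain. The names PATTERN and matches are illustrative only. Treating an empty intermediate sequence (length zero) as a multiple of three is an assumption.

```python
import re

# Begins with GAC, then zero or more groups of three characters, then ends with ACT.
# Assumption: an intermediate sequence of length zero counts as a multiple of three.
PATTERN = re.compile(r"GAC(?:[ACGT]{3})*ACT")

def matches(sequence):
    """Return True if the whole sequence belongs to the set the pattern specifies."""
    return PATTERN.fullmatch(sequence) is not None
```

Under this assumption the single sequence GACACT from the simpler example also satisfies the more complex pattern.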
A counter contains one or more pattern-amount pairs. For example, a counter corresponding to baseball news may contain patterns such as the word “baseball” and the phrase “relief pitcher.” The amount for each pattern may correspond to how strongly the pattern indicates that a text is on the topic of baseball. The score for a counter on a text at a given start position is the sum of the amounts of those counter patterns that occur in the text beginning at that position. The score for a counter on a text is the sum of these per-position scores over all start positions. The start positions are determined by the domain. For example, for news text, the start of each word may be a start position. So if the counter pattern-amount pairs are (“ball”, 0.5) and (“relief pitcher”, 2.0) and the text is “The relief pitcher threw the ball,” then the score is 2.5: 2.0 for “relief pitcher” starting at the second word plus 0.5 for “ball” starting at the last word.
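The scoring rule can be sketched directly for the simple case in which every pattern is a single literal sequence. The function names word_starts and counter_score are hypothetical, and using a regular expression to find the start of each word is one assumed way of choosing start positions for news text.

```python
import re

def word_starts(text):
    """Assumed start positions for news text: the index where each word begins."""
    return [m.start() for m in re.finditer(r"\b\w", text)]

def counter_score(counter, text, starts):
    """Sum the amounts of the patterns that occur beginning at each start position.

    counter is a list of (pattern, amount) pairs with literal string patterns.
    """
    total = 0.0
    for pos in starts:
        for pattern, amount in counter:
            if text.startswith(pattern, pos):
                total += amount
    return total

counter = [("ball", 0.5), ("relief pitcher", 2.0)]
text = "The relief pitcher threw the ball,"
score = counter_score(counter, text, word_starts(text))
```

Here score is 2.5, matching the example. Because start positions fall only at word starts, the pattern “ball” does not contribute for a text consisting of the word “baseball.”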
Computing the counter score for a text is called counter evaluation. A counter may be evaluated using a finite-state machine. A finite-state machine contains states, one of which is a specified start state. The machine transitions from state to state based on the sequence of characters in the text. The rules for transitions from a state may be represented as a list of pairs, each of the form (character, next state). Each pair is called a transition. When the machine is in a state and encounters a character that appears in one of that state's transitions, the machine moves to the next state indicated by that transition. If the encountered character is not in any transition from the state, then the machine halts. Each state has a value. For a finite-state machine representing a counter, the value of a state that is reached immediately upon completing one or more counter patterns is the sum of the amounts corresponding to those patterns, and the values of all other states are zero. To perform a counter evaluation, apply the machine corresponding to the counter starting at each start position in the text. From each start position, begin in the start state and transition from state to state based on the character sequence until the machine halts or the text ends, accumulating in a sum the value of each state that the machine visits.
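The following is a minimal sketch of such a machine for literal patterns, built as a trie in which each state holds its transitions and its value. The names build_machine, evaluate, and word_starts are hypothetical, patterns are assumed to be non-empty literal strings, and the word-start rule for start positions is an assumption suited to news text.

```python
import re

def word_starts(text):
    """Assumed start positions: the index where each word begins."""
    return [m.start() for m in re.finditer(r"\b\w", text)]

def build_machine(counter):
    """Build (transitions, values) from (pattern, amount) pairs. State 0 is the start state."""
    transitions = [{}]   # transitions[s] maps a character to the next state
    values = [0.0]       # values[s] is the value of state s
    for pattern, amount in counter:
        state = 0
        for ch in pattern:
            nxt = transitions[state].get(ch)
            if nxt is None:
                transitions.append({})
                values.append(0.0)
                nxt = len(transitions) - 1
                transitions[state][ch] = nxt
            state = nxt
        values[state] += amount  # state reached upon completing this pattern
    return transitions, values

def evaluate(machine, text, starts):
    """Run the machine from every start position, summing the values of visited states."""
    transitions, values = machine
    total = 0.0
    for pos in starts:
        state = 0  # the start state has value zero because patterns are non-empty
        i = pos
        while i < len(text):
            nxt = transitions[state].get(text[i])
            if nxt is None:
                break  # no transition for this character: the machine halts
            state = nxt
            total += values[state]
            i += 1
    return total

counter = [("ball", 0.5), ("relief pitcher", 2.0)]
text = "The relief pitcher threw the ball,"
score = evaluate(build_machine(counter), text, word_starts(text))
```

For the example counter and text, score is 2.5. Because values are accumulated along the path, a counter that contains both “relief” and “relief pitcher” collects both amounts in a single pass from the same start position.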
A multi-counter contains a set of counters. The score for a multi-counter on a text is a list of scores, one for each counter in the multi-counter. Computing the multi-counter score for a text is called multi-counter evaluation. For example, multi-counter evaluation is useful for evaluating a news text for relevance to a set of topics, with each counter corresponding to a topic. Depending on topic relevance, a news story may be forwarded to people interested in the story. The combination of topic scores may also be used to detect a shift in the focus of news. Multi-counter evaluation is likewise useful for searching a genetic sequence for indications that the sequence codes for different protein families, with each counter corresponding to a protein family and each pattern corresponding to a code for a substructure commonly found among proteins of the family.
One method of multi-counter evaluation is to perform the counter evaluations one after another, one for each counter in the multi-counter. Another method is to use multiple computational resources to perform two or more counter evaluations at once. The first method is relatively slow, since the text is processed once per counter, and the second method is relatively resource-intensive.
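Both methods can be sketched for literal patterns as follows. All function names are hypothetical, the second counter in the usage lines is invented purely for illustration, and the thread pool stands in for "multiple computational resources." In CPython, threads do not run this pure-Python work on several cores at once, so a process pool or separate machines would be needed for a real speedup.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def word_starts(text):
    """Assumed start positions: the index where each word begins."""
    return [m.start() for m in re.finditer(r"\b\w", text)]

def counter_score(counter, text, starts):
    """Score one counter given as a list of (literal pattern, amount) pairs."""
    return sum(amount
               for pos in starts
               for pattern, amount in counter
               if text.startswith(pattern, pos))

def evaluate_sequential(multi_counter, text, starts):
    """First method: evaluate the counters one after another."""
    return [counter_score(c, text, starts) for c in multi_counter]

def evaluate_concurrent(multi_counter, text, starts, workers=2):
    """Second method: hand counters to a pool of workers (illustrative only)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves the order of the counters in the returned list
        return list(pool.map(lambda c: counter_score(c, text, starts), multi_counter))

multi_counter = [
    [("ball", 0.5), ("relief pitcher", 2.0)],  # counter from the earlier example
    [("threw", 1.0)],                          # hypothetical second counter
]
text = "The relief pitcher threw the ball,"
starts = word_starts(text)
sequential_scores = evaluate_sequential(multi_counter, text, starts)
concurrent_scores = evaluate_concurrent(multi_counter, text, starts)
```

Both functions return the same list of scores, one per counter. They differ only in how the work is scheduled.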