The processing of sequences of characters to analyze their grammatical structure is well-known, both for analyzing natural languages and computer languages. In the case of natural languages, the sequences of characters are broken down into words, each word forming a part of speech, such as noun, verb, adjective, adverb, preposition and so on. Thus, each word can be allocated a class according to its function in context.
For the processing of computer languages, it is well known to process the sequence of characters in a lexer to break the characters into a sequence of tokens and then to parse the tokens to create some form of internal representation, which can then be used in a compiler or an interpreter.
Such processing has previously been used to analyze sequences of characters to extract useful information from the sequence. For example, techniques have been developed to analyze blocks of text, such as e-mails or other data received by or input to a computer, to extract information such as e-mail addresses, telephone and fax numbers, physical addresses, IP addresses, days, dates, times, names, places and so forth. In one implementation, a so-called data detector routinely analyzes incoming e-mails to detect such information. The detected information can then be extracted to update the user's address book or other records.
Conventionally, such data detection is performed using a layered engine as shown in FIG. 1. The engine is embodied in a processor 1 and comprises a lexical analyzer or lexer 10 and a parser 20. The lexer 10 receives as its input a sequence of characters, such as the characters in an e-mail message. Note that the characters are not limited to letters or even numbers, but may include any other characters, such as punctuation.
The lexer 10 stores a vocabulary that allows it to resolve the sequence of characters into a sequence of tokens. Each token comprises a lexeme (analogous to a word) and a token type (which describes its class or function). One token type is provided for each predetermined function. As an example, a simple lexer 10 may include the following vocabulary:
    DIGIT:=[0-9] (A digit is a single number from 0 to 9)
    NUMBER:=DIGIT+ (A number is two or more digits together)
    LETTER:=[a-zA-Z] (A letter is an upper or lower case letter from A-Z)
    WORD:=LETTER+ (A word is two or more letters together)
The lexer 10 would break down the string of characters “There are 2 books and 15 magazines” into the following tokens:
    Lexeme       Token Type
    THERE        WORD
    ARE          WORD
    2            DIGIT
    BOOKS        WORD
    AND          WORD
    15           NUMBER
    MAGAZINES    WORD
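The tokenization above can be sketched in Python as follows. This is only an illustration, not the implementation of the lexer 10: the names `TOKEN_SPEC`, `MASTER` and `lex` are hypothetical, the patterns follow the parenthetical descriptions (a NUMBER or WORD needs at least two characters), and lexemes are upper-cased only to match the table.

```python
import re

# Patterns are tried in order. NUMBER precedes DIGIT and WORD precedes LETTER
# so that a run of two or more characters is not split into single tokens.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]{2,}"),
    ("DIGIT", r"[0-9]"),
    ("WORD", r"[a-zA-Z]{2,}"),
    ("LETTER", r"[a-zA-Z]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(text):
    """Return (lexeme, token_type) pairs; characters outside the vocabulary are skipped."""
    return [(m.group().upper(), m.lastgroup) for m in MASTER.finditer(text)]


tokens = lex("There are 2 books and 15 magazines")
for lexeme, token_type in tokens:
    print(lexeme, token_type)
```

Each printed line corresponds to one row of the table, in the same order.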
The parser 20 receives the sequence of tokens from the lexer 10. The parser 20 includes a grammar, which it uses to analyze the tokens to extract predetermined data. For example, if the engine 1 is intended to detect all quantities, the parser 20's grammar may be that:
    QUANTITY:=DIGIT WORD|NUMBER WORD
where “|” indicates “or”. Thus, on receiving the sequence of tokens from the lexer 10, the parser 20 will return the quantities “2 books” and “15 magazines”.
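A minimal sketch of how the QUANTITY rule might be applied to a token sequence is given below. The function name `find_quantities` and the representation of tokens as (lexeme, token type) pairs are assumptions made for illustration only.

```python
def find_quantities(tokens):
    """Scan adjacent token pairs for DIGIT WORD or NUMBER WORD."""
    found = []
    for (lexeme1, type1), (lexeme2, type2) in zip(tokens, tokens[1:]):
        if type1 in ("DIGIT", "NUMBER") and type2 == "WORD":
            found.append(f"{lexeme1} {lexeme2}")
    return found


# Token sequence for "There are 2 books and 15 magazines".
tokens = [
    ("There", "WORD"), ("are", "WORD"), ("2", "DIGIT"), ("books", "WORD"),
    ("and", "WORD"), ("15", "NUMBER"), ("magazines", "WORD"),
]
quantities = find_quantities(tokens)
print(quantities)  # ['2 books', '15 magazines']
```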
Commonly, both the lexer 10 and the parser 20 use a decision tree. An example of such a decision tree for a further example of a lexer 10 is shown in FIG. 2. In this case, the lexer 10 includes the following vocabulary:
    a:=1 9 [0-9]{2}
    b:=1 9 5
where ‘a’ and ‘b’ are two token types that the lexer 10 can ascribe to different lexemes. The decision tree in FIG. 2 shows 5 possible states in addition to the start state. As the lexer 10 processes a sequence of characters, it checks the first character in the sequence against the options available at the start state S and proceeds according to the result.
For example, if the lexer 10 is presented with the sequence of characters ‘1984’, it will process the character ‘1’ first. State S only allows the processing to proceed if the first character is ‘1’. This condition is met so character ‘1’ is consumed and processing proceeds to state 1, where the next character in the sequence (‘9’) is compared with the available conditions. It should be noted that state 1 is represented using a dotted circle. This is indicative that processing may not end at this state without the branch dying, as will become apparent later.
The only available condition at state 1 is that the next character is ‘9’. This condition is met, so character ‘9’ is consumed and processing proceeds to state 2.
The conditions at state 2 are that processing should proceed to state 3 if the next character is ‘5’, or that it should proceed to state 4 if the next character is any one of 0, 1, 2, 3, 4, 6, 7, 8 or 9. Again, state 2 is represented using a dotted circle and processing may not end at this state.
The next character is ‘8’, which meets the condition for processing to proceed to state 4, which is also represented by a dotted circle. Accordingly, the ‘8’ is consumed and processing continues. Since the next character in the sequence (‘4’) meets the only available condition from state 4, processing proceeds to state 5.
State 5 is represented by a solid circle, indicating that processing may end there. As shown in FIG. 2, state 5 has the property of reducing the consumed characters to a token of token type ‘a’. In our example, since all the characters have been used up and there are no more characters, processing ends at state 5 and the consumed sequence of characters is reduced to a token comprising the lexeme ‘1984’ and the token type ‘a’.
Similarly, the lexer 10 in FIG. 2 would process the sequence of characters ‘195’ as set out below. First, characters ‘1’ and ‘9’ would be consumed in the same manner as described above. However, at state 2, the next character is ‘5’. This meets the condition for proceeding to state 3, which has the property of reducing the consumed characters to a token of token type ‘b’. In this case, since all the characters have been used up and there are no more characters, processing ends at state 3 and the consumed sequence of characters is reduced to a token comprising the lexeme ‘195’ and the token type ‘b’.
By contrast, the lexer 10 in FIG. 2 would process the sequence of characters ‘1955’ as set out below. First, characters ‘1’, ‘9’ and ‘5’ would be consumed in the same manner as described above. However, at state 3, not all the characters have been used up. Rather, a further ‘5’ remains, which meets the condition for proceeding to state 5, where the consumed sequence of characters is reduced to a token comprising the lexeme ‘1955’ and the token type ‘a’.
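The three walks through FIG. 2 described above can be reproduced with a small transition table. The sketch below is an assumption-laden illustration: the names `TRANSITIONS`, `ACCEPTING` and `classify` are hypothetical, and the transitions from states 3 and 4 to state 5 are assumed to accept any digit, as implied by the vocabulary entry for ‘a’.

```python
DIGITS = "0123456789"

# (state, character) -> next state
TRANSITIONS = {("S", "1"): "1", ("1", "9"): "2", ("2", "5"): "3"}
TRANSITIONS.update({("2", ch): "4" for ch in DIGITS if ch != "5"})
TRANSITIONS.update({("3", ch): "5" for ch in DIGITS})  # assumed: any digit
TRANSITIONS.update({("4", ch): "5" for ch in DIGITS})  # assumed: any digit

# Solid-circle states and the token type each one reduces to.
ACCEPTING = {"3": "b", "5": "a"}


def classify(chars):
    """Return (lexeme, token_type) if processing may end after the last character, else None."""
    state = "S"
    for ch in chars:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return None  # no condition is met: the branch dies
    token_type = ACCEPTING.get(state)
    return (chars, token_type) if token_type else None


print(classify("1984"))  # ('1984', 'a')
print(classify("195"))   # ('195', 'b')
print(classify("1955"))  # ('1955', 'a')
print(classify("19"))    # None, since state 2 is a dotted-circle state
```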
Now consider a parser 20 including the following grammar:
    A:=a|ε
    E:=Acd|ce
where A and E are predetermined grammatical or data categories that we wish to detect; a, c, d and e are various token types; and ε represents a “nothing”. Thus, the parser 20 outputs a category A if either a lexeme with token type ‘a’ is presented or an unmatched token type is presented. Similarly, the parser 20 outputs an E when it processes Acd or ce. However, since the parser 20 outputs an A when presented with a token type ‘a’ or with a nothing, by substituting the equation for A into the equation for E, it can be seen that in practice the parser 20 outputs an E when it processes any of acd, cd and ce.
A decision tree for this grammar is shown in FIG. 3 and includes start state S, finish state F, and processing states 0-5. As the parser 20 processes a sequence of tokens, it checks the first token in the sequence against the options available at the start state S and proceeds according to the result.
For example, if the parser 20 is presented with the sequence of tokens comprising a token having token type c, followed by a token having token type e, the parser 20 must process the token-type sequence ‘ce’. The following table represents the processing that takes place.
    Current state    Sequence to process    Previous states
    S                ce                     (none)
    0                e                      S
    1                (none)                 S, 0
    S                E                      (none)
    F                (none)                 S
Put simply, proceeding from the start state S, the parser 20 consumes a ‘c’ and proceeds to state 0, and then consumes an ‘e’ and proceeds to state 1. State 1 allows processing to finish with the reduction to go back two states and replace the consumed letters by an ‘E’. Processing then returns to the start state S, where the E is processed. The E is consumed as processing proceeds to the finish state F. Thus, the token type sequence c followed by e is parsed as having the grammatical or data type E.
Similarly, the token sequence ‘acd’ is processed using the parsing tree shown in FIG. 3 as shown in the following table:
    Current state    Sequence to process    Previous states
    S                acd                    (none)
    5                cd                     S
    S                Acd                    (none)
    2                cd                     S
    3                d                      S, 2
    4                (none)                 S, 2, 3
    S                E                      (none)
    F                (none)                 S
Here, the first token type to be parsed is ‘a’. Starting at start state S, the ‘a’ is consumed and processing proceeds to state 5, which has the reduction to go back one state and replace the consumed items with an ‘A’. Thus, the sequence is changed from ‘acd’ to ‘Acd’ and processing returns to state S, where the A is consumed and processing proceeds to state 2. Next, as processing proceeds along the middle branch of the tree to states 3 and 4, the c and the d are consumed. At state 4, the consumed sequence Acd is replaced by an E and processing returns to state S, where the E is processed. The E is consumed as processing proceeds to the finish state F. Thus, the token type sequence a followed by c followed by d is also parsed as having the grammatical or data type E.
Similarly, the token sequence ‘cd’ is processed using the parsing tree shown in FIG. 3 as shown in the following table:
    Current state    Sequence to process    Previous states
    S                cd                     (none)
    S                Acd                    (none)
    2                cd                     S
    3                d                      S, 2
    4                (none)                 S, 2, 3
    S                E                      (none)
    F                (none)                 S
Here, the first token type to be parsed is ‘c’. Starting at start state S, the ‘c’ is consumed and processing proceeds to state 0. The next token type to be parsed is a ‘d’, but state 0 does not provide an option for proceeding with this token type. Moreover, state 0 is represented by a dotted circle, indicating that processing cannot finish at that state. Accordingly, this branch is a “dead” branch and processing reverts with the entire sequence intact to the start state S. This state is provided with the reduction that an ‘A’ must be placed at the front of the sequence. Thus, the sequence to be parsed is now ‘Acd’. This is the same sequence as is generated during processing of the sequence acd above, and processing proceeds in exactly the same way. Thus, the token sequence c followed by d is also parsed as having the grammatical or data type E.
In particular, the example illustrates how the epsilon symbol is handled. Specifically, an additional path is provided, the additional path comprising a link between the start state S and state 5. This path is taken when the first token is an ‘a’, which is consequently consumed and replaced with an ‘A’.
Importantly, when all of the conditions of the start state S lead to a dead branch, the reduction associated with the start state S is performed. This reduction involves producing a new token (in this case an ‘A’) and adding it to the front of the sequence of tokens without first consuming a token. Put another way, in this reduction the sequence of tokens is revised by adding a token to the beginning and then comparing the revised sequence with the conditions of the same state. This type of reduction is known as an epsilon reduction.
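The parsing of ‘ce’, ‘acd’ and ‘cd’, including the epsilon reduction, can be simulated with the sketch below. It is an illustration only: the transitions are inferred from the description of FIG. 3 rather than taken from the figure itself, and the names `SHIFT`, `REDUCE`, `EPSILON` and `parses_as_E` are hypothetical.

```python
# (state, symbol) -> next state, inferred from the description of FIG. 3.
SHIFT = {
    ("S", "c"): "0", ("0", "e"): "1",
    ("S", "a"): "5",
    ("S", "A"): "2", ("2", "c"): "3", ("3", "d"): "4",
    ("S", "E"): "F",
}
# state -> (symbol produced, number of states to go back)
REDUCE = {"1": ("E", 2), "4": ("E", 3), "5": ("A", 1)}
# state -> symbol placed at the front of the sequence by an epsilon reduction
EPSILON = {"S": "A"}


def parses_as_E(token_types):
    """Return True if the token-type sequence is reduced to E and reaches state F."""
    seq = list(token_types)
    # Stack entry: [state, sequence on entering the state, epsilon reduction used?]
    stack = [["S", list(seq), False]]
    while True:
        state = stack[-1][0]
        if state == "F" and not seq:
            return True
        if state in REDUCE:
            symbol, back = REDUCE[state]
            del stack[-back:]      # go back the given number of states
            seq = [symbol] + seq   # consumed items are replaced by the symbol
            continue
        if seq and (state, seq[0]) in SHIFT:
            next_state = SHIFT[(state, seq[0])]
            seq = seq[1:]          # consume one item
            stack.append([next_state, list(seq), False])
            continue
        # Dead branch: revert to the nearest state with an unused epsilon reduction.
        while stack and (stack[-1][0] not in EPSILON or stack[-1][2]):
            stack.pop()
        if not stack:
            return False
        entry = stack[-1]
        entry[2] = True
        seq = [EPSILON[entry[0]]] + entry[1]  # intact sequence with the new token in front


print(parses_as_E("ce"), parses_as_E("acd"), parses_as_E("cd"))  # True True True
```

For ‘cd’, the branch through state 0 dies, the sequence is restored and becomes ‘Acd’, and processing then follows the same path as for ‘acd’. Marking the epsilon reduction as used ensures that the simulation cannot prepend an ‘A’ indefinitely.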
In this way, it can be seen that the parsing tree shown in FIG. 3 is consistent with the grammar:
    A:=a|ε
    E:=Acd|ce
The foregoing is a simple explanation of the basic functionality of lexers 10 and parsers 20. This functionality can be adapted to detect predetermined types of data from a sequence of characters, for example in an e-mail or a block of text. Imagine that it is intended to detect either a time or a bug identification code in a block of text. In the following example, the format of a time to be detected is that it is always one of AM, PM, A or P followed by two digits, whereas the format of a bug identification code to be detected is always two letters followed by three digits. Accordingly, the lexer 10 may be provided with the vocabulary:
    INITIALS:=[A-Z]{2} (INITIALS is any two letters together)
    MERIDIAN:=(A|P)M? (MERIDIAN is the letter A or the letter P, optionally followed by the letter M)
    DIGIT:=[0-9] (DIGIT is any character from 0 to 9)
whereas the parser 20 may be provided with the grammar:
    BUG_ID:=INITIALS DIGIT{3} (INITIALS token followed by 3 DIGIT tokens)
    TIME:=MERIDIAN DIGIT{2} (MERIDIAN token followed by 2 DIGIT tokens)
In more detail, the lexer 10 will output a sequence of a letter from A to Z followed by another letter from A to Z as a token having a lexeme of the two letters and having the token type INITIALS. It will also output the letters AM and PM as a token having the token type MERIDIAN. In this notation ‘?’ indicates that the preceding character(s) may or may not be present. Thus, the lexer 10 will also output the letter A alone, or the letter P alone as a token having the token type MERIDIAN.
FIG. 4 shows a decision tree of the lexer 10 and FIG. 5 shows a decision tree of the parser 20. As will be clear from following the decision tree shown in FIG. 4, the lexer 10 will process the sequence of characters AM02 and output four tokens. The first is a token having the lexeme AM and the token type INITIALS, while the second is a token also having the lexeme AM, but this time the token type MERIDIAN. This is consistent with the vocabulary used by the lexer 10, since the letters AM can be either INITIALS or a MERIDIAN. The third and fourth tokens have the lexemes ‘0’ and ‘2’ respectively and each has the token type DIGIT. These tokens are then operated on by the parser 20.
As noted above, the two tokens both have the lexeme AM and the respective token types INITIALS and MERIDIAN. Accordingly, when the character string AM occurs, two sequences of tokens are processed by the parser 20 using the decision tree shown in FIG. 5. One sequence of tokens meets the first condition of the starting state, while the other sequence of tokens meets the other condition. Accordingly both conditions or branches are investigated, either in turn or in parallel.
In the case of the left-hand INITIALS branch, the processing proceeds to state 1 and then states 2 and 3, since the next two tokens have the token type DIGIT. However, the parser 20 then runs out of tokens to parse and so cannot proceed to state 4. Since state 3 is represented by a dotted circle, processing cannot end there and so a BUG_ID is not detected.
In the case of the right-hand MERIDIAN branch, the processing proceeds to state 5 and then states 6 and 7, since the next two tokens have the token type DIGIT. At state 7 it is determined that the sequence of tokens MERIDIAN followed by DIGIT and DIGIT represents TIME. In this way, a time is detected.
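The investigation of both branches can be sketched as follows. This is a simplified illustration under stated assumptions: the names `VOCABULARY`, `GRAMMAR`, `tokens_at` and `detect` are hypothetical, each branch is simply tried in turn, and detection is only attempted from the first character of the input.

```python
import re

VOCABULARY = {
    "INITIALS": re.compile(r"[A-Z]{2}"),
    "MERIDIAN": re.compile(r"[AP]M?"),
    "DIGIT": re.compile(r"[0-9]"),
}
GRAMMAR = {
    "BUG_ID": ["INITIALS", "DIGIT", "DIGIT", "DIGIT"],
    "TIME": ["MERIDIAN", "DIGIT", "DIGIT"],
}


def tokens_at(text, pos):
    """All (token_type, lexeme) alternatives that start at position pos."""
    alternatives = []
    for token_type, pattern in VOCABULARY.items():
        m = pattern.match(text, pos)
        if m:
            alternatives.append((token_type, m.group()))
    return alternatives


def detect(text):
    """Follow each grammar branch in turn from the start of text."""
    results = []
    for category, rule in GRAMMAR.items():
        pos = 0
        for wanted in rule:
            lexemes = [lx for tt, lx in tokens_at(text, pos) if tt == wanted]
            if not lexemes:
                break  # dead branch: nothing is detected for this category
            pos += len(lexemes[0])
        else:
            results.append((category, text[:pos]))
    return results


print(tokens_at("AM02", 0))  # [('INITIALS', 'AM'), ('MERIDIAN', 'AM')]
print(detect("AM02"))        # [('TIME', 'AM02')]
```

The BUG_ID branch dies because no third DIGIT token is available, whereas the TIME branch completes.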
In real-life situations, it is sometimes possible to detect two different types of information (e.g. TIME and BUG_ID) from the same sequence of characters, for example where the results overlap. For instance, in the BUG_ID/TIME example, consider the character sequence “AM12” within “AM123”. Within “AM123” we could recognize both a time (characters 1 to 4) and a bug identification code (characters 1 to 5). In such an event, it is common practice to provide an additional filter to determine which of the two detected types of information is more likely to be the correct one. One commonly used heuristic that has proven efficient is to keep only the longest result, in this case the bug identification code.
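One possible form of such a longest-result filter is sketched below. The function name `keep_longest` and the representation of results as (category, start, end) spans are assumptions for illustration; the spans are zero-based with an exclusive end, so characters 1 to 4 become (0, 4).

```python
def keep_longest(results):
    """Keep the longest result among overlapping (category, start, end) spans."""
    kept = []
    # Consider longer results first so that they win over shorter overlapping ones.
    for candidate in sorted(results, key=lambda r: r[2] - r[1], reverse=True):
        _, start, end = candidate
        if all(end <= k_start or start >= k_end for _, k_start, k_end in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda r: r[1])


# "AM123": a TIME over characters 1 to 4 and a BUG_ID over characters 1 to 5.
filtered = keep_longest([("TIME", 0, 4), ("BUG_ID", 0, 5)])
print(filtered)  # [('BUG_ID', 0, 5)]
```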
As another example, the parser may be provided with the grammar    ADDRESS:=name? company? street
Accordingly, to detect an address, it is only necessary for a street to be present, a name and/or a company in front of the street being optional. Thus, an epsilon reduction is required for both the name and company. Using the tokens a, b and c, the grammar can be rewritten as
    a:=name|ε
    b:=company|ε
    c:=street
    ADDRESS:=abc
FIG. 6 shows a corresponding decision tree for the parser 20, which determines that an address has been detected when it reaches state F. In this case, the fact that the “name” token is optional is handled by the path from the starting state S to state 1, the reduction for state 1 and the epsilon reduction for starting state S. Similarly, the fact that the “company” token is optional is handled by the path from state 2 to state 5, the reduction for state 5 and the epsilon reduction for state 2.
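The effect of the two optional tokens can be illustrated with the short sketch below. It checks a token-type sequence directly against the rewritten grammar rather than walking the decision tree of FIG. 6, and the function name `is_address` is hypothetical.

```python
def is_address(token_types):
    """Return True if token_types matches name? company? street."""
    seq = list(token_types)
    for optional in ("name", "company"):
        if seq and seq[0] == optional:
            seq.pop(0)  # the optional token is present, so it is consumed
        # otherwise the epsilon alternative applies and nothing is consumed
    return seq == ["street"]


print(is_address(["name", "company", "street"]))  # True
print(is_address(["street"]))                     # True
print(is_address(["company", "name", "street"]))  # False
```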
Such a methodology can be applied to many different types of grammar and data structures and has previously been found to be particularly successful in extracting predetermined types of data from sequences of characters. However, in view of the increasing calls on the processors of user and server computers to carry out numerous tasks (including data detection), combined with the increasing volume of information that needs to be scanned and the increasingly complex and numerous types of information it is desired to detect, it is desirable to increase the speed with which such data detection can be carried out.