1. Field of the Invention
The present invention relates to a method and an apparatus for searching for a pattern including a repeating pattern from among a large amount of data on the basis of an order possessed by each datum.
2. Description of the Related Art
A conventional technology for handling repeating patterns in ordered data is pattern matching using a regular expression, which describes the order of appearance of characters in a character string.
The features and application fields of this pattern matching are as follows.
In the field of character string search, pattern matching using a regular expression has long been used, with attention paid to the order in which characters appear. First, the difference between character string search and pattern matching is clarified. In character string search, the pattern to be searched for is completely defined, as in “search for the pattern ‘abc’ in a sentence”. Pattern matching, by contrast, is an operation of searching for an indefinite pattern, and it is also called pattern collation. In pattern matching, a regular expression is used to designate the pattern.
FIG. 1A shows an example of pattern matching between character string data and a regular expression ‘a(a|b)*a’. Here, ‘(a|b)*’ means that ‘a’ or ‘b’ appears repeatedly zero or more times. Character string search and pattern matching seem to be the same process at first glance, but they belong to different categories. Therefore, a different algorithm must be applied to each of them.
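The distinction can be illustrated with a minimal Python sketch, assuming the standard `re` module stands in for a generic regular expression engine. The function names `matches_whole` and `literal_search` are hypothetical and are introduced only for this illustration.

```python
import re

# The pattern from the text: 'a', then any run of 'a' or 'b', then a final 'a'.
PATTERN = re.compile(r"a(a|b)*a")


def matches_whole(text: str) -> bool:
    """Pattern matching: True if the entire string matches a(a|b)*a."""
    return PATTERN.fullmatch(text) is not None


def literal_search(text: str, needle: str = "abc") -> int:
    """Character string search: index of a fully specified substring, or -1."""
    return text.find(needle)
```

The first function accepts an unbounded family of strings ('aa', 'aba', 'abba', and so on), whereas the second looks for exactly one fixed string.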
Pattern matching using a regular expression is realized with a finite automaton. A two-step approach is taken to convert a regular expression into an automaton. First, the regular expression is converted into an NFA (Non-deterministic Finite Automaton). This conversion from a regular expression into an NFA is easy, and pattern matching can be performed using the NFA alone. In many cases, however, the obtained NFA is converted into an equivalent DFA (Deterministic Finite Automaton), and pattern matching is then performed using this DFA.
In a DFA, if input is decided in a specific state, only one transition destination is determined, as the term “deterministic” indicates. In an NFA, on the contrary, if input is decided in a specific state, a plurality of transition destinations might exist as the term “non-deterministic” indicates.
FIG. 1B shows an NFA corresponding to the regular expression ‘a(a|b)*a’. Assume that the character string ‘aaa’ is given to this NFA. When the first character ‘a’ is input, the NFA makes a transition from the starting state 0 to state 1. The second character is also ‘a’, but there are two possible transition destinations for it: state 1 and state 2. The correct choice turns out to be a transition from state 1 to state 1 for the second character ‘a’ and a transition from state 1 to state 2 for the third character ‘a’. At the moment the second character ‘a’ is read in, however, it cannot be determined which state to transit to.
In order to solve this problem, a process is needed in which the NFA first makes a transition to one of the candidate states and, if that attempt fails, goes back and makes a transition to the other state (a back track). When a back track is used, however, extra time is required to return to the earlier state.
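The back track described above can be sketched in Python as follows. The transition table is a hypothetical encoding of an NFA for ‘a(a|b)*a’ that is consistent with the description of FIG. 1B (start state 0, accepting state 2, two candidate moves from state 1 on ‘a’); the figure itself may differ in detail. The optional `trace` list is an assumption added only to make the extra work visible.

```python
# Hypothetical encoding of an NFA for a(a|b)*a.
# State 0 is the start, state 2 is accepting,
# and state 1 has two possible destinations on 'a'.
NFA = {
    (0, "a"): [1],
    (1, "a"): [1, 2],
    (1, "b"): [1],
}
ACCEPTING = {2}


def nfa_match_backtracking(text, state=0, pos=0, trace=None):
    """Try each candidate transition in turn; go back when a choice fails."""
    if trace is not None:
        trace.append((state, pos))  # record every (state, position) visited
    if pos == len(text):
        return state in ACCEPTING
    for nxt in NFA.get((state, text[pos]), []):
        if nfa_match_backtracking(text, nxt, pos + 1, trace):
            return True
    # Every candidate failed: the caller must back track to its next choice.
    return False
```

For the input ‘aaa’, this sketch first tries state 1 for the third character, reaches the end of the input in a non-accepting state, and only then goes back and tries state 2.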
For this reason, pattern matching is usually performed not by directly using the NFA obtained from a regular expression, but by further converting the NFA into a DFA. In a DFA, unlike in an NFA, exactly one transition destination is determined once a state and an input are decided. Therefore, a DFA does not require a back track, and the process can be performed at high speed.
For example, the NFA of FIG. 1B is converted into a DFA as shown in FIG. 1C. Here, only one transition destination is determined for the character ‘a’, and the ambiguity seen in the NFA does not arise. Therefore, a back track is no longer required. Needless to say, it takes time to convert an NFA into a DFA beforehand. When pattern matching is performed on a large amount of data, however, the overall speed of the process improves, since the time saved by a DFA that requires no back track outweighs the conversion cost.
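One common way to perform such a conversion is the subset construction, sketched below in Python. The NFA table is the same hypothetical encoding of ‘a(a|b)*a’ used for illustration and is not taken from FIG. 1B; the resulting DFA is not claimed to be identical to FIG. 1C. The names `nfa_to_dfa` and `dfa_match` are assumptions.

```python
from collections import deque

# Hypothetical NFA for a(a|b)*a: start state 0, accepting state 2.
NFA = {(0, "a"): {1}, (1, "a"): {1, 2}, (1, "b"): {1}}
NFA_START, NFA_ACCEPT, ALPHABET = 0, {2}, "ab"


def nfa_to_dfa():
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = frozenset({NFA_START})
    table, seen, queue = {}, {start}, deque([start])
    while queue:
        current = queue.popleft()
        for symbol in ALPHABET:
            target = frozenset(
                t for s in current for t in NFA.get((s, symbol), ())
            )
            if not target:
                continue  # no transition: the DFA rejects on this input
            table[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    accepting = {s for s in seen if s & NFA_ACCEPT}
    return start, table, accepting


def dfa_match(text):
    """Run the DFA: one table lookup per character, never a back track."""
    state, table, accepting = nfa_to_dfa()
    for ch in text:
        state = table.get((state, ch))
        if state is None:
            return False
    return state in accepting
```

In this sketch the conversion is repeated on every call for brevity; in practice the table would be built once and reused, which is where the up-front cost is recovered.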
A regular expression is recursively defined by three basic operations (operators), namely connection (concatenation), selection (union), and repetition (closure), as shown in FIG. 1D. There are operational priorities among these operations, much as in an ordinary arithmetic expression. The repetition ‘*’ binds most strongly, the connection binds second most strongly, and the selection ‘|’ binds most weakly. However, the priority can also be changed by enclosing characters or symbols within parentheses.
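The effect of these priorities can be checked with a short Python sketch, assuming the standard `re` module, which applies the same ordering (repetition, then connection, then selection). The helper name `full` is hypothetical.

```python
import re


def full(pattern: str, text: str) -> bool:
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# 'ab*|c' is read as (a(b*))|c: '*' applies to 'b' only,
# and '|' separates 'ab*' from 'c'.
EXAMPLES = {
    ("ab*|c", "abbb"): full("ab*|c", "abbb"),
    ("ab*|c", "abab"): full("ab*|c", "abab"),
    # Parentheses change the grouping: the repetition now covers 'ab'.
    ("(ab)*|c", "abab"): full("(ab)*|c", "abab"),
    # Parentheses here make the selection apply inside the connection.
    ("a(b*|c)", "ac"): full("a(b*|c)", "ac"),
}
```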
However, there are the following problems in the above-mentioned conventional search process.
A regular expression for character strings, and pattern matching using such an expression, are general frameworks that provide a search method for character strings of any class. However, they cannot be applied to ordered data, since the characteristics of such data differ from those of character string data in the following respects (1) to (3).
    (1) In a character string, all the characters are adjacent to each other at regular intervals. In ordered data, however, a plurality of events may exist at one position. For example, this is the case for a client who shops for several items on the same day. In this case, an expression such as “a client 10001 purchases two commodities, milk and bread, on March 21” is required. A regular expression, however, cannot express events that occur simultaneously, since two characters cannot appear in the same position in a character string.
    (2) In a character string, a value and a symbol (literal) are equal. Namely, when ‘A’ is given as a character string, ‘A’ indicates both a value and a symbol. In data consisting of a plurality of attributes, however, a combination of conditions on a plurality of fields must be handled as one symbol. For example, a client who purchases commodities such as a PC and a TV is called a “client group A”. A regular expression, however, cannot express such a combination of conditions on a plurality of fields.
    (3) In ordered data, the concept of an interval between orders becomes necessary, as in “the total number of days between the purchase of a PC and that of a TV is within three months”. In a regular expression for character strings, however, an interval cannot be designated.
A technology for handling ordered data is described in an earlier-filed Japanese Patent Application No. 2001-340817 (U.S. patent application Ser. No. 10/092,444), “Searching Apparatus and Searching Method Using Pattern of which Sequence is Considered”. In this application, the data to be processed consists of a set of records, each with a plurality of fields (attributes). It is assumed that every record has the same predetermined number of fields; the case where only a specific record has a different number of fields is not considered. Furthermore, it is assumed that one or more fields with an order are included in the data.
A field with an order means a field that has an order relation in advance, such as a date or a time, or a field on which an order is imposed by rearranging (sorting) the data, such as a client ID (client identifier) field. Since the combination of a date field and a time field can be regarded as one order field, the order of a record may in some cases be represented by combining a plurality of fields.
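A minimal Python sketch of such a combined order follows. The `Record` layout, the field names, and the helper functions are assumptions made for illustration and do not come from the earlier application. Sorting uses the tuple (client ID, date, time) as one order key, and grouping by (client ID, date) shows that several records can share a single position in the order.

```python
from dataclasses import dataclass
from datetime import date, time
from itertools import groupby


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout: every record has the same fields.
    client_id: int
    purchase_date: date
    purchase_time: time
    commodity: str


def order_key(record):
    """Combine several fields into a single order for the record."""
    return (record.client_id, record.purchase_date, record.purchase_time)


def sort_by_order(records):
    return sorted(records, key=order_key)


def group_by_position(records):
    """Map (client, date) to the commodities recorded at that position."""
    ordered = sort_by_order(records)
    return {
        position: [r.commodity for r in group]
        for position, group in groupby(
            ordered, key=lambda r: (r.client_id, r.purchase_date)
        )
    }
```

With records for client 10001 buying milk and bread on the same day, `group_by_position` returns both commodities under one key, which is the situation a character string cannot represent.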
When a pattern with an order is searched for from the target data, the pattern is designated by an event definition and an inter-event definition.
An event definition is created by uniquely naming a condition designated for one or more fields. In the case where a condition is designated for one field, it can be defined, for example, that “a client who purchased a PC as a commodity is called ‘a client group A’”. Furthermore, in the case of designating a condition for a plurality of fields, the combination of conditions of a plurality of fields is handled as one symbolic label (literal), for example, as “a client who purchased a PC and a camera as commodities is called ‘a client group A’”.
Specifically, an event can be defined as the label of a record that satisfies the conditions for one or more fields. Furthermore, an event can designate a condition that matches any pattern like a wildcard in a regular expression.
FIG. 1E shows an example of an event definition defined as “a client who purchased a PC and a camera as commodities is called ‘a client group A’”. As already described, the combination of conditions on a plurality of fields cannot be expressed by a regular expression for character strings.
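An event definition of this kind can be sketched in Python as a named set of per-field conditions. The class `EventDefinition`, the field names `commodity` and `price`, and the price threshold are hypothetical and serve only to illustrate how a combination of conditions on several fields is handled as one label.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class EventDefinition:
    """A label attached to records that satisfy every field condition."""
    label: str
    conditions: Dict[str, Callable[[Any], bool]] = field(default_factory=dict)

    def matches(self, record: Dict[str, Any]) -> bool:
        return all(
            name in record and test(record[name])
            for name, test in self.conditions.items()
        )


# Hypothetical two-field event: the label 'A' stands for the combination.
GROUP_A = EventDefinition(
    "A",
    {
        "commodity": lambda c: c == "PC",
        "price": lambda p: p >= 1000,  # assumed threshold, for illustration
    },
)

# An event with no conditions behaves like a wildcard: it matches any record.
ANY = EventDefinition("ANY")
```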
In addition to an event definition, an inter-event definition describes a relation between one event and another event utilizing an event definition. In the case of the inter-event definition, the condition where there are a plurality of events with the same order or the condition where the interval between orders is not constant (the condition where orders are described at an arbitrary interval) is also conceivable.
If it is assumed that a client who purchased a PC and a camera as commodities is called a client group ‘A’ and a client who purchased a TV and a VTR as commodities is called a client group ‘B’, an inter-event definition such as “the interval between ‘A’ and ‘B’ is within three visits to the shop” is conceivable. A definition such as “the interval between ‘A’ and ‘B’ is within three days” (the difference between the date field of ‘A’ and that of ‘B’ is within three days) is also conceivable.
Furthermore, the restriction covering an event and another event can be also described for fields with no order. For example, a definition, such as “the price of ‘A’ is higher than that of ‘B’” is conceivable. Furthermore, it is possible to designate a condition by an inter-event definition even in the case where an event definition is designated by a wildcard that matches any pattern.
FIG. 1F shows an example of an inter-event definition. In this example, the interval between event ‘A’ and event ‘B’ is within three days, and an interval between event ‘A’ and an event ‘C’ is within five days.
In a regular expression, the expression ‘a..b’, which uses ‘.’ to match any character, means that ‘b’ appears after a sequence of three characters beginning with the literal ‘a’. This differs from an event definition, in which a condition is designated for one or more fields.
Furthermore, the fact that the relation between arbitrary events can be defined means that a matching pattern to be searched for can be expressed by a graph structure. In the example of FIG. 1F, an inter-event restriction exists between an event definition 1 and an event definition 2, while the inter-event restriction exists between an event definition 1 and an event definition 3.
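The inter-event definitions of the FIG. 1F example (‘A’ to ‘B’ within three days, ‘A’ to ‘C’ within five days) can be sketched in Python as edges of a small graph over event labels. The representation and the function name `satisfies` are assumptions, and the sketch treats "within N days" as an absolute difference between dates, which is one possible reading.

```python
from datetime import date, timedelta

# Each edge of the pattern graph: (event label, event label, maximum gap).
INTER_EVENT = [
    ("A", "B", timedelta(days=3)),
    ("A", "C", timedelta(days=5)),
]


def satisfies(dates, constraints=INTER_EVENT):
    """dates maps each event label to the date of its matched record.

    Assumption: 'within N days' is taken as an absolute difference.
    """
    return all(abs(dates[y] - dates[x]) <= gap for x, y, gap in constraints)
```

Because the constraints connect arbitrary pairs of events rather than only neighbours in a sequence, the pattern forms a graph and not a simple chain.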
By designating a pattern using an event definition and an inter-event definition, a pattern designation search for an ordered record group can be realized. If a pattern includes a repetition designation, however, in some cases, a back track is required in the search process. The following is the explanation of this problem using the multidimensional data (data of a plurality of fields) shown in FIG. 1G, as an example.
In the data shown in FIG. 1G, RID is a record identifier, and each record possesses three fields, namely, purchase date, commodity, and price. The order is defined by the purchase date. Consecutive purchase dates do not mean consecutive calendar dates; rather, the next purchase date is assumed to be the date on which the client next comes to the store. A search pattern query (event pattern) is given as follows:
Event Definition
    Event1: commodity=B
    Event2: commodity=C
Order
    (Event1+)−Event2
Inter-Event Definition
    Event2.purchase date<=Event1.purchase date+2 days
In this event pattern, the order indicates that Event2 occurs after Event1 has occurred one or more times in succession. The inter-event definition indicates that the interval between the purchase date of Event1 and that of Event2 is within two days. In the search process of the above-mentioned prior application, an inquiry pattern as shown in FIG. 1H is generated from this event pattern, and the process proceeds using two pointers: a pointer DP to the data and a pointer PP to the inquiry pattern.
Regarding PP, FIG. 1H shows that Event1 (commodity=B) repeats one or more times if PP=1, while Event2 (commodity=C) is pointed to by the pointer if PP=2. In this case, since the order of appearance of Event2 is after that of Event1, the inter-event definition is added in the location of PP=2.
The initial state DP=2001/01/13 and PP=1 is set by an initialization process. First, it is checked whether any datum among the data of DP=2001/01/13 matches the pattern commodity=B described by the event definition at PP=1. Since the record R1 matches the pattern, the pointer DP is incremented and becomes 2001/01/15. For PP, two cases exist: a case where Event1 is repeated again and a case where a record matches commodity=C of Event2. Where the process branches into two or more cases like this, a back track for checking the alternative branch destination becomes necessary if the selected branch is executed and found not to be correct.
For example, in order to check whether or not PP=1 is continuously repeated, it is checked whether or not the pattern of the event definition of PP=1 matches the data of DP=2001/01/15. In this case, since the record R5 matches the condition of PP=1, DP becomes 2001/01/16. Regarding PP, there is a branch into two cases where the repetition of Event1 should be checked and Event2 should be checked.
Next, when the repetition of Event1 is checked, the record R8 matches among the data of DP=2001/01/16. In other words, three matches of commodity=B as defined in Event1 (Event1-Event1-Event1) are obtained. When DP becomes 2001/01/20, however, no data matches either Event1 or Event2, so the matching fails. Therefore, the process must be performed again from the branch.
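The branching and back track in this walk-through can be sketched in Python as follows. This is a simplified illustration, not the search process of the prior application: the function `search` and the representation of each visit as a (date, set of commodities) pair are assumptions, the inter-event definition is omitted, and the sketch always tries the repetition of Event1 before Event2. Any data passed to it would be hypothetical, since the contents of FIG. 1G are not reproduced here.

```python
def search(visits, start=0):
    """Match (Event1+)-Event2 beginning at visits[start].

    visits: list of (purchase_date, set_of_commodities), in order.
    Returns (matched_dates or None, number_of_back_tracks).
    Event1 is commodity B and Event2 is commodity C, as in the event pattern.
    """
    back_tracks = 0

    def after_event1(dp, matched):
        # Branch point: repeat Event1 first, then fall back to Event2.
        nonlocal back_tracks
        if dp >= len(visits):
            return None
        day, items = visits[dp]
        if "B" in items:
            result = after_event1(dp + 1, matched + [day])
            if result is not None:
                return result
            back_tracks += 1  # repetition branch failed: return to the branch
        if "C" in items:
            return matched + [day]
        return None

    if start < len(visits) and "B" in visits[start][1]:
        result = after_event1(start + 1, [visits[start][0]])
        return result, back_tracks
    return None, back_tracks
```

When several commodities are recorded at the same position, for example a visit containing both B and C, the repetition branch is taken first and the Event2 branch is reached only after a back track, which is the cost the text describes.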
In most cases, many commodities are purchased at the same time, and this is reflected in ordered multidimensional data such as POS (Point-Of-Sales) receipts. When a back track occurs in the search process for such multidimensional data, the efficiency of the process deteriorates remarkably. Therefore, a method of searching ordered data without the need for a back track is desirable.