1. Field of the Invention
The present invention relates to a searching apparatus and a searching method for searching a large amount of data for a pattern that is based on the sequence of the individual data items.
2. Description of the Related Art
As conventional technologies for handling data based on its sequence, relational databases and pattern matching are known. Pattern matching uses a regular expression that describes the order in which character strings occur. For time sequence data such as stock price data, dedicated applications have been used. The features and fields of application of these conventional technologies are described below.
(1) Relational Database
Relational databases have been widely used for storing large amounts of data. As described in the 1970 paper by E. F. Codd, who proposed the relational model, a set of data in a relational database has no concept of a sequence. However, a relational database provides data types such as a date type and a time type. In addition, it provides a sorting function that orders records by their values. Thus, a relational database can be used for storing data that has a sequence.
Although the database can handle a date as a data type, it has no concept of a sequence. Thus, a query that searches for a data pattern in which a sequence such as a time sequence is taken into account cannot be processed by SQL (Structured Query Language) alone. To solve this problem, the process must be performed by a combination of the database and an external program. Consequently, a program is needed whenever a pattern that depends on a sequence is to be extracted.
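The combination of a database and an external program described above can be sketched as follows. This is only an illustrative sketch: the table name `sales`, its columns, the sample rows, and the hard-coded pattern "bread, and cheese on a later date" are all assumptions made for the example and do not come from the conventional systems discussed here. The database only sorts the records; the sequence logic lives in the external program.

```python
import sqlite3

def customers_with_bread_then_cheese(conn):
    """Return customers who bought bread and, on a later date, cheese."""
    # The database contributes only storage and sorting.
    rows = conn.execute(
        "SELECT customer, day, item FROM sales ORDER BY customer, day"
    )
    first_bread_day = {}   # customer -> earliest day on which bread was bought
    found = []
    # The sequence-dependent pattern is hard-coded in the external program.
    for customer, day, item in rows:
        if item == "bread":
            first_bread_day.setdefault(customer, day)
        elif item == "cheese":
            bread_day = first_bread_day.get(customer)
            # ISO date strings compare in chronological order.
            if bread_day is not None and bread_day < day and customer not in found:
                found.append(customer)
    return found

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer INTEGER, day TEXT, item TEXT)")
    # Hypothetical sample rows.
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            (1, "2000-03-21", "bread"),
            (1, "2000-03-23", "cheese"),
            (2, "2000-03-21", "cheese"),
            (2, "2000-03-24", "bread"),
        ],
    )
    return customers_with_bread_then_cheese(conn)
```

In this sketch, searching for a different pattern means editing the loop in `customers_with_bread_then_cheese`, which illustrates why a program is needed for each new pattern.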
(2) Pattern Matching in Regular Expression
In the field of character string searching, pattern matching with a regular expression has been used to take into account the order in which character strings occur. First, it is necessary to clarify the difference between character string searching and pattern matching. Character string searching is an operation in which the pattern to be searched for has been fully determined (for example, “search the text for a pattern abc”). Pattern matching, on the other hand, is an operation in which data is searched for a pattern that is not fully determined. Pattern matching is also referred to as pattern collating. In pattern matching, a regular expression is used to designate the pattern.
FIG. 1A shows an example of character string data and pattern matching with a regular expression a (a|b)*a. In the example, (a|b)* represents that a or b occurs repeatedly zero or more times. Character string searching and pattern matching may appear to be identical. However, since they belong to different categories, different algorithms should be applied to them.
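The distinction between the two operations can be illustrated with a short sketch. Python's `re` module is used here only to express the pattern a (a|b)*a; it is a backtracking matcher, and the helper names `find_abc` and `matches_pattern` are hypothetical names chosen for this example.

```python
import re

# Pattern from the text: an 'a', then zero or more 'a' or 'b', then an 'a'.
PATTERN = re.compile(r"a(a|b)*a")

def find_abc(text):
    """Character string searching: the target 'abc' is fully determined."""
    return text.find("abc")  # index of the first occurrence, or -1

def matches_pattern(text):
    """Pattern matching: the whole text must fit the regular expression."""
    return PATTERN.fullmatch(text) is not None
```

For instance, `matches_pattern("abba")` holds because the middle characters are covered by (a|b)*, whereas `matches_pattern("ab")` does not hold because the string does not end in a.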
To accomplish pattern matching with a regular expression, a finite automaton is used. A regular expression is converted into an automaton by a two-stage approach. In the first stage, the regular expression is converted into a non-deterministic finite automaton (NFA). It is easy to convert a regular expression into an NFA. Pattern matching can be performed with the NFA alone. Alternatively, the obtained NFA is converted into an equivalent deterministic finite automaton (DFA), and pattern matching is then performed with the DFA. The latter method is often used.
As the term “deterministic” implies, in a DFA exactly one transition destination is determined once the current state and the input are given. On the other hand, as the term “non-deterministic” implies, in an NFA there may be a plurality of transition destinations for one input in a given state.
FIG. 1B shows an NFA corresponding to the regular expression a (a|b)*a. Consider the case in which the character string aaa is given to the NFA. When the first character a is input, the NFA moves from the initial state 0 to state 1. The second character is also a. For the character a, state 1 has two transition destinations, states 1 and 2. To state the conclusion first, the correct choice is to move from state 1 to state 1 on the second character a and from state 1 to state 2 on the third character a. However, at the time the second character a is read, it cannot be known which of the two destinations should be taken.
To solve this problem, backtracking is used. One of the two states is provisionally selected, the transition to the selected state is made, and the process continues from that state. When the process fails, the process returns and the other state is tried. However, when backtracking is used, time is required for the returning process.
Thus, rather than using the NFA converted from the regular expression directly, the NFA is further converted into a DFA, and the pattern matching process is performed with the obtained DFA. When a DFA is used instead of an NFA, only one transition destination exists once a state and an input are determined. Thus, unlike with an NFA, backtracking is not necessary, and the process can be executed at high speed.
For example, the NFA shown in FIG. 1B is converted into the DFA shown in FIG. 1C. Of course, it takes time to convert an NFA into a DFA. However, when pattern matching is performed on a large amount of data, the DFA, which requires no backtracking, operates fast enough that the process as a whole can be performed at sufficiently high speed.
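The backtracking procedure can be sketched as follows. The transition table is an assumption reconstructed from the description of FIG. 1B (state 0 moves to state 1 on a; state 1 moves to state 1 or state 2 on a); the transition from state 1 to state 1 on b is inferred from the regular expression and is not stated explicitly in the text.

```python
# Assumed transition table for an NFA of a(a|b)*a.
NFA = {
    (0, "a"): [1],
    (1, "a"): [1, 2],  # two candidate destinations: the non-deterministic step
    (1, "b"): [1],
}
ACCEPTING = {2}

def accepts_with_backtracking(text, state=0, pos=0):
    """Try each candidate destination in turn; return on failure and try the next."""
    if pos == len(text):
        return state in ACCEPTING
    for next_state in NFA.get((state, text[pos]), []):
        if accepts_with_backtracking(text, next_state, pos + 1):
            return True
    # Every candidate failed, so control returns to the caller (the backtrack).
    return False
```

For the input aaa, the sketch first tries state 1 for the third character, reaches the end of the input in a non-accepting state, returns, and then succeeds by choosing state 2.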
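One common way to perform such a conversion is the subset construction, in which each DFA state is a set of NFA states. The sketch below applies it to an assumed transition table for a (a|b)*a; the table is reconstructed from the description of FIG. 1B, and the resulting DFA is not claimed to be drawn identically to FIG. 1C.

```python
from collections import deque

# Assumed NFA for a(a|b)*a (same reconstruction as described for FIG. 1B).
NFA = {(0, "a"): [1], (1, "a"): [1, 2], (1, "b"): [1]}
NFA_START = 0
NFA_ACCEPTING = {2}
ALPHABET = "ab"

def nfa_to_dfa():
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = frozenset({NFA_START})
    transitions = {}
    states = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for symbol in ALPHABET:
            target = frozenset(
                t for s in current for t in NFA.get((s, symbol), ())
            )
            if not target:
                continue  # no destination: this input is rejected
            transitions[(current, symbol)] = target
            if target not in states:
                states.add(target)
                queue.append(target)
    accepting = {s for s in states if s & NFA_ACCEPTING}
    return start, transitions, states, accepting

START, DFA, STATES, ACCEPTING = nfa_to_dfa()

def dfa_accepts(text):
    """One table lookup per character; no backtracking is needed."""
    state = START
    for ch in text:
        state = DFA.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING
```

The conversion is paid for once, after which each input character costs a single lookup, which is the trade-off described above for large amounts of data.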
A regular expression is defined recursively by three basic operations (operators), namely concatenation, union, and closure, as shown in FIG. 1D. As with ordinary mathematical expressions, these basic operations have an order of precedence. The closure “*” binds most strongly, concatenation binds second most strongly, and the union “|” binds least strongly. However, the precedence can be changed by enclosing part of the expression in parentheses.
POSIX (Portable Operating System Interface for UNIX (Registered Trademark)) 1003.2 prescribes two types of regular expressions, the Basic Regular Expression (BRE) and the Extended Regular Expression (ERE). Software utilities running on UNIX (Registered Trademark) that use the BRE include ed, ex, vi, more, sed, and grep. Software utilities running on UNIX (Registered Trademark) that use the ERE include awk and grep with the -E option designated.
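The effect of this precedence can be checked with a small sketch. Python's `re` syntax is used here as a stand-in for the formal notation, on the assumption that it treats these three operations in the same order; the helper `full` and the sample patterns are hypothetical and chosen only for illustration.

```python
import re

def full(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Without parentheses, ab*|c is read as (a(b*))|c:
# the closure applies to b only, then concatenation, then union.
UNPARENTHESIZED = r"ab*|c"

# Parentheses change the grouping.
CLOSURE_OVER_AB = r"(ab)*|c"   # closure now applies to the concatenation ab
UNION_INSIDE = r"a(b*|c)"      # union now applies before the concatenation
```

Under this reading, `full(UNPARENTHESIZED, "abbb")` holds while `full(UNPARENTHESIZED, "abab")` does not, and `full(CLOSURE_OVER_AB, "abab")` holds once the parentheses are added.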
(3) Time Sequence Application
In some cases, sequential data is processed with a dedicated application, as in the analysis of sequential patterns for stock market forecasting or data mining. When a dedicated application is used, a time sequential pattern can be processed at high speed. However, such a dedicated application cannot always be used for various conventional queries.
However, the above-described conventional searching process has the following problems.
A regular expression of a character string and pattern matching therewith form a general framework that provides a searching method for character strings of any class. However, as described in (1) to (3) below, data in which a sequence is considered differs in its characteristics from character string data, so a regular expression and pattern matching therewith cannot be applied to such data.
(1) A character string consists only of characters arranged adjacently at equal intervals. On the other hand, data in which a sequence is considered may have a plurality of events at one position. For example, a customer may make a plurality of purchases on the same day. Thus, an expression such as “a customer 10001 bought two commodities, milk and bread, on March 21” is required. In a character string, however, two characters never occur at the same position. Thus, a regular expression cannot represent events that occur at the same time.
(2) With respect to a character string, a value is identical to a literal. In other words, when a character string “A” is given, it is both the value “A” and the literal A. However, with respect to data composed of a plurality of attributes, it is necessary to treat a combination of conditions on a plurality of fields (in this case, commodities and prices) as a literal (for example, customers who bought bread priced at ¥200 or higher are called ‘customer group A’). A combination of conditions on a plurality of fields cannot be represented in a regular expression.
(3) In the case of data in which a sequence is considered, the sequence requires a concept of an interval (for example, “cheese was bought within two days after bread was bought”). However, in a regular expression of a character string, an interval cannot be designated.
FIG. 1E shows retail sales data as an example of data in which a sequence is considered. In the example, the dates on which customers bought commodities form the sequence. Referring to FIG. 1E, a customer 100001 bought milk and bread at the same time on March 21 (03/21). However, events that occur at the same time cannot be represented in a regular expression. In addition, the retail sales data contains no data for March 22, which is a holiday. In a character string, however, there is no situation in which no character occurs at a particular position. In addition, it is difficult to write a regular expression that takes a sequence relation into account (for example, the interval between the date on which a particular customer bought a particular commodity at a particular store and the date on which he or she next came to the store is within three days).
As described above, conventional regular expressions and automaton theory cannot, in general, designate a pattern in which a sequence is considered over a set of records in which a sequence is considered.
In addition, as described above, a relational database supports the relation of a sequence only in a limited form. Thus, when data in which a sequence is considered is searched for a designated pattern, the process must be performed by a combination of a database and an external program. However, when a program is created for each query, pattern matching cannot be executed simply by changing the designated pattern, as is possible with pattern matching in a regular expression.
In addition, as with stock price forecasting, data in which a sequence is considered can be analyzed with a dedicated application. When data prepared for a particular purpose is input to a dedicated application, it returns a predetermined result. Thus, for that purpose alone, the process can be performed at high speed. However, since a dedicated application is designed for a dedicated purpose, it cannot handle a variety of problems in the way that pattern matching in a regular expression can.
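The characteristics that distinguish such data from a character string can be made concrete with a small sketch. The record layout, the year, the prices, and the cheese purchase on March 23 are hypothetical values added for illustration; only the milk and bread purchases by customer 100001 on March 21 and the absence of data on March 22 follow the description of FIG. 1E. The sketch also assumes that “within two days after” means strictly later and at most two days later.

```python
from datetime import date

# Hypothetical records: (customer, purchase date, commodity, price in yen).
RECORDS = [
    (100001, date(2000, 3, 21), "milk", 180),
    (100001, date(2000, 3, 21), "bread", 220),
    (100001, date(2000, 3, 23), "cheese", 300),
]

def items_by_date(records, customer):
    """Group commodities by date; one date may hold several events."""
    grouped = {}
    for cust, day, item, _price in records:
        if cust == customer:
            grouped.setdefault(day, set()).add(item)
    return grouped

def bought_within(records, customer, first, second, max_days):
    """True if `second` was bought 1..max_days days after `first`."""
    first_days = [d for c, d, i, _ in records if c == customer and i == first]
    second_days = [d for c, d, i, _ in records if c == customer and i == second]
    return any(
        0 < (s - f).days <= max_days for f in first_days for s in second_days
    )
```

Here one date maps to two commodities, March 22 has no entry at all, and the interval condition is evaluated on calendar dates rather than on positions in a string, which are the three properties that a character string cannot express.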