1. Field of the Invention
The present invention relates to a system, method and program for searching for a text matching a predetermined pattern from text data.
2. Description of the Related Art
There is demand for technology that searches text data for text matching a predetermined pattern. Specifically, texts matching a given pattern are retrieved, and problem analysis is performed on the retrieved texts. There is similar demand for compliance violation analysis.
Taking interactions at a call center as an example, one conceivable task concerns the mistake “Chumon to chigau seihin ga todoita (The product different from the one ordered has arrived).” In this task, a search pattern corresponding to the content of the mistake is created, and the documents of interactions are then searched based on the search pattern to track how the number of matching documents changes before and after a measure for the mistake is taken. A task for such a purpose requires high accuracy, so the texts of the interactions first need to be parsed through language processing and then processed through pattern matching.
In this case, for example, documents matching the following pattern may be retrieved:
“chigau (different)” modifies “seihin (product)”
“seihin (product)” modifies “todoku (arrive)”
The parsing result has a tree structure called a parse tree expressing a dependency structure between words for each sentence. Moreover, a pattern to match nodes in the parse tree is also expressed by a tree structure. As a result, matching is a problem of determining whether or not the parse tree includes the pattern as a partial structure of parent and child nodes having a gap within an allowable range.
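The matching problem described above can be illustrated with a minimal sketch. It assumes that each node holds a single base-form word, that the modifiee is the parent of its modifier, and that a "gap" means the number of intervening nodes between a matched parent and child. The names `Node`, `descendants_within`, `matches_at`, and `contains` are hypothetical and are not taken from any product or cited document. As a simplification, the sketch does not require sibling pattern nodes to map to distinct tree nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    children: list["Node"] = field(default_factory=list)


def descendants_within(node, max_gap):
    """Yield descendants of node separated from it by at most max_gap nodes."""
    frontier = node.children
    for _ in range(max_gap + 1):
        next_frontier = []
        for child in frontier:
            yield child
            next_frontier.extend(child.children)
        frontier = next_frontier


def matches_at(tree_node, pattern_node, max_gap):
    """True if the pattern rooted at pattern_node matches at tree_node."""
    if tree_node.word != pattern_node.word:
        return False
    return all(
        any(matches_at(candidate, pattern_child, max_gap)
            for candidate in descendants_within(tree_node, max_gap))
        for pattern_child in pattern_node.children
    )


def contains(tree, pattern, max_gap=0):
    """True if the pattern matches at any node of the parse tree."""
    if matches_at(tree, pattern, max_gap):
        return True
    return any(contains(child, pattern, max_gap) for child in tree.children)


# Assumed parse tree for "Chumon to chigau seihin ga todoita":
# chumon -> chigau -> seihin -> todoku (each word modifies the next).
parse_tree = Node("todoku", [Node("seihin", [Node("chigau", [Node("chumon")])])])

# Pattern from the text: chigau modifies seihin, seihin modifies todoku.
pattern = Node("todoku", [Node("seihin", [Node("chigau")])])

# Pattern that skips "seihin", so it only matches when a gap of 1 is allowed.
gapped_pattern = Node("todoku", [Node("chigau")])

print(contains(parse_tree, pattern))                    # direct match
print(contains(parse_tree, gapped_pattern, max_gap=0))  # no gap allowed
print(contains(parse_tree, gapped_pattern, max_gap=1))  # one node may intervene
```

With `max_gap=0` the check reduces to an exact parent-child match; raising `max_gap` widens the allowable range at the cost of examining more candidate descendants per pattern node.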
In Omnifind Analytics Edition provided by International Business Machines Corporation, a pattern is described in advance, and pattern matching is performed on all documents in batch processing.
However, the pattern description in the above case has the following problems.
1. Pattern creation involves trial and error, and the steps from pattern editing to result browsing must be processed sequentially. Thus, efficiency is poor. Particularly when the data size is large, one may have to wait a day or more before one can start checking the result of an edit.
2. It is impossible to know what kinds of patterns exist unless the entire text data is checked.
3. There is no clue for finding an unknown pattern when searching for a pattern useful for tasks.
In terms of searching on a tree structure, as a search technology for XPath, there is a technology described in “A Fast Index for Semistructured Data” (Brian F. Cooper, Neal Sample, Michael J. Franklin, Gisli R. Hjaltason, Moshe Shadmon, The VLDB Conference 2001). In this technology, a table having a preorder and a postorder of each node is prepared in a relational database (RDB), and each node is handled as one record therein. By applying this technology to the parsing result, the above problem 1 can be solved. However, it takes several seconds to search through 100 MB of data for a simple dependency involving two words. Moreover, searching through data of several GB to several tens of GB takes so long that a user feels stress. In addition, this technology provides no solution to the above problems 2 and 3.
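The preorder and postorder numbering mentioned above can be sketched as follows. This is an illustration of the general numbering idea only, not the implementation of the cited document: an in-memory list of tuples stands in for the RDB table, and the names `number_nodes` and `is_ancestor` as well as the sample tree (including the word “kyou (today)”) are hypothetical. The property used is that node a is an ancestor of node d exactly when a comes before d in preorder and after d in postorder.

```python
def number_nodes(tree):
    """Return one record (word, preorder, postorder) per node.

    The tree is given as nested tuples of the form (word, [children]).
    """
    records = []
    counters = {"pre": 0, "post": 0}

    def visit(node):
        word, children = node
        pre = counters["pre"]
        counters["pre"] += 1
        index = len(records)
        records.append(None)  # reserve the slot so records stay in preorder
        for child in children:
            visit(child)
        records[index] = (word, pre, counters["post"])
        counters["post"] += 1

    visit(tree)
    return records


def is_ancestor(a, d):
    """True if record a is an ancestor of record d."""
    return a[1] < d[1] and a[2] > d[2]


# Assumed parse tree: "chigau" modifies "seihin"; "seihin" and "kyou" modify "todoku".
tree = ("todoku", [("seihin", [("chigau", [])]), ("kyou", [])])

records = number_nodes(tree)
by_word = {record[0]: record for record in records}

for record in records:
    print(record)

print(is_ancestor(by_word["seihin"], by_word["chigau"]))  # seihin is above chigau
print(is_ancestor(by_word["seihin"], by_word["kyou"]))    # siblings, not ancestor
```

Because the ancestor test is a pair of integer comparisons, it can be expressed as a join condition over node records, which is what makes the approach suitable for a relational table.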
For heuristic listing of patterns, there has been known a technology related to tree mining described in the document “Efficiently Mining Frequent Trees in a Forest” (Mohammed J. Zaki, Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Jul. 23-26, 2002). This technology enables extraction, by batch processing, of frequently appearing embedded sub-trees (sub-trees that may include parent and child nodes not having a direct parent-child relationship in the original tree). However, when this technology is applied to the parsing result, a large number of patterns that are obvious to a user are extracted, such as “onegai” “itasu” and “denwa (telephone)” “wo” “kiru (hang up)”. Thus, the technology does not serve as a solution to the above problem 3.
“A Dependency Analysis Model with Choice Restricted to at Most Three Modification Candidates” (Hiroshi Kanayama, Kentaro Torisawa, Yutaka Mitsuishi and Jun-ichi Tsujii, Journal of Natural Language Processing, vol. 7, No. 5, pp. 71-91, 2000) proposes a triplet/quadruplet model in which: the conditional part of the probability consists of information on a modifier clause and all its modification candidates; and the probability that a candidate is chosen as the modifiee is calculated.
Japanese Patent Application Publication No. 2007317139 discloses supporting document data analysis by focusing on a relationship between dependencies. A dependency search condition input part specifies a dependency to be retrieved. In a normal search, a keyword and its search position (a modifier part or a modifiee part or both) are specified. A dependency search part extracts the dependency corresponding to the specified keyword and search position, by referring to a basic meaning chunk set storage part in a dependency set storage part. The dependency search part extracts a dependency of a modifier part or a modifiee part by referring to a meta-meaning chunk storage part in the dependency set storage part. Moreover, a display part displays a dependency set as a search result.