Embodiments of the present invention relate to regular expressions and, more specifically, to longest regular expression matches.
The process of extracting information from large-scale unstructured text is called text analytics and has various applications in business analytics, healthcare, security intelligence, and other fields. Analyzing unstructured text and extracting the insights hidden in that text at high bandwidth and low latency are computationally challenging tasks. In particular, text analytics functions rely heavily on regular expressions (regexes).
Consider, for example, a text analytics platform that continuously collects news entries from various data sources. When a user submits a news search query that contains a set of keywords, including a company name and the word “Germany,” the news search engine retrieves from its index all news entries that contain these keywords. The relevant news entries are then parsed to identify phrases, which might, for instance, reveal a business expansion strategy into Germany. This second stage acts as a second level of filtering, and only entries that contain interesting and useful information are transferred to the user, preferably quickly.
Further, this second stage requires a deeper analysis of the news entries and is, thus, computationally more intensive than a simple keyword search. When thousands of users submit news search queries concurrently, this second stage becomes a computational bottleneck. One way of eliminating this bottleneck is to scale up the number of processor cores in the text analytics platform, which results in increased space and energy consumption as well as degraded reliability.
Regex searching can help with tasks like the above, as well as with other tasks involving natural language processing. Pattern matching is useful when the analytics system does not know in advance how the company name and the term “Germany” will appear in relation to each other. A regex is an expression that defines a search pattern, as opposed to a specific search string. Compared to string searches, regexes therefore provide more flexibility within a search.
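By way of illustration only, the difference between a string search and a regex search can be sketched as follows. The company name “Acme Corp”, the sample text, and the 40-character window between the two terms are hypothetical choices made for this sketch and are not taken from any particular text analytics system.

```python
import re

# Hypothetical sample text and company name, chosen only for illustration.
text = ("Acme Corp plans to open a plant in Germany. "
        "In Germany, sales at Acme Corp rose.")

# A plain string search succeeds only if the exact phrase occurs.
found_fixed_phrase = "Acme Corp in Germany" in text

# A regex describes a pattern: the two terms in either order,
# separated by at most 40 arbitrary characters (an assumed window).
pattern = re.compile(
    r"Acme Corp\b.{0,40}?\bGermany|Germany\b.{0,40}?\bAcme Corp"
)
matches = [m.group(0) for m in pattern.finditer(text)]
```

In this sketch the fixed phrase is not found, whereas the regex reports one match for each ordering of the two terms.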
Text analytics systems compute a span, often stored in a span data structure, for each regex match. Each span includes a start offset and an end offset, where the start offset indicates a position in the searched text at which the match begins, and the end offset indicates a position at which the match ends.
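A span of the kind described above might be represented as in the following minimal sketch. The names `Span` and `find_spans` are hypothetical, and the convention that the end offset points one position past the last matched character is an assumption borrowed from Python's `re` module rather than a requirement of any particular system.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical span record for one regex match."""
    start: int  # offset at which the match begins
    end: int    # offset one past the last matched character (assumed convention)


def find_spans(pattern: str, text: str) -> List[Span]:
    """Return one Span per non-overlapping match, scanning left to right."""
    return [Span(m.start(), m.end()) for m in re.finditer(pattern, text)]


# Capitalized words in a short sample text.
spans = find_spans(r"[A-Z][a-z]+", "Expansion into Germany")
```

Here `spans` holds two entries, one for “Expansion” and one for “Germany”, each carrying only the two offsets rather than the matched text itself.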
To eliminate ambiguity, text analytics systems often employ well-defined tie-break heuristics when dealing with overlapping matches. For example, when multiple regex matches have the same end offset, typically only the span with the smallest start offset is reported. This technique is referred to as a leftmost matching heuristic. A more common heuristic is called leftmost longest matching, which requires computation of the leftmost regex matches that are not contained by other regex matches.
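The two heuristics can be sketched over a set of candidate spans as shown below. This is a brute-force illustration under stated assumptions, not an implementation of any particular matching engine: the helper names `all_spans`, `leftmost_per_end`, and `drop_contained` are hypothetical, candidate spans are enumerated exhaustively, and leftmost longest matching is read literally as "leftmost matches that are not contained by other matches."

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start offset, exclusive end offset)


def all_spans(pattern: str, text: str) -> List[Span]:
    """Enumerate every non-empty substring that the pattern matches in full."""
    rx = re.compile(pattern)
    n = len(text)
    return [(s, e)
            for s in range(n)
            for e in range(s + 1, n + 1)
            if rx.fullmatch(text, s, e)]


def leftmost_per_end(spans: List[Span]) -> List[Span]:
    """Leftmost heuristic: for each end offset keep the smallest start offset."""
    best: Dict[int, int] = {}
    for s, e in spans:
        if e not in best or s < best[e]:
            best[e] = s
    return sorted((s, e) for e, s in best.items())


def drop_contained(spans: List[Span]) -> List[Span]:
    """Discard every span that lies entirely inside a different span."""
    return [a for a in spans
            if not any(b != a and b[0] <= a[0] and a[1] <= b[1] for b in spans)]


candidates = all_spans(r"ab|abc|bcd|cd", "abcd")
leftmost = leftmost_per_end(candidates)
leftmost_longest = drop_contained(leftmost)
```

For the text “abcd”, the candidates are (0, 2), (0, 3), (1, 4), and (2, 4). The leftmost heuristic drops (2, 4) because (1, 4) shares its end offset and starts earlier, and the containment filter then drops (0, 2) because it lies inside (0, 3).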