Exemplary embodiments relate to regular expression learning and particularly to techniques for improving regular expressions.
Regular expressions have served as the workhorse of information extraction (IE) systems for several years. FIG. 1 illustrates an example of a conventional way to develop a regular expression (regex) for information extraction. A user inputs a regular expression at 100. The regular expression is run on a collection of documents at 110. The user labels match 1 through match n at 120.
The user determines whether the regular expression is good enough at 130. If the regular expression is satisfactory to the user, the regular expression is final and the process ends at 140. If the regular expression is not satisfactory to the user, the user creates a new regular expression at 135, and the new regular expression is run on the collection of documents, repeating the process from 110.
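For illustration only, the manual loop of FIG. 1 may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a description of any particular system: the function names, the sample documents, the labeling function `is_phone`, and the 0.9 threshold are all hypothetical. In particular, the user's judgment at 120 and 130 is modeled here by a labeling function and a precision threshold, and the user's successive hand-written revisions at 135 are modeled by a fixed list of candidate expressions.

```python
import re


def run_regex(pattern, documents):
    """Step 110: run the regex on every document and collect the matched strings."""
    compiled = re.compile(pattern)
    return [m.group(0) for doc in documents for m in compiled.finditer(doc)]


def precision(matches, is_correct):
    """Steps 120-130: label each match and return the fraction labeled correct.

    `is_correct` is a hypothetical stand-in for the user's manual labeling.
    """
    if not matches:
        return 0.0
    labels = [is_correct(m) for m in matches]
    return sum(labels) / len(labels)


def develop_regex(candidates, documents, is_correct, threshold=0.9):
    """Loop of FIG. 1: try each hand-written candidate until one is good enough.

    `candidates` stands in for the expressions the user writes at 100 and 135;
    `threshold` is an assumed stand-in for the user's "good enough" decision.
    """
    for pattern in candidates:
        matches = run_regex(pattern, documents)
        if precision(matches, is_correct) >= threshold:
            return pattern  # step 140: the expression is final
    return None  # no candidate satisfied the user


# Hypothetical task: extract phone numbers of the form 555-NNNN.
documents = [
    "Call 555-1234 before 2010-05-17.",
    "Fax: 555-9876, room 101-2045.",
]
candidates = [r"\d+-\d+", r"\b555-\d{4}\b"]


def is_phone(text):
    return text.startswith("555-")


final_regex = develop_regex(candidates, documents, is_phone)
```

In this sketch the first candidate also matches a date fragment and a room number, so half of its matches are labeled incorrect and it is rejected; the second, narrower candidate matches only the two phone numbers and is accepted. Each rejected candidate costs the user one full round of writing, running, and labeling, which is the manual effort referred to in this background.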
The popularity of regular expressions stems from the fact that they are sufficiently expressive, formally well understood, and supported by a wide range of languages for describing textual patterns. However, despite this popularity, there has been very little work on reducing the manual effort involved in designing high-quality regular expressions for complex information extraction tasks.