Modern software applications typically operate on data stored in well-structured form, such as normalized relational databases, delimited files, or XML files. Business applications often also have to interact with loosely-structured data, in which the identification of a particular part depends on other parts and on additional conditions. Examples of such data include: full names, which can include first, middle and last name(s), title(s), suffix(es), last name prefix(es), etc.; mailing addresses, which can include cities, states, zip codes, street addresses, PO Boxes, apartment numbers, etc.; and Internet URLs, which can include protocol, domain name, IP address, page relative path, page name, parameters, etc.
In order to process and maintain such data, computer programs have to be able to identify particular items in such loosely-structured data. For example, when a data sample is created for testing, all sensitive data, including real names and addresses, typically must be replaced with fictitious values. While identification of the items within some kinds of well-structured data may be trivial, it can become very complicated when the analysis involves many optional data items with complex ordering and separation rules.
The approach most commonly used for finding data subsets that match given patterns is based on regular expressions. Regular expressions (or Regex) have been a standard in computer science since the 1960s as a formal language capable of describing fairly complex matching rules, and multiple implementations are widely used in the industry.
While Regex is extremely powerful and efficient for identification of a single data subset, it is very limited in defining non-trivial relationships between multiple data items. The only method available is based on lookarounds (lookaheads and lookbehinds), which are extensions of the Regex standard and are supported by several implementations. When Regex with lookarounds is used for parsing, for example, a mailing address, in which most of the items are optional and the order can vary, the corresponding regular expressions become very long and hard to maintain. Many of these expressions contain identical or almost identical pieces corresponding to the same data items, with no suitable way to avoid the duplication or to keep the pieces in sync. Writing or modifying such regular expressions becomes fairly complicated, and the processing efficiency is poor.
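The duplication described above can be illustrated with a minimal Python sketch. The two patterns, the sample address, and the helper name find_state_and_zip are hypothetical and chosen only for illustration; they are not taken from any particular implementation. Each item is accepted only in the context of the other, so the sub-pattern for each item has to be written out twice.

```python
import re

# A state abbreviation is accepted only when a ZIP code follows it (lookahead).
STATE_RE = re.compile(r"\b[A-Z]{2}(?= \d{5}(?:-\d{4})?\b)")

# A ZIP code is accepted only when a state abbreviation precedes it (lookbehind).
# Note that the state and ZIP sub-patterns are repeated from STATE_RE.
ZIP_RE = re.compile(r"(?<=[A-Z]{2} )\d{5}(?:-\d{4})?\b")


def find_state_and_zip(address):
    """Return (state, zip_code); either element is None when not found."""
    state = STATE_RE.search(address)
    zip_code = ZIP_RE.search(address)
    return (
        state.group(0) if state else None,
        zip_code.group(0) if zip_code else None,
    )


print(find_state_and_zip("123 Main St, Springfield, IL 62704"))
```

Even with only two interdependent items, a change to the ZIP code format has to be made in both expressions. Adding further optional items such as apartment numbers or PO Boxes would multiply the repeated fragments.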
Another approach that can be used to address the problem is to associate every composite data structure with an executable module or procedure that provides parsing logic and returns the identified parts. This is an extremely powerful approach, since it can provide a custom implementation that is most suitable and most efficient for every composite type. For example, it can use Regex or other techniques for identification of particular data parts, while the program keeps track of the logical dependencies and already identified items. The primary limitations of this approach are the development cost of supporting each new composite type and the high cost of maintaining and modifying the program.
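A dedicated parsing procedure of this kind might look like the following Python sketch for a full name. The function name parse_full_name, the title and suffix lists, and the positional rules are assumptions made for illustration only; a practical parser would need many more rules, which is the source of the development and maintenance cost noted above.

```python
import re

# Hypothetical, deliberately short lookup lists.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "phd", "md"}


def _norm(token):
    """Lower-case a token and drop trailing periods for lookup."""
    return token.rstrip(".").lower()


def parse_full_name(text):
    """Split a full name into parts, tracking which tokens are already used."""
    tokens = [t for t in re.split(r"[\s,]+", text.strip()) if t]
    parts = {"title": None, "first": None, "middle": None,
             "last": None, "suffix": None}

    # A title can only be the first token.
    if tokens and _norm(tokens[0]) in TITLES:
        parts["title"] = tokens.pop(0)
    # A suffix can only be the last token, and only if a name remains.
    if len(tokens) > 1 and _norm(tokens[-1]) in SUFFIXES:
        parts["suffix"] = tokens.pop()
    # Remaining tokens: first name, last name, then anything in between.
    if tokens:
        parts["first"] = tokens.pop(0)
    if tokens:
        parts["last"] = tokens.pop()
    if tokens:
        parts["middle"] = " ".join(tokens)
    return parts


print(parse_full_name("Dr. John Quincy Adams, Jr."))
```

The ordering logic lives in program code rather than in the patterns, so each item is described only once. Supporting another composite type, such as a mailing address, would require writing and maintaining a separate procedure of this kind.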
Therefore, it is desirable to develop improved methods for identifying data items in loosely-structured data. This section provides background information related to the present disclosure which is not necessarily prior art.