The present invention relates to parsing data from various electronic data formats, such as legacy print stream files, electronic data interchange (EDI) files, Extensible Markup Language (XML) files, and structured text formats, for use in electronic commerce applications such as electronic bill presentment and electronic statements, and for assisting in the integration of legacy systems.
More and more organizations are finding themselves pressed to conduct business electronically (so-called e-business) over the Internet or other computer networks. E-business calls for specialized application software such as Electronic Bill Presentment and Payment (EBPP) applications and Electronic Statement Presentment (ESP) applications. To implement such applications, traditional paper documents have to be converted to an electronic form that can be processed electronically and exchanged with customers, suppliers or others over the Internet or by other means. The converted documents will typically be formatted as Hypertext Markup Language (HTML) Web pages, e-mail messages, XML messages, or other electronic formats suitable for electronic exchange, processing, display and/or printing.
Consider, for example, a telephone company that is in the process of implementing an EBPP service. Any EBPP implementation must be integrated with the organization's existing billing systems. The straightforward approach to integrating the billing systems would be simply to get the data from the existing billing system's database and use that data in the new e-business system. This approach, however, is not as simple as it may seem. Many legacy systems do not have a standard interface for data extraction and, moreover, the information required to create an electronic document often does not exist in any one easily accessible database format. The telephone company, for example, might maintain three different databases feeding into its legacy billing application: (1) a customer information database containing account numbers, calling plans, addresses and other customer profile information, which would tend to be updated infrequently; (2) a rate and tariff database containing the rate structure used to calculate the cost of calls, typically based on geographic zones, time of day and the like, which would tend to be updated periodically; and (3) a transaction database containing the history of the calls made by customers, including number called, duration and the like, which would be updated very frequently.
These databases may be located on three separate and distinct computer systems (e.g. an IBM mainframe, a Tandem fault-tolerant system and a UNIX minicomputer) and stored in three different database formats (e.g. an Oracle RDBMS, flat files and an IMS database). Moreover, there is typically a great deal of application logic embedded in the billing system's legacy software code, which could be in the form of a COBOL program written in the 1960s, for calculating taxes, discounts, special calling charges and so on. Because of these complexities, it is generally not possible simply to read a database to get the required billing data. Even though it may be possible to recreate a bill for use in e-business from the original data sources, this would generally require re-creating all of the functionality of the organization's existing billing system. The cost and time required to do this would generally be prohibitive.
For use in legacy system integration and transition to e-commerce, specialized software tools known as parsers have been developed to extract data out of legacy file formats. The known parsers are monolithic in the sense that the parsing is performed by one large program (e.g. parser.exe) for all documents and file formats. The following example points out the inherent problems with this approach.
Consider a company, for definiteness referred to as Acme Credit Card Corp., that wishes to make its Quarterly Management Report (QMR) available to customers over the Internet. The challenge is to parse the statement data out of the existing print stream created by Acme's legacy system. This print stream is located in a file called QMR.AFP, which is in the IBM AFP file format. FIG. 1 shows the logical processing flow used by monolithic parsers.
When using a monolithic parser, a developer/user 10 typically creates rules for parsing a data stream, which are applied by monolithic parsing engine 11. If these rules enable the parser to parse the input document QMR.AFP from data source 12 successfully, the rules are saved in a rule base 13 and subsequently used in a production application to extract data out of the legacy format. The inherent problem with this approach is that, because of the extreme variability in legacy formats such as print streams, it is virtually impossible to pre-define all possible rules for parsing data. In e-business applications such as EBPP, typically nothing less than 100% accuracy is acceptable. Vendors are therefore compelled to update their parsers frequently to handle new rules. Furthermore, it may even be necessary to put customer-specific code into the monolithic parser. For example, if the Acme QMR document has an Acme-specific idiosyncrasy that no pre-existing rule can handle, it becomes necessary to add a new rule and update the parser code or, even worse, to add custom Acme-specific code to the parser as shown in the following pseudocode listing.
If (customer = "Acme Credit Card Corp."
    and document = "Quarterly Management Report"
    and special condition exists)
then execute custom parsing logic
Although the custom code approach of the preceding listing might work, putting customer-specific and document-specific logic in a general-purpose utility is highly problematic from an engineering and quality control point of view. Such production parser code would constantly need to be updated for specific cases, making the parser overcomplicated, which will generally result in a higher number of programming bugs. Furthermore, the extra condition checking would tend to slow the parser's operation. Moreover, the inclusion of customer-specific and document-specific code in a monolithic parser makes it unduly burdensome for a software vendor to offer the same parser to many different organizations.
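The maintenance problem described above can be made concrete with a minimal sketch in Python. It is illustrative only and is not taken from any actual parser product: the `Rule` class, the `RULE_BASE` contents, the `monolithic_parse` function, the sample field patterns, and the "Amt Owing" idiosyncrasy are all hypothetical names and data invented for this example. The sketch assumes a text input rather than a real AFP print stream.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical parsing rule: a field name and a regex with one capture group."""
    field: str
    pattern: str


# Hypothetical shared rule base, standing in for rule base 13 of FIG. 1.
RULE_BASE = [
    Rule("account", r"Account:\s*(\d+)"),
    Rule("total", r"Total Due:\s*\$([\d.]+)"),
]


def monolithic_parse(text, customer, document):
    """One function parses every customer's documents (the monolithic approach)."""
    fields = {}
    for rule in RULE_BASE:
        match = re.search(rule.pattern, text)
        if match:
            fields[rule.field] = match.group(1)

    # Customer-specific branch of the kind shown in the pseudocode listing.
    # The "special condition" is modeled here as the generic rules failing
    # to find a total (an assumption made for this example).
    if (customer == "Acme Credit Card Corp."
            and document == "Quarterly Management Report"
            and "total" not in fields):
        # Invented idiosyncrasy: Acme labels the total "Amt Owing".
        match = re.search(r"Amt Owing\s+([\d.]+)", text)
        if match:
            fields["total"] = match.group(1)
    return fields


acme_page = "Account: 12345\nAmt Owing 250.00"
print(monolithic_parse(acme_page, "Acme Credit Card Corp.",
                       "Quarterly Management Report"))
# prints {'account': '12345', 'total': '250.00'}
```

Every call, for every customer, evaluates the Acme condition, and each further idiosyncrasy would add another such branch to the same function. This is the growth in complexity and per-document overhead that the preceding paragraph attributes to monolithic parsers.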