Parsing of data typically involves an initial lexical analysis step, which breaks up the input data into logically separate components such as field names, constants and operators (the lexical analyser outputs a string of ‘tokens’), followed by a syntactic analysis step, which processes the tokens to determine the syntactic structure of the data as defined by a grammar. A lexical analyser may also remove redundant spaces and may deal with character-set mappings (e.g. replacing upper case letters with lower case). The term parser is used to refer to a program which performs such analysis. The output of a syntax analyser may be a syntax tree (also known as a parse tree) which represents the syntactic structure of the string of data being parsed. Parsing is well known in the context of compilers, but is a required step for many data processing operations. For example, a message processing program may need to parse an input message to model the message structure before it can begin processing the message, such as to understand the structure and format of the message before performing format conversion. This may include separating a message descriptor into a set of fields comprising name-value pairs so that the different values in the named fields can be processed, and similarly separating a stream of bits comprising the message data into name-value pairs so that the data can be processed.
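By way of illustration only, the two steps described above may be sketched as follows. The descriptor syntax (name-value pairs written as `Name = Value` and separated by semicolons), the token kinds and the function names `tokenize` and `parse_pairs` are assumptions made for this sketch and are not taken from any particular messaging product.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (token kind, token text)

# Hypothetical token grammar: words, '=' and ';', with optional leading spaces.
TOKEN_PATTERN = re.compile(
    r"\s*(?:(?P<WORD>[A-Za-z0-9_]+)|(?P<EQUALS>=)|(?P<SEP>;))"
)


def tokenize(text: str) -> List[Token]:
    """Lexical analysis: split the input into tokens, dropping redundant spaces."""
    text = text.rstrip()
    tokens: List[Token] = []
    pos = 0
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind = match.lastgroup  # name of the alternative that matched
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


def parse_pairs(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Syntactic analysis for the assumed grammar: WORD EQUALS WORD, SEP-separated."""
    pairs: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        chunk = tokens[i:i + 3]
        if [kind for kind, _ in chunk] != ["WORD", "EQUALS", "WORD"]:
            raise ValueError(f"expected name=value at token {i}")
        pairs.append((chunk[0][1], chunk[2][1]))
        i += 3
        if i < len(tokens):
            if tokens[i][0] != "SEP":
                raise ValueError(f"expected ';' at token {i}")
            i += 1
    return pairs


fields = parse_pairs(tokenize("Format = XML ; Priority=5"))
```

In this sketch the list of name-value pairs stands in for the syntax tree; a fuller parser would build a nested structure reflecting the grammar.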
It is now very common for a computing network to integrate many heterogeneous systems, and individual messages sent across these networks may include different data formats within headers inserted by the different systems through which the message passes. It is therefore important for a network-wide messaging service to be capable of handling a number of different data formats within a single message, to support the increasing requirement for business and system integration. It is a feature of some existing messaging products to be able to parse an incoming message which includes a number of different format components, splitting the message into its differently formatted constituent parts and separately parsing these parts to generate an output message for further processing.
In the past, these messaging products have relied on predefined message formats and either a generic parser or a single parser-selection process which has access to a repository of message formats. Either the generic parser analyses all components itself or a process scans through the message to identify the differently formatted components, comparing the identified formats with those in the repository, and then a selector assigns each component to a specific parser which is capable of performing syntactic analysis for that component.
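The selector-based arrangement described above may be sketched as follows, purely as an illustration. The framing of each component (a four-character format tag followed by a four-digit length), the tags `LIST` and `PAIR`, and all function names are hypothetical and were chosen only to keep the example small.

```python
from typing import Callable, Dict, List, Tuple


def parse_list(data: str) -> List[str]:
    """Specific parser for a hypothetical comma-separated format."""
    return data.split(",")


def parse_pairs_format(data: str) -> Dict[str, str]:
    """Specific parser for a hypothetical 'name=value;name=value' format."""
    return dict(item.split("=", 1) for item in data.split(";") if item)


# Repository of known formats, keyed by the assumed four-character tag.
PARSER_REPOSITORY: Dict[str, Callable[[str], object]] = {
    "LIST": parse_list,
    "PAIR": parse_pairs_format,
}


def scan_components(message: str) -> List[Tuple[str, str]]:
    """Scan the whole message and split it into (format tag, payload) components."""
    components: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(message):
        tag = message[pos:pos + 4]
        length = int(message[pos + 4:pos + 8])
        payload = message[pos + 8:pos + 8 + length]
        components.append((tag, payload))
        pos += 8 + length
    return components


def select_and_parse(message: str) -> List[object]:
    """Selector: compare each component's format with the repository and dispatch."""
    results: List[object] = []
    for tag, payload in scan_components(message):
        parser = PARSER_REPOSITORY.get(tag)
        if parser is None:
            raise ValueError(f"no parser registered for format {tag!r}")
        results.append(parser(payload))
    return results


parsed = select_and_parse("LIST0005a,b,c" + "PAIR0007x=1;y=2")
```

Note that the scan in this sketch succeeds only because every component's format and framing are known in advance, which is the limitation discussed in the following paragraphs.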
This approach has proven satisfactory for situations in which only a small number of formats are possible and the format and sequence of all the message components are known in advance, since this knowledge enables the selection of appropriate parsers for the different components. It may also be possible in some cases, although inefficient, to rely on a generic parser to perform an initial analysis and resulting parser-selection before the main syntactic analysis begins. However, this would require the generic parser to be capable of analysing all data formats within each message so that the first scan of the message could break the message into components and could provide the information to determine which specific parser should handle which components. It may be very difficult to implement such capability within a single generic parser. Furthermore, separating syntactic analysis into separate first and second steps would entail processing delays and would tend to duplicate some of the analysis.
An alternative to this initial step of a generic parser analysing the whole message is for individual selected parsers to perform syntactic analysis of specific parts of a message, including identifying the data format or data type of the next message component in the sequence of bytes which makes up the message. In this case, a parser-selection process can identify the format of a first component and select a first specific parser; the selected parser can then parse this first component, identify the format or type of the next component and send this format or type information to the parser-selector to invoke another specific parser for this next message component. As noted above, such solutions rely on knowledge of predefined message formats and predictable sequences of message components. This has proven satisfactory for handling differently formatted message headers, if the selected parsers are given the knowledge of which field to read to determine the format of the next message component, and if the data within the body of the message has a single format such that a single parser can parse the entire contents of the message body. The class name of the required parser can be included in a format field of a message descriptor or another header of the message, and this can be read to select and then invoke the appropriate parser.
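The chained selection described above may be sketched as follows. This is a minimal illustration under stated assumptions: the header syntax (name-value pairs terminated by `|`), the use of a field named `Format` to identify the next component, the format names `HDR` and `TEXT`, and the function names are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each specific parser returns (parsed component, format of next component, unparsed remainder).
ParseResult = Tuple[object, Optional[str], str]


def parse_header(data: str) -> ParseResult:
    """Parse one header and read its 'Format' field to identify the next component."""
    header, _, rest = data.partition("|")
    fields = dict(item.split("=", 1) for item in header.split(";"))
    return fields, fields.get("Format"), rest


def parse_text_body(data: str) -> ParseResult:
    """Parse a single-format body; nothing follows it."""
    return data, None, ""


PARSERS: Dict[str, Callable[[str], ParseResult]] = {
    "HDR": parse_header,
    "TEXT": parse_text_body,
}


def parse_message(message: str, first_format: str = "HDR") -> List[object]:
    """Parser-selector: invoke one specific parser per component, in sequence."""
    results: List[object] = []
    fmt: Optional[str] = first_format
    remaining = message
    while fmt is not None:
        parser = PARSERS.get(fmt)
        if parser is None:
            raise ValueError(f"no parser registered for format {fmt!r}")
        # The selected parser reports the next format back to the selector.
        parsed, fmt, remaining = parser(remaining)
        results.append(parsed)
    return results


components = parse_message("Format=HDR;Hop=1|Format=TEXT;Hop=2|hello world")
```

Every component in this sketch costs one return to the selector loop, and each parser must know which field names the format of its successor.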
A problem arises with the above solution when a message body includes multiple nested data formats, since then the reliance on the single parser-selector to call the appropriate program at the required time involves an excessive number of communication flows between the specific selected parsers and the parser-selector. It also requires the parser-selector to be able to invoke a suitable specific parser for all formats and for unpredictable nested structures, such that the selector needs a detailed knowledge of an ever increasing number of formats. The problems of this approach will become increasingly apparent as the number of message data formats and the complexity of message contents grow with increasing systems and business integration.
Thus, there remains a requirement for an efficient solution to parsing of messages which include multiple data formats, especially when the different formats can be nested within one another and the structure of the message is not known in advance of its receipt and analysis by a messaging program. The problems of known solutions are especially acute for messages in which either individual components or the sequence of components do not have a predictable structure, since then a message analysis is required as part of the run-time operation of the messaging program before it is possible to select a parser to analyse a next component of the message.