Content-based publish/subscribe messaging requires access to arbitrary message fields in each network node in order to route messages. Messages arrive as byte streams and only a few of the message's fields need to be accessed. However, the fields used to make routing decisions may be anywhere in a complex structured message. The property of random access to fields in a byte stream enables routing decisions to be organized optimally without regard to the order in which information is extracted from the byte stream, and it completely avoids any overhead associated with parsing information that isn't needed. The same property of random access to information structures stored in a byte stream form is useful in other systems as well, for example, database systems.
It is well known that untagged binary formats can provide constant time random access to fields in a byte stream by using offset calculations (perhaps indirected through offsets stored in the byte stream). However, this only works when the information structure is “flat” (does not involve any nesting of information). In practice, most information structures are not flat.
An information structure with a flat structure may be characterized as a tuple (or “structure” or “record”). The schema for such an information structure calls for a fixed sequence of fields. In this description, we use the notation [ . . . , . . . , . . . ] for tuple schemas. So, [int, string, boolean] might be the schema for an information structure containing an integer followed by a string followed by a boolean.
Ways in which information structures such as messages nest information and therefore deviate from flatness include at least the following.
Tuples may be nested. That is, the schema for an information structure might be [int, [int, string, [string, boolean]]].
Any schema element may be repeated zero or more times, forming a list. In this description, we use the notation *( . . . )* for a list in a schema. So, *(int)* means a list of zero or more integers. A list of tuples (often called a “table” or “relation”) is also possible. So, *([int, string, boolean])* is the schema for a table with three columns (an integer column, a string column and a boolean column) and zero or more rows. In most relational databases, each row is a flat structure. But, in messages and advanced databases, each row may have nested tuples and embedded tables, with no intrinsic limit to how deep such nesting can go. Tuples and lists must be allowed to nest in arbitrary ways to accurately describe information structures in general.
Information structures may be recursive. For example, a field of a tuple may be defined as another instance of the tuple itself or of an encompassing tuple or list (this cannot be illustrated readily with the present notation).
Information structures may include variants in addition to tuples and lists. A variant indicates that either one type of information or another (not both) may appear. Information structures may also include dynamically typed areas in which any kind of information may appear.
It is common to define certain columns of a table as key columns. A lookup in the information structure requires finding a particular value in a particular column of the table, after which only that row (or only a specific field from the row) is accessed. In a database, an index might be built in order to do this efficiently. In messages, the tables are rarely large enough to benefit from a precomputed index, and transmitting such an index in the message adds unacceptable overhead. So, for utility in the messaging domain, a processor should be able to scan just the key column (sequentially) and then randomly access just the information in its row.
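The lookup pattern just described can be sketched as follows. This is a minimal illustration only, assuming a hypothetical byte layout (a row count, then the key column stored contiguously, then the remaining fixed-length fields of each row); the names encode_table and lookup and the field widths are assumptions made for the example and are not taken from any particular system.

```python
import struct

def encode_table(rows):
    """Serialize rows of (key: int, quantity: int, active: bool).

    Assumed layout: 4-byte row count, then all keys (4 bytes each),
    then the remaining fields of each row (5 bytes per row).
    """
    out = struct.pack(">I", len(rows))
    out += b"".join(struct.pack(">i", key) for key, _, _ in rows)
    out += b"".join(struct.pack(">i?", qty, active) for _, qty, active in rows)
    return out

def lookup(buf, key):
    """Scan only the key column, then jump directly to the matching row."""
    (count,) = struct.unpack_from(">I", buf, 0)
    keys_start = 4
    rest_start = keys_start + 4 * count
    for i in range(count):
        (candidate,) = struct.unpack_from(">i", buf, keys_start + 4 * i)
        if candidate == key:
            # Offset calculation: each remaining-row record is 5 bytes wide.
            return struct.unpack_from(">i?", buf, rest_start + 5 * i)
    return None

buf = encode_table([(10, 100, True), (20, 200, False), (30, 300, True)])
found = lookup(buf, 20)
missing = lookup(buf, 99)
```

No bytes belonging to non-matching rows are read other than their keys, which is the behavior the messaging domain calls for.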
The offset-calculation techniques known to provide constant time access to completely flat information structures like [int, string, boolean] are readily extended to encompass nested tuples (with no lists) such as [int, [int, string, [string, boolean]]] by treating such a structure as if it were [int, int, string, string, boolean]. This is what an optimizing compiler does, for example, when compiling code for nested struct declarations in the C language.
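The flattening of nested tuples and the resulting offset calculation can be sketched as follows. This is an illustration under stated assumptions: tuple schemas are modeled as nested Python lists of type names, and the byte widths in SIZES are hypothetical (in particular, each string is assumed to occupy a fixed-length 4-byte slot).

```python
# Hypothetical fixed widths in bytes; the string width assumes a fixed-length slot.
SIZES = {"int": 4, "string": 4, "boolean": 1}

def flatten(schema):
    """Flatten a nested tuple schema (nested lists) into one flat field list."""
    flat = []
    for element in schema:
        if isinstance(element, list):
            flat.extend(flatten(element))
        else:
            flat.append(element)
    return flat

def offsets(flat_schema):
    """Compute the byte offset of each field from the start of the tuple."""
    result = []
    position = 0
    for type_name in flat_schema:
        result.append(position)
        position += SIZES[type_name]
    return result

nested = ["int", ["int", "string", ["string", "boolean"]]]
flat = flatten(nested)
field_offsets = offsets(flat)
```

Once the offsets are computed from the schema, any field can be reached in constant time regardless of how deeply it was nested in the original schema.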
A tuple containing fields of varying length requires some pointer indirection in order that all the offsets are still known. For example, if int and boolean have a fixed-length representation but string does not, then we might represent the two string values in [int, int, string, string, boolean] as fixed-length pointers to strings stored elsewhere in memory. That way, the last two fields of the tuple are still at a fixed distance from its start (which is how programming languages solve the problem). It is well-known that a pointer to elsewhere in memory can be represented as a stored offset to elsewhere in a byte stream. So, this issue is solvable for byte streams as well as computer memories. Solutions like this are embodied in many Internet protocols to speed up access to information following a varying length field.
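The use of stored offsets for varying length fields can be sketched for the schema [int, int, string, string, boolean]. This is only one possible encoding, chosen for illustration: the big-endian layout, the 4-byte offset slots, the 4-byte length prefix on each string, and the function names are all assumptions of the example.

```python
import struct

# Fixed part: int, int, offset of string 1, offset of string 2, boolean.
# With ">" there is no padding, so the fixed part is 17 bytes long.
FIXED = struct.Struct(">iiII?")
BOOLEAN_OFFSET = 16  # 4 + 4 + 4 + 4

def encode(a, b, s1, s2, flag):
    """Write the fixed part, then the length-prefixed strings after it."""
    d1 = s1.encode("utf-8")
    d2 = s2.encode("utf-8")
    off1 = FIXED.size
    off2 = off1 + 4 + len(d1)
    fixed = FIXED.pack(a, b, off1, off2, flag)
    return (fixed
            + struct.pack(">I", len(d1)) + d1
            + struct.pack(">I", len(d2)) + d2)

def read_boolean(buf):
    """The boolean stays at a fixed distance from the start of the tuple."""
    return struct.unpack_from(">?", buf, BOOLEAN_OFFSET)[0]

def read_string(buf, slot):
    """Follow the stored offset found in the fixed-length slot."""
    (offset,) = struct.unpack_from(">I", buf, slot)
    (length,) = struct.unpack_from(">I", buf, offset)
    return buf[offset + 4:offset + 4 + length].decode("utf-8")

buf = encode(7, 9, "hello", "world!", True)
flag = read_boolean(buf)
second = read_string(buf, 12)  # slot of the second string
```

Reading the boolean or the second string does not require scanning the first string, even though its length is not known in advance.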
A simple table (where each row is flat because there are no nested tuples or lists) can be stored in either row order or column order. Varying the storage order for simple multi-dimensional arrays is a well-known technique for optimizing compilers. Relational databases often store tables in column order, since this can improve scan time for key columns that lack indices. However, in messaging, the representation is usually a tree structure and serialization of messages is done by recursive descent, which results in storing all tables in row order. In any case, the well-known technique of storing tables in column order must be extended in non-obvious ways to be useful when schemas use arbitrary nesting of lists within tuples within lists.
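The difference between row order and column order for a simple table can be illustrated with in-memory values. This sketch covers only the well-known flat case with schema *([int, string, boolean])*; it does not attempt the extension to nested lists and tuples, and the function names are hypothetical.

```python
def to_column_order(rows):
    """Regroup a row-ordered table so that each column is contiguous."""
    return [list(column) for column in zip(*rows)]

def to_row_order(columns):
    """Inverse transformation: rebuild the rows from the columns."""
    return [tuple(row) for row in zip(*columns)]

table = [(1, "a", True), (2, "b", False), (3, "c", True)]
columns = to_column_order(table)
key_column = columns[0]  # can be scanned without touching other columns
```

In column order, a scan of the key column touches only the values of that column, whereas in row order the scan must step over the other fields of every row.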
Schemas whose structure is inconvenient can sometimes be transformed into isomorphic schemas that are more convenient. The flattening of [int, [int, string, [string, boolean]]] to [int, int, string, string, boolean] is an example of one such isomorphism. The same kind of flattening can be applied to variants. It is also known to those skilled in the field of type theory that tuples can be distributed over variants to yield an isomorphic schema. If we use the notation {int|boolean} to mean the variant whose cases are int or boolean, then [string, {int|boolean}] is isomorphic to {[string, int]|[string, boolean]}. This observation has been used to improve message processing time in IBM web sites employing the Gryphon system since 2001, and also in the IBM Event Broker product.
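The distribution of a tuple over its variants can be sketched as follows. This is an illustration only: the Variant class and the distribute function are hypothetical names, schemas are modeled as Python lists of type names, and the sketch assumes the variants appear directly as fields of a flat tuple (no nesting).

```python
from itertools import product

class Variant:
    """A schema element whose value is exactly one of the given cases."""
    def __init__(self, *cases):
        self.cases = list(cases)

def distribute(tuple_schema):
    """Turn a tuple containing variants into the cases of one variant of tuples.

    Returns a list of variant-free tuple schemas, one per combination of cases.
    """
    choices = [field.cases if isinstance(field, Variant) else [field]
               for field in tuple_schema]
    return [list(combination) for combination in product(*choices)]

# [string, {int|boolean}] becomes {[string, int]|[string, boolean]}
cases = distribute(["string", Variant("int", "boolean")])
```

Each resulting case is a variant-free tuple, so the layout of each case can be determined separately by the offset calculations used for flat tuples.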