Many organizations are now equipped to receive messages, including orders and enquiries, by electronic means. Typically, such electronic messages take the form of text-based messages, for example e-mails, delivered by a global computer network, for example the Internet, or by a telecommunications network, for example a mobile telephone network. Each message must be processed and dealt with appropriately. In many cases, the volume of received electronic messages is relatively high and it is considered inefficient to process each message manually.
For this reason, it is known for electronic message processing systems, typically in the form of a computer system, to employ a text analyzer, such as IBM's Mail Analyzer, to analyze the content of electronic messages in order to classify, or categorize, each message according to its content. Once a message has been categorized, the processing system sends it on to a human operator who has the skills necessary to deal with messages falling within the relevant category or categories. Alternatively, the computer system itself may be able to deal with messages falling within certain categories.
Typically, a text analyzer examines the text of each message in turn in search of one or more alphanumeric text strings, for example a word or sequence of words, which may be used to identify the purpose or nature of the message under examination. It is known for the text analyzer to operate in association with a rule engine to apply a set of rules to the message in order to determine how to categorize the current message.
By way of simplistic example, consider a banking organization which has a message processing system arranged to receive electronic messages in three different categories, namely: balance enquiry; request for funds transfer; and interest rate enquiry. In order to categorize each received message, a text analyzer in association with a rule engine applies a set of four rules to each message in turn. The first rule stipulates that if the text string “balance” appears in the message, then the message falls in the balance enquiry category. The second rule stipulates that if the text strings “funds” and “transfer” appear in the message, then the message falls in the funds transfer category. The third rule stipulates that if the message contains the text string “interest rate”, then the message should be categorized as an interest rate enquiry. The fourth rule stipulates that if none of the previous rules are satisfied, then the message is deemed unclassified. Clearly, a message may fall within more than one category.
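The four rules of this banking example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `categorize`, the category labels, and the use of case-insensitive substring matching are assumptions made for the example, not details of any particular text analyzer or rule engine.

```python
def categorize(message: str) -> list[str]:
    """Apply the four example rules in sequence and collect every match."""
    # Assumption: matching is case-insensitive and based on plain substrings.
    text = message.lower()
    categories = []

    # Rule 1: "balance" appears in the message.
    if "balance" in text:
        categories.append("balance enquiry")

    # Rule 2: both "funds" and "transfer" appear in the message.
    if "funds" in text and "transfer" in text:
        categories.append("funds transfer")

    # Rule 3: the string "interest rate" appears in the message.
    if "interest rate" in text:
        categories.append("interest rate enquiry")

    # Rule 4: none of the previous rules was satisfied.
    if not categories:
        categories.append("unclassified")

    return categories


print(categorize("Please transfer funds and quote your interest rate"))
# ['funds transfer', 'interest rate enquiry']
```

Because every rule is tested against every message, a single message can receive several categories, as in the usage line above. Plain substring matching is deliberately naive here; it would, for instance, also match "balance" inside "unbalanced".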
The message processing system may be arranged to distribute all balance enquiries, funds transfer requests and unclassified messages to an appropriate human operator, while being arranged to send out interest rate information itself.
Such systems are suited to processing relatively small volumes of messages spread over a relatively small number of categories but exhibit serious shortcomings when dealing with large volumes of messages and a large number of categories.
It is increasingly common for organizations to receive up to hundreds of thousands of electronic messages each day, each message requiring classification into one or more of typically hundreds of different categories. To perform message categorization, a text analyzer would typically apply a set of several hundred rules to each message, the final classification of each message being derived from a combination of the results of applying all of the rules to that message. Conventionally, a flat rule structure is employed, meaning that each rule is given equal weight and is applied in sequence to each message, one message at a time. This requires a large amount of computer processing power and can lead to unacceptable delays in dealing with incoming messages.
Further, for a complex taxonomy, the precision (i.e. the degree to which messages are categorized correctly) and recall (i.e. the degree to which a complete set of message categorizations is produced) are difficult to maintain at a high level.
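To make these two measures concrete, the following sketch computes them for a batch of categorized messages. It assumes the conventional information-retrieval definitions, which are taken here to correspond to the informal descriptions above: precision is the fraction of assigned categories that are correct, and recall is the fraction of expected categories that were actually assigned. The function name `precision_recall` and the sample data are hypothetical.

```python
def precision_recall(predicted: list[set[str]],
                     expected: list[set[str]]) -> tuple[float, float]:
    """Micro-averaged precision and recall over a batch of messages.

    predicted[i] holds the categories assigned to message i,
    expected[i] holds the categories it should have received.
    """
    # Category assignments that are both predicted and expected.
    correct = sum(len(p & e) for p, e in zip(predicted, expected))
    assigned = sum(len(p) for p in predicted)
    wanted = sum(len(e) for e in expected)

    precision = correct / assigned if assigned else 0.0
    recall = correct / wanted if wanted else 0.0
    return precision, recall


# Hypothetical results for three messages.
predicted = [{"balance enquiry"},
             {"funds transfer", "interest rate enquiry"},
             {"unclassified"}]
expected = [{"balance enquiry"},
            {"funds transfer"},
            {"interest rate enquiry"}]

precision, recall = precision_recall(predicted, expected)
print(precision, recall)  # 2 correct of 4 assigned; 2 correct of 3 expected
```

In this made-up batch, one spurious category and one missed category pull precision down to one half and recall down to two thirds, which illustrates how quickly both measures can degrade as the number of categories and rules grows.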
It is also problematic to administer a single set of rules over a complex organization having a number of different divisions or sub-organizations since one or more categories may need to be defined or characterized differently to account for differences in culture, regulation, market segmentation, brand specificity, or the like. Similar problems arise where more than one organization shares a single message processing system (and therefore a single set of rules) through, for example, an Application Service Provider (ASP). There is a potential for conflict over the rules in that the rules for one organization, or sub-organization, may affect the application of the rules of another organization, or sub-organization, and so lead to inappropriate categorizations.
This problem is compounded when the different organizations, or sub-organizations, need to be able to receive messages in different languages. A particular problem caused by multiple languages concerns the performance of lexical analysis (sometimes known as word-stemming) on the message under examination. Lexical analysis is desirable since it enables the text analyzer to recognize different forms of words, such as plurals and participles, and so helps messages to be categorized correctly. Lexical analysis is typically performed using a dictionary, but a conventional text analyzer can only operate with one dictionary at a time. If, for example, a text analyzer is initialized with an English dictionary, then any messages received in any other language cannot benefit from lexical analysis.
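The single-dictionary limitation can be illustrated with a toy sketch. The dictionary contents, the name `ENGLISH_DICTIONARY`, and the function `normalize` are all hypothetical; a real lexical analyzer would use a far larger dictionary and more careful tokenization.

```python
# Hypothetical toy dictionary mapping inflected English forms to a base form.
ENGLISH_DICTIONARY = {
    "balances": "balance",
    "transfers": "transfer",
    "transferring": "transfer",
    "transferred": "transfer",
    "rates": "rate",
}


def normalize(message: str, dictionary: dict[str, str]) -> list[str]:
    """Lower-case each word, strip simple punctuation, and replace it with
    its base form when the single loaded dictionary knows the word."""
    words = (w.strip(".,;:?!") for w in message.lower().split())
    return [dictionary.get(w, w) for w in words if w]


# English input benefits from the loaded dictionary.
print(normalize("Show my balances", ENGLISH_DICTIONARY))
# ['show', 'my', 'balance']

# French input passes through unchanged: the plural "soldes" is not reduced,
# so a rule keyed on a base form would not match it.
print(normalize("Afficher mes soldes", ENGLISH_DICTIONARY))
# ['afficher', 'mes', 'soldes']
```

With only the English dictionary loaded, the French message receives no normalization at all, so any rule that relies on base forms sees only the raw inflected words.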