Names, chemical formulas and structure diagrams are the language of chemistry. In any subject where objects can be expressed in a variety of languages, there is an interest in and a need for translation between the different expressions that describe those objects. A need for nomenclature arises when chemists have to communicate the information on compounds by spoken or written word, in the latter case usually where a structural diagram (unambiguous and unique) is for some reason inappropriate or cannot be used.
The nomenclature used to describe chemical structures is a language and thus may be handled, when translated into another representation, using methods of linguistics1-3. The human mental process for arriving at the structure from a chemical name appears to be a rule-based linguistic approach. As in linguistics, there is a struggle between pragmatists, who regard as satisfactory any word that conveys the intended meaning, and purists, who insist that rules ought to be followed, with, unfortunately for the computer, the pragmatists having the advantage. Thus, the dedicated organizational body, the Commission on the Nomenclature of Organic Chemistry (CNOC) of the International Union of Pure and Applied Chemistry (IUPAC) (http://www.iupac.org), which since 1938 has been responsible for inventing, monitoring, and revising the recommendations that serve as guidelines to systematic nomenclature, tries to see nomenclature as a whole, codifying already existing usage into rules and only very occasionally suggesting novelties.4 Though the system has been developed over 110 years (initiated by the historic “Geneva Conference” in 1892), it is far from perfect and has not become a universal standard.5
In the meantime the CNOC has ceased to exist and was replaced in January 2002, also within IUPAC, by the Division of Chemical Nomenclature and Structure Representation (http://www.iupac.org/divisions/VIII/), whose main tasks are to coordinate efforts at nomenclature systematization and to supervise all relevant activities and projects of the chemical community directed toward unambiguous structure representation(s). Typically this includes computer representation6-8 for local computing as well as for distributed computing on intranets and the Internet (mainly web-based).
For the purpose of clarity in the selection of preferred names, the two most important producers and distributors of chemical information (Chemical Abstracts Service (www dot cas dot com) and Beilstein Institute (the Beilstein file is now provided and maintained by MDL, at www dot mdli dot com)) devised undocumented ad hoc sub-rules, which only amplified the problem of uniquely naming organic compounds. These rules were necessary since IUPAC recommendations frequently allow more than one name for a given chemical compound. As a result, both institutions revised the IUPAC system and created their own “systematic” IUPAC-compatible (rather than IUPAC-sanctioned) nomenclatures. In addition, trivial and trade names, being shorter and more concise, have successfully replaced systematic names for a number of chemical compounds which are of commercial importance or are the subject of public concern9 (e.g., pharmaceuticals, insecticides, and pollutants). Both CAS and Beilstein claim to conform to the IUPAC rules, and in general this is true. The IUPAC recommendations were consciously formulated to allow considerable freedom in their application, and in many cases are not fully defined to their logical conclusion. In practice, this means that any given structure does not necessarily relate to one unique correct name. Thus, the specific “dialects” supported by CAS and Beilstein can still represent systematic nomenclature no matter how far apart they are. This, as far as computer usage is concerned, is the greatest weakness of the nomenclature.
The average user cannot find clearly defined “dialects” of IUPAC. This has also hindered solving the difficulties in establishing an unambiguous nomenclature standard. As long as such a standard does not exist, the practicing chemist will find himself to a great extent alienated from systematic nomenclature. But even if a sort of consensus is achieved and an unambiguous nomenclature standard is worked out and adopted, there is still the problem of nomenclature complexity. It is generally accepted that IUPAC nomenclature is cumbersome, with a very large number of rules, which are often very difficult to follow. Frequent alternatives allowed in name assignment, contradictory recommendations, the lack of rules in certain areas, and the exaggerated freedom in interpretation of the rules lead to ambiguity and specific nomenclature chaos.
One basic problem of naming is that a correct name is not necessarily the only correct name for a structure. To complicate matters, the rules for arriving at a correct name, as discussed above, are complex, and very few chemists can handle them. Even worse, the important centers for chemical documentation in the world are not uniform, either internally or externally, in their treatment of the rules. This is not the result of carelessness or lack of effort; it is simply a reflection of the difficulty of agreeing how a multi-dimensional problem can be forced into a single, universal text description. The structure shown in FIG. 5 illustrates the problem.
In principle, there is nothing wrong with a multiplicity of names for structures. As long as each name is an adequate representation of the structure, there are few real problems, apart from ensuring that chemists are reasonably familiar with the rules in a passive sense (i.e., can interpret a name, as opposed to creating one). However, the traditional (attempted) use of nomenclature has been much greater in its scope. Before computerization, the ideal was to index each significant structural sub-unit of the structure using nomenclature. The structure is intuitively broken down into areas of relevance (acetaldehyde, benzene, ethane), and these are bound together into a text by use of locational parameters (1, 2, α). This approach is based on chemical experience, and is by no means bad. But it contains the limits of its own applicability insofar as the vocabulary used has never been fully standardized in a strictly defined sense, and the intuitive subdivision has never been fully cleared of internal contradictions. This has meant that the use of indices based on names or parts of names remains to this day a hazardous business. To use the above example, it is not immediately obvious to most chemists whether they should be looking under A (for acetaldehyde), B (for benzene), or E (for ethane). A computer system able to generate names algorithmically, using the same rules of relevance, would always lead to the same index name, thus solving the problem once and for all7. Such names could then be reversibly and unambiguously translated back into the same structural diagram.
This is unfortunately not the case at all. Systematic nomenclature as recommended by IUPAC failed to become a standard. As discussed above, trivial or trade names, being shorter and more concise, have successfully replaced systematic names for a number of chemical compounds which are of commercial importance or the subject of public concern. Any comprehensive computer program designed to deal with real-life chemical nomenclature has to be able to convert the semisystematic, asystematic, obsolete, ambiguous, and otherwise “corrupted” names that are the reality of present chemical communication.
Translation of chemical names into structures can in general be treated as a problem of computerized syntactic and semantic analysis of nomenclature as an artificial language. In order to achieve such an analysis, a formal grammar of nomenclature must first be derived from informal rules. From the linguistic point of view, it is an interesting observation that the basic language of all naming systems in organic chemistry is essentially the same. While two chemists will name the same compound differently, both will be able to draw the same structural diagram. In this sense, the above-mentioned use of different naming practices corresponds to the problem of handling dialects, rather than a treatment of separate and distinct languages.
A formal grammar of this chemical language requires the creation of a dictionary of fragments (so-called morphemes) from which the names can be built, and the elucidation of appropriate syntax rules to govern that building.2 The fragments are then grouped into numbered classes, and rules are written in terms of these classes to define phrases, so that each rule is referred to by its associated phrase name. For example, one rule can simultaneously allow for the fragments “meth”, “eth”, “prop”, etc., in the same context. The morphemes must then be localized and recognized within a supplied name. The process includes first parsing the name by breaking it into the longest possible text fragments and then submitting the fragments to lexical analysis in order to identify them, according to a set of syntax rules, with use of the pre-defined dictionary9. Taking into account the numerous semisystematic fragments retained by IUPAC (e.g., acetic acid instead of the systematic ethanoic acid), a functioning parser will have to work with an extremely large dictionary of morphemes. Once a valid name (the problem of allowed valid names has already been mentioned above) has been successfully parsed, appropriate routines are invoked in order to process semantic information as each syntax rule is obeyed. The morphemes localized in the name are then associated with corresponding structural fragments stored in a compact form as small connection tables. These are then combined and ordered into the final complete connection table (CT) corresponding to the complete name. Graphical routines transform the connection tables into structural diagrams and deliver them as output on terminals or in printed form10.
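The dictionary-driven parsing just described can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny morpheme dictionary, the class names, the single syntax rule, and the function names `tokenize`, `is_well_formed`, and `chain_connection_table` are hypothetical and do not reproduce any of the systems cited here. The sketch breaks a name into the longest matching dictionary fragments, checks the resulting sequence of morpheme classes against one toy rule, and emits a connection table for the carbon skeleton only.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    kind: str  # morpheme class, or "locant" for a position number


# Hypothetical toy dictionary: morpheme -> class. A real dictionary
# would be far larger, as the text above points out.
MORPHEMES = {
    "meth": "stem", "eth": "stem", "prop": "stem",
    "but": "stem", "pent": "stem", "hex": "stem",
    "an": "saturation", "en": "saturation", "yn": "saturation",
    "e": "suffix", "ol": "suffix", "al": "suffix", "one": "suffix",
    "chloro": "prefix", "bromo": "prefix", "hydroxy": "prefix",
}
CHAIN_LENGTH = {"meth": 1, "eth": 2, "prop": 3, "but": 4, "pent": 5, "hex": 6}
MAX_LEN = max(len(m) for m in MORPHEMES)

# One toy syntax rule over class codes:
# (locant? prefix)* stem saturation locant? suffix
CODES = {"locant": "L", "prefix": "P", "stem": "S",
         "saturation": "A", "suffix": "X"}
RULE = re.compile(r"(L?P)*SAL?X")


def tokenize(name: str) -> list[Token]:
    """Split a name into locants and the longest matching morphemes."""
    name = name.lower()
    tokens: list[Token] = []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch in "-, ":          # delimiters carry no meaning in this sketch
            i += 1
            continue
        if ch.isdigit():         # a run of digits is one locant
            j = i
            while j < len(name) and name[j].isdigit():
                j += 1
            tokens.append(Token(name[i:j], "locant"))
            i = j
            continue
        # Try the longest candidate fragment first, then shorter ones.
        for length in range(min(MAX_LEN, len(name) - i), 0, -1):
            piece = name[i:i + length]
            if piece in MORPHEMES:
                tokens.append(Token(piece, MORPHEMES[piece]))
                i += length
                break
        else:
            raise ValueError(f"unrecognized fragment at position {i}: {name[i:]!r}")
    return tokens


def is_well_formed(tokens: list[Token]) -> bool:
    """Check the sequence of morpheme classes against the toy rule."""
    return RULE.fullmatch("".join(CODES[t.kind] for t in tokens)) is not None


def chain_connection_table(tokens: list[Token]):
    """Connection table for the carbon skeleton named by the stem.

    Simplification: every bond is written as single (order 1); the
    saturation morpheme, prefixes, and suffixes are not applied.
    """
    stem = next(t for t in tokens if t.kind == "stem")
    n = CHAIN_LENGTH[stem.text]
    atoms = ["C"] * n
    bonds = [(k, k + 1, 1) for k in range(n - 1)]
    return atoms, bonds
```

For example, `tokenize("propan-2-ol")` yields the fragments `prop`, `an`, `2`, `ol`, which satisfy the toy rule. A practical converter would additionally attach the structural fragments for prefixes and suffixes at their locants and merge all the small connection tables into the final one.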
Conversions of the sort outlined above have a long tradition. The first use of a computerized grammar analysis process, with a very limited dictionary of nomenclature terms in comparison with the broad range of constructions allowed in IUPAC nomenclature, was by Elliot.11 Later, practical operational computer programs based on such procedures were reported by CAS12, where they were used to validate the CAS index for the CAS Index File. At approximately the same time Stilwell13 and later Cooke-Fox et al14 reported a very interesting grammar-based nomenclature translation for steroid nomenclature. Another system, which, however, excluded semisystematic and trivial fragments from the dictionary of morphemes, was reported by Carpenter15. The most advanced research to date on the grammar-based translation of IUPAC nomenclature into structural diagrams has been conducted by the team at the University of Hull2,9-10,14,16-17.
The first functioning practical system translating names into structures (called VICA) dates back to 1986 and was developed by Domokos and Goebels for the IBM mainframe computer at the Beilstein Institute in Frankfurt/Main, Germany. It was successfully applied at Beilstein (reaching a success rate of up to 95%) for Beilstein nomenclature only and was never used outside Beilstein. Except for internal Beilstein memos and technical documents, there are no reviewed publications to which one might refer. The format of the input chemical name accepted by VICA (written in the Pascal and Fortran programming languages) was strictly defined for the syntax of the systematic nomenclature as used in the “Beilstein dialect” (specific delimiters, specific handling of post-suffixes such as esters and amides, specific syntax of multicomponent structures, etc.).
Another interesting attempt in the area of algorithmic name conversion is ROXY, a system designed and programmed in 1993 by Lawson.18 This Visual Basic program works with a very small dictionary (approximately 500 entries) of pre-defined name fragments, very successfully generates connection tables for fused and annelated ring systems using a strictly algorithmic mechanism (without database lookup), and reaches, for real-life names, a success rate of up to 21%.
Recently a few interesting practical (and commercially available) computer systems translating nomenclature into connection tables were released. The first one comes from CambridgeSoft Corporation, Cambridge, Mass., USA and is known under the name “Name=Stru”. Its latest version is included in the structure editing package ChemDraw Ultra and the chemical office suite ChemOffice Ultra.19 The success rate (the ratio of correctly generated structures to the total number of structures in the test sample) as reported by Brecher in his paper20 varied from as high as 92% to as low as 33.5%, depending on the quality of names in the source test sample.
The “Name=Stru” system has a few limitations. Cahn-Ingold-Prelog (CIP) stereochemistry (R/S, E/Z) is not supported, and some classes of bridged ring systems are neglected. The system is unable to handle names of polymers and those of inorganic coordination complexes. Subtractive nomenclature (de-, des-, etc.) also remains entirely unsupported.
The paper by Brecher includes a detailed description and classification of problems encountered by anyone attempting to design an automatic nomenclature converter. These problems—according to Brecher—arise mainly from the ambiguity of current nomenclature practices.
Advanced Chemistry Development (ACD Labs, Toronto, Canada) released another program of this type. This program is able in many cases to exceed the success rate of the “Name=Stru” program. “ACD/Name to Structure” is offered as an interactive or a batch version (a conversion session can be launched not only for a single name, but also for a file of input names). The program is claimed by ACD Labs21 to be able to generate chemical structures for names of most classes of general organic compounds, many derivatives of more than 150 basic natural product parent structures, and semisystematic and trivial names of common organic compounds.
The batch version of the name converter from ACD Labs (“Name to Structure Batch”) generates structures from systematic and non-systematic chemical names of general organic, some biochemical, and some inorganic compounds. The input for this program can be native, ACD ChemFolder *.cfd format files, regular ASCII text files, or MDL *.db or *.sdf files. Recently, the functionality of the program was extended and Name to Structure Batch can also convert SMILES strings directly into chemical structures. The program is also available for UNIX platforms. This is particularly important since most of the intranet systems for small-scale chemical databases run on UNIX mini-computers.
Yet another name-to-structure converter comes from ChemInnovation Software, Inc., a company based in San Diego, Calif. The program is named NameExpert. The program is more academic than practical (mainly due to an unacceptably low success rate).22 The program understands strict systematic IUPAC organic nomenclature. For an input IUPAC chemical name, it creates the corresponding structure in one of three styles: shorthand, Kekule, or semi structural formula. In addition, it can add labels to appropriate atoms and groups. The newest version now supports limited stereochemistry, and includes 8000 drug names and structures.
To make the list of available name-to-structure software packages more complete, yet another program must be mentioned, namely IUPAC DrawIt, released by Bio-Rad Laboratories Corporate, Hercules, Calif., USA. It cannot be considered under any circumstances as a nomenclature tool for practical corporate use.23 The main restriction is the maximum number of heavy atoms allowed in the resulting output structure, which is set to 10. The program is relatively effective for strictly systematic IUPAC names, but for common nomenclature like that found in today's literature, the program can offer no more than a single-digit success rate. Thus it can under no circumstances be considered an alternative to, or a competitor of, Name=Stru or ACD/Name to Structure.
Chemical nomenclature, and organic nomenclature in particular, published in the literature (journals, patents, technical documentation, etc.) is generally of poor quality. Published rules (e.g., IUPAC) are commonly ignored, misinterpreted, corrupted, or extended at will. The nomenclature which today is regarded as “systematic” is defined by the consensus of users' opinions. A “correct name” does not exist. There are only “common sense” naming practices, e.g., those confined within the Beilstein or CAS “dialects”.
Previous software for extracting information from text often produced unacceptable results in terms of accuracy and comprehensiveness. In order to produce extractions with acceptable accuracy and comprehensiveness, a human indexer would be used. However, the use of a human indexer is time consuming and expensive.