The present invention relates to the natural language processing of biomedical text. In particular, the invention relates to methods and data-processing apparatus for carrying out term normalization of individual mentions of subcellular entities in natural language biomedical texts. The methods and apparatus address the problem of disambiguating between subcellular entities from organisms of different species that may be represented in natural language biomedical text by identical character strings.
Within this description and the appended claims, “term normalization” refers to a natural language processing procedure in which a mention of an entity is assigned a normalized identifier which uniquely denotes the entity to which that mention refers. Thus, different character strings which may be used to refer to the same entity in natural language text can be represented by the same identifier in subsequent processing. Typically, the identifier will be an identifier of data concerning that entity in a database. The database will typically include a character string which represents a canonical name for the entity.
The present invention relates in particular to the term normalization of mentions of subcellular entities in natural language biomedical text documents, such as scientific papers, including conference proceedings, and patent publications.
Within this description and the appended claims, the term “subcellular entities” includes genes and other DNA sequences (such as non-coding DNA), proteins (irrespective of length), other macromolecules (for example, fatty acids, polysaccharides, and RNA such as mRNA, tRNA, rRNA, dsRNA and non-coding RNA), and also intra-cellular structures, such as organelles, whether or not they are located within a cell in the context of the document which is being analysed and whether or not they originate from a cell. For example, the term includes viral proteins, genes, macromolecules and structures.
It has been found that mentions of subcellular entities in natural language biomedical text are frequently ambiguous. Ambiguity arises from several sources. Biologically relevant entities may have multiple names which are in common use, such as acronyms, abbreviations, synonyms and morphological variations; a single name can refer to more than one biological entity; and common English words are often used to refer to subcellular entities. The problem of ambiguity is especially severe when the text which is to be analysed may relate to one or more of a number of species of model organisms. In practice, the biomedical literature discusses subcellular entities from a great many species.
By way of example, Table 1 shows three abbreviated records taken from the RefSeq protein database (see the ncbi website with the extension “nlm.nih.gov/RefSeq”). The protein known as interleukin 5 precursor (i.e., NP_000870) may well occur in natural language biomedical text as one of its synonyms, such as interleukin 5 or IL5, and sometimes even as Interleukin-5 or IL-5, which are not recorded here. Furthermore, the term IL5 may denote any of the three proteins shown below, which presents a problem for a natural language term identification system.
TABLE 1
RefSeq ID | Species | Protein Synonyms
NP_000870 | Homo sapiens | interleukin 5 precursor; interleukin 5; IL5
NP_068606 | Rattus norvegicus | interleukin 5; IL5
NP_783851 | Homo sapiens | interleukin 5 receptor, alpha isoform 2 precursor; IL5
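The ambiguity shown in Table 1 can be illustrated with a minimal Python sketch. The sketch builds a lookup from each recorded synonym to the records that list it, using only the three records of Table 1. The names `TABLE_1`, `build_lexicon` and `candidates` are hypothetical and chosen for illustration; they are not part of the invention or of any RefSeq interface, and exact case-insensitive string matching is an assumption made for brevity.

```python
from collections import defaultdict

# Records transcribed from Table 1: (RefSeq ID, species, recorded synonyms).
TABLE_1 = [
    ("NP_000870", "Homo sapiens",
     ["interleukin 5 precursor", "interleukin 5", "IL5"]),
    ("NP_068606", "Rattus norvegicus",
     ["interleukin 5", "IL5"]),
    ("NP_783851", "Homo sapiens",
     ["interleukin 5 receptor, alpha isoform 2 precursor", "IL5"]),
]


def build_lexicon(records):
    """Map each lower-cased synonym to the (identifier, species) pairs listing it."""
    lexicon = defaultdict(set)
    for refseq_id, species, synonyms in records:
        for synonym in synonyms:
            lexicon[synonym.lower()].add((refseq_id, species))
    return lexicon


def candidates(lexicon, mention, species=None):
    """Return the sorted candidate identifiers for a mention.

    If a species is given, only records for that species are kept.
    Matching is exact after lower-casing (an illustrative assumption).
    """
    hits = lexicon.get(mention.lower(), set())
    return sorted(rid for rid, sp in hits if species is None or sp == species)


lexicon = build_lexicon(TABLE_1)

# "IL5" alone is listed by all three records.
print(candidates(lexicon, "IL5"))
# Knowing the species narrows the candidates, but not always to one.
print(candidates(lexicon, "IL5", species="Rattus norvegicus"))
print(candidates(lexicon, "IL5", species="Homo sapiens"))
# Variants that are not recorded as synonyms, such as "IL-5", find nothing.
print(candidates(lexicon, "IL-5"))
```

In this sketch the mention IL5 yields three candidate identifiers, and restricting the lookup to a single species still leaves two candidates for Homo sapiens. Species information therefore reduces, but does not by itself remove, the ambiguity in this example.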
To date, the problem of ambiguity between species has been considered to present a serious problem for natural language processing. Research work has been carried out to study the ambiguity in biological names. For example, Chen et al. investigated gene name ambiguity and quantified the extent to which the use of the same gene names across various model organisms leads to gene name ambiguity in general (Chen, L., Liu, H., and Friedman, C., Bioinformatics, Vol. 21, No. 2, 2005, pp 248-256). Chen et al. encouraged authors of biomedical publications and journal editors to use only official symbols and to avoid using aliases, particularly those that coincide with other English language words, and to revise naming conventions to reduce ambiguity. Chen et al. proposed the development of techniques to categorise an article based on domain or species, to help reduce ambiguity. However, even if an effective method of categorising articles based on domain or species were developed, this would not, in itself, facilitate the accurate normalization of terms in a document relating to organisms from more than one species.
The normalization of gene names across data sets relating to several different species was addressed by BioCreAtIvE I, Task 1B, Gene Normalization (see the biocreative.sourceforge website with the extension “.net/biocreative_1_task1b.html”). However, in this task, each analysed abstract related only to a single species, which was known. It is a separate problem to accurately normalize terms in a document which may relate to organisms from more than one species.
Accordingly, the present invention seeks to address the problem of carrying out term normalization of mentions of subcellular entities in biomedical natural language text documents while disambiguating, where possible, between species. Of course, one skilled in the art will appreciate that term identification is not a perfect science and it will not always be possible to correctly identify the species of a mentioned subcellular entity.