It has been proposed to use information extraction procedures to reduce human effort when preparing records for inclusion in a database. For example, co-pending International Patent Application No. PCT/GB2007/001170 discloses information extraction procedures which make use of a computer-user interface that facilitates review, by a human curator, of automatically extracted data. The automatically extracted data takes the form of annotation data concerning entities, and relations between entities, in digital representations of documents comprising natural language text. The digital representations of documents are displayed to a user with individual instances of automatically identified entities and relations highlighted at their location within the documents. The annotation data may be edited by a curator, or selected annotation data may be used to pre-populate fields in a provisional record for review and editing by a curator.
US 2006/0053170 discloses a system for parsing and/or exporting one or more multi-relational ontologies by applying a set of export constraints to one or more master ontologies. This is an example of a system in which unstructured data sources are text-mined and curated.
It has been found that the use of automated information extraction methods to extract provisional data, which is then presented to a human curator for review, speeds up the process of creating databases of information relating to documents including natural language text. However, previous work has assumed that conventional information extraction procedures are suitable for this task without modification, and that all human curators respond to information in the same way. The present invention aims to provide improved information extraction procedures to facilitate the preparation of databases relating to information derived from natural language text.
For illustrative purposes, the invention will be discussed further using examples taken from the analysis of biomedical scientific literature. However, the invention is applicable to the analysis of any other type of document which includes natural language text.