The ever-increasing volume of information produced by society and industry has led to ever-increasing difficulties in storing, finding and analysing that information. Whereas there was a time when information, such as scientific and technical literature, could be adequately stored in printed form and indexed by hand, that time has passed and electronic storage, retrieval and analysis systems are an essential part of the modern world.
Some types of information processing can be adequately addressed by computerised analysis alone. For example, searchable directories of web pages can be automatically prepared without human intervention and used to store large volumes of information and to retrieve this information in response to queries, such as which web pages include specific words.
However, some information processing tasks cannot be automated, or cannot be automated to the standard which would be achieved by a human. For example, the accurate automatic analysis of documents comprising natural language text constitutes an especially difficult problem.
The automatic analysis of natural language text documents is addressed by the growing scientific field of natural language processing (NLP), also referred to as computational linguistics. NLP has been used to carry out tasks which previously had to be carried out by humans, but remains an imperfect science under continual development. Although it is often desirable to use automatic methods of analysing natural language, rather than human analysis, due to the cost and speed benefits of computerisation, there are many applications where human analysis remains essential.
One example of a field where there is a large volume of information, which would ideally be analysed automatically where possible, is the scientific literature, for example the biomedical scientific literature. In order to make new scientific discoveries and draw conclusions from existing data, it is desirable to be able to store and recall information concerning relations between biological entities which are mentioned in the scientific literature. For example, where a scientific paper provides evidence to support a hypothesis that a first protein interacts with a second protein in vivo, it is desirable to store that information in a searchable database. Such databases can be valuable aids to technical progress.
International Patent Application Publication Number WO 2005/017692 (Cognia Corporation) describes a relational database for use in biomedical research which includes information about entities (such as proteins, genes, compounds etc.) and interactions between these entities. Data concerning interactions is stored in the database along with references to scientific papers which provide evidence for the interactions. Thus, the database can be queried by users not only to find information about entities and interactions between entities, but also to identify relevant sources within the scientific literature. Data is entered into the database by human curators who read scientific literature and identify both the entities referred to in individual documents and the relations which are hypothesized, discussed or proven by data within those documents. A computer-user interface is provided to curators which allows them to input data by selecting options via an ontology browser which, amongst other data, defines normalised forms for the names of entities. Thus, the data entered by the curators uses standardised terms, which avoids entities being referred to by different names and thus improves the quality of the database.
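By way of illustration only, the general kind of database described above, in which entities are stored under normalised names and each interaction is stored with a reference to a supporting document, might be sketched as follows. The sketch is not taken from WO 2005/017692; the table names, column names and function names (`entity`, `interaction`, `build_demo_db`, `sources_for` and so on) are hypothetical and are chosen purely for the example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entity (
            id              INTEGER PRIMARY KEY,
            normalised_name TEXT NOT NULL UNIQUE,  -- one standard name per entity
            entity_type     TEXT NOT NULL          -- e.g. 'protein', 'gene'
        );
        CREATE TABLE interaction (
            id               INTEGER PRIMARY KEY,
            first_entity_id  INTEGER NOT NULL REFERENCES entity(id),
            second_entity_id INTEGER NOT NULL REFERENCES entity(id),
            source_document  TEXT NOT NULL         -- reference to the supporting paper
        );
        """
    )
    return conn


def add_entity(conn, normalised_name, entity_type):
    """Insert an entity under its normalised name and return its row id."""
    cur = conn.execute(
        "INSERT INTO entity (normalised_name, entity_type) VALUES (?, ?)",
        (normalised_name, entity_type),
    )
    return cur.lastrowid


def add_interaction(conn, first_id, second_id, source_document):
    """Record an interaction between two entities with its evidence source."""
    conn.execute(
        "INSERT INTO interaction (first_entity_id, second_entity_id, source_document)"
        " VALUES (?, ?, ?)",
        (first_id, second_id, source_document),
    )


def sources_for(conn, normalised_name):
    """Return the documents cited as evidence for any interaction of an entity."""
    rows = conn.execute(
        """
        SELECT DISTINCT i.source_document
        FROM interaction AS i
        JOIN entity AS e
          ON e.id IN (i.first_entity_id, i.second_entity_id)
        WHERE e.normalised_name = ?
        ORDER BY i.source_document
        """,
        (normalised_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

In such a sketch, the uniqueness constraint on the normalised name plays the role of the standardised terms: an attempt to enter the same entity twice under one name is rejected, so that every interaction involving that entity refers to a single record, and a query such as `sources_for` retrieves the literature sources through the interactions themselves.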
However, a disadvantage of the system described in WO 2005/017692 is that it requires a substantial amount of time to be spent by skilled curators to compile the database, which can be costly.
PCT/GB2007/001170 (ITI Scotland Limited) discloses an information extraction procedure in which annotation data concerning instances of entities in a digital representation of a document, including the location of the instances of entities within the digital representation of a document, is automatically prepared by information extraction apparatus and presented to a human curator for review, using a computer-user interface. This arrangement reduces the time required by human curators to compile a database.
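Purely as an illustrative sketch, and not as a description of the apparatus of PCT/GB2007/001170, annotation data which records the location of each instance of an entity might be represented as character offsets into the digital representation of the document, with a flag recording the outcome of the curator's review. The class `EntityAnnotation`, the functions `annotate` and `review`, the simple lexicon lookup used in place of real information extraction, and the entity names in the usage example are all assumptions made for the purposes of the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EntityAnnotation:
    start: int                       # offset of the first character of the instance
    end: int                         # offset one past the last character
    entity_type: str                 # e.g. 'protein'
    accepted: Optional[bool] = None  # None until a curator has reviewed it


def annotate(text: str, lexicon: Dict[str, str]) -> List[EntityAnnotation]:
    """Toy stand-in for information extraction: exact lexicon matching."""
    annotations = []
    for term, entity_type in lexicon.items():
        pos = text.find(term)
        while pos != -1:
            annotations.append(EntityAnnotation(pos, pos + len(term), entity_type))
            pos = text.find(term, pos + len(term))
    # Present instances to the curator in document order.
    return sorted(annotations, key=lambda a: a.start)


def review(annotations: List[EntityAnnotation],
           decisions: List[bool]) -> List[EntityAnnotation]:
    """Apply the curator's accept/reject decisions; return accepted annotations."""
    for annotation, decision in zip(annotations, decisions):
        annotation.accepted = decision
    return [a for a in annotations if a.accepted]
```

For example, with hypothetical entity names:

```python
text = "Abc1 binds Xyz2; Abc1 is a kinase."
found = annotate(text, {"Abc1": "protein", "Xyz2": "protein"})
kept = review(found, [True, False, True])  # the curator rejects the second instance
```

Because each annotation carries its location, a computer-user interface can highlight the instance in the displayed document, so that the curator reviews a proposed annotation rather than creating it from scratch.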
The present invention aims to provide an improved computer-user interface for use in reviewing data which has been automatically extracted from digital representations of documents by information extraction apparatus, for example, for use by a curator while reviewing data for export to a database.