1. Field of the Invention
The invention is concerned with identifying and extracting information from character strings and documents, such as technical or scientific documents, and in particular documents in a specific field or domain.
2. Description of the Related Art
There are currently two main techniques of named-entity recognition (a more general problem of Terminology Recognition): rule-based methods and heuristic methods. Each technique has its own set of strengths and weaknesses.
Rule-based methods extract entities based on linguistic rules constructed by domain experts. Heuristic methods extract entities based on measurable features. Various properties could be used as features, for example where the entity is in relation to nearby words, or where the entity is within a table or document.
The linguistic rules required by rule-based methods are costly to produce and require input from both domain experts and linguists. Results from rule-based methods tend to be precise, but the inflexible nature of the rules means that entities which differ slightly from the established rule are either mishandled or ignored.
The second approach (which is more widely used) is based on machine learning and heuristics. In order to extract and recognise the semantic type of any entity, most machine-learning techniques require training, performed by providing examples of correctly extracted entities. The learning algorithm uses the training data to select features on which classification decisions are based. In the case of text documents, features such as the words immediately preceding or following an entity can be used to classify the entity. For example, a system can be trained to identify the token after the feature “engine serial number:” as the serial number for that given engine.
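The context-feature idea can be illustrated with a minimal sketch. It is not a real learning algorithm: it simply picks the most frequent preceding context among a handful of annotated examples and then returns the token that follows that context in new text. The function names, the window width of three tokens and the example sentences and serial numbers are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def preceding_context(text, entity, width=3):
    """Return the `width` lower-cased tokens immediately before `entity`."""
    tokens = text.split()
    idx = tokens.index(entity)
    return " ".join(tokens[max(0, idx - width):idx]).lower()


def train(examples, width=3):
    """Pick the most frequent preceding context among annotated examples."""
    counts = Counter(preceding_context(t, e, width) for t, e in examples)
    return counts.most_common(1)[0][0]


def extract(text, context):
    """Return the token that follows `context` in `text`, or None."""
    tokens = text.split()
    lowered = [t.lower() for t in tokens]
    ctx = context.split()
    # Stop early enough that a token always exists after the context.
    for i in range(len(tokens) - len(ctx)):
        if lowered[i:i + len(ctx)] == ctx:
            return tokens[i + len(ctx)]
    return None


# Hypothetical annotated training data: (sentence, correctly extracted entity).
examples = [
    ("Engine serial number: 12345 removed for inspection", "12345"),
    ("Fitted engine serial number: 67890 to aircraft", "67890"),
]
context = train(examples)
serial = extract("Check engine serial number: ABC99 today", context)
```

If the training examples do not resemble the documents to be processed, the learned context never matches and `extract` returns `None`, which is the representativeness problem in miniature.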
One method of named-entity recognition (the extraction and identification of people, places, objects, etc. from text) in service documents uses gazetteer lists; these are lists of the terms expected within the domain of interest.
Lists of terms for different semantic types are produced, synonyms are added and any inflected forms are generated. Examples of entries in a parts gazetteer include:
        “PYLON INTERFACE I/O 1 (D71423P) CONNECTR”
        “PYLON INTERFACE, I/O 2 (D71419P) CONNECTOR”
        “tr opening actuator”
        “valve-anti ice pressur”
        “RB211 524H-T36/11”
These terms are sought within documents, and if they are found the corresponding text is annotated with the semantic type given by the semantic class of the gazetteer list.
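A gazetteer lookup of this kind might be sketched as follows. The dictionary layout, the semantic type labels `PART` and `ENGINE`, the case-insensitive matching and the sample sentence are assumptions made for illustration; only the gazetteer entries themselves are taken from the examples above.

```python
import re

# Hypothetical gazetteers: semantic type -> list of known terms.
GAZETTEERS = {
    "PART": ["tr opening actuator", "valve-anti ice pressur"],
    "ENGINE": ["RB211 524H-T36/11"],
}


def annotate(text, gazetteers=GAZETTEERS):
    """Return (start, end, matched_text, semantic_type) for each gazetteer hit."""
    annotations = []
    for semantic_type, terms in gazetteers.items():
        for term in terms:
            # Escape the term so characters such as "/" or "(" match literally.
            pattern = re.compile(re.escape(term), re.IGNORECASE)
            for match in pattern.finditer(text):
                annotations.append(
                    (match.start(), match.end(), match.group(0), semantic_type)
                )
    return sorted(annotations)


doc = "Replaced TR opening actuator on RB211 524H-T36/11."
hits = annotate(doc)
```

A term that is absent from every list, such as "thrust reverser opening actuator", produces no annotation at all, even though it refers to the same kind of component.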
A study of the terms used to refer to components with a particular part number showed that, as the popularity of the part number increases, the number of terms increases in a fairly linear and proportional manner. This is shown in FIG. 1. The increase is due to term variation, spelling differences, lexical, morphological and orthographic differences, word order changes, abbreviations and acronyms. This implies that for any given concept it is unlikely that all the terms can be identified in advance.
Gazetteer lists consist of precompiled terms with some synonyms, variations and inflections. Unless a term exists in one of the gazetteer lists, it will not be recognised. Technical domain terms are typified by long strings of nouns, where each noun represents an individual concept which itself possesses a long list of synonyms. The combinatorial and word-order differences rife in real-world documents therefore mean that a gazetteer capable of recognising a fair set of terms is costly not only to search, but also to produce and maintain.
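The scale of the problem can be illustrated with a small enumeration. The synonym lists below are hypothetical and deliberately short; the sketch simply generates every combination of synonyms in every word order for a single three-concept part name.

```python
from itertools import permutations, product

# Hypothetical synonym lists for the three concepts in one part name.
SYNONYMS = [
    ["valve", "vlv"],
    ["anti-ice", "anti ice", "antiice"],
    ["pressure", "pressur", "press"],
]


def surface_forms(synonyms):
    """Enumerate every synonym choice in every word order."""
    forms = set()
    for choice in product(*synonyms):
        for ordering in permutations(choice):
            forms.add(" ".join(ordering))
    return forms


forms = surface_forms(SYNONYMS)
```

Even with only two or three synonyms per concept, this yields 2 × 3 × 3 × 3! = 108 distinct surface forms for one part, all of which would need to be listed for an exact-match gazetteer to recognise them.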
Although machine-learning (heuristic) approaches, the second main type, do not require rules or gazetteers, they do require examples of correctly extracted entities. These are often provided by domain experts who spend time annotating documents. The laborious and costly nature of this process has given rise to the term ‘annotation bottleneck’, which describes the lack of quality training data.
When using heuristic methods, the assumption has to be made that the training data are representative of the corpus at large. If this is not the case, performance is drastically reduced.
As examples of the prior art, the following can be considered.
US 2003/177000 (Mao et al./Verity) shows a method and system for naming a cluster of words and phrases. The system uses a lexical database, similar to WordNet, which already links terms to other semantically similar terms.
US 2009/259459 (Ceusters et al.) describes a Conceptual World Representation Natural Language Understanding System and Method that uses an ontology to represent the relations between concepts (terms), e.g. PARIS is_a FRENCHCITY. This relies on the string of characters ‘paris’ being recognised as the entry “paris” in its lexicon/dictionary/database.
US 2005/119873 (Chaney et al./Udico) is concerned with a method and apparatus for generating a language-independent document abstract, and relates to a method of document summarisation with a language-agnostic methodology. The method does not make use of any external resources; instead, it uses statistics based on the length of words as an indication of term importance.