Electronic documents, or passages of text otherwise stored electronically (such as text stored directly on web pages accessible via the internet), can contain large amounts of information, either in a single source or across multiple sources. This is particularly relevant to the review of vast numbers of electronic documents, whether originally created in electronic form or converted into it, where particular types of passages or groups of text have to be identified. For example, it could be necessary to search through a document or a number of documents to identify entities (typically proper nouns) or passages relating specifically to entities. In legal due diligence, for instance, it may be necessary to extract all sentences relating to a specific corporate entity or a specific person, who may be referred to in different ways in different documents.
Prior art solutions proposed thus far have generally focused on (a) finding an exact or near match against a list of known entities within the language used in the documents being searched and/or (b) analyzing the grammar and syntax of the document to infer that tokens being used as nouns may represent entities, the most obvious example being the identification of capitalized nouns. However, prior art solutions suffer from accuracy issues in that they depend on the presence of tokens (i.e. features of the text beyond their linguistic meanings, grammar or syntax) without taking into account the semantic context of a particular phrase. For example, consider the phrase, “That's excellent art”. Prior art solutions would not immediately be able to determine whether this phrase is a compliment to a person named “Art” or is a compliment paid by a supervisory examiner to a junior examiner in relation to prior art found during a patent search. Similarly, the phrase “ . . . in the morning he would wave to the smiths on their way to work . . . ” leaves it unclear whether “the smiths” refers to a family or to a group of tradespeople.
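By way of illustration only, the two prior art approaches (a) and (b) may be sketched as follows. This is a minimal sketch and not an implementation of any particular prior art system: the list KNOWN_ENTITIES and the functions match_known_entities and capitalized_tokens are hypothetical names introduced here, and approach (b) is reduced to a simple capitalization test rather than a full grammatical analysis.

```python
import re

# Hypothetical list of known entities (an assumption for illustration only).
KNOWN_ENTITIES = {"Art", "Smiths"}


def match_known_entities(text, known=KNOWN_ENTITIES):
    """Approach (a): case-insensitive exact match against a list of known entities."""
    by_lowercase = {name.lower(): name for name in known}
    tokens = re.findall(r"[A-Za-z]+", text)
    return [by_lowercase[t.lower()] for t in tokens if t.lower() in by_lowercase]


def capitalized_tokens(text):
    """Approach (b), simplified: flag capitalized words that do not begin a sentence."""
    found = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z']+", sentence)
        # Skip the first word, which is capitalized regardless of its role.
        found.extend(w for w in words[1:] if w[0].isupper())
    return found


phrase = "That's excellent art"
print(match_known_entities(phrase))  # list match reports an entity in every case
print(capitalized_tokens(phrase))    # capitalization test reports no entity in every case
```

Under these assumptions, the list match reports “Art” whether or not a person is meant, while the capitalization test reports nothing in either case. Neither function consults the surrounding context, which is the accuracy issue described above.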
FIG. 1 illustrates the general state of the art, in which an entity extraction training algorithm is run on a set of training documents to develop an entity extraction model. This model is then applied iteratively to a set of documents to be analyzed, and a list of the named entities is extracted.
There is accordingly a need in the art for an improved method and system for identifying or extracting entities in electronic documents.