Parsing unstructured documents to create data that can be stored in a structured form is a well-established technique. However, no parser is perfect, and essential data may therefore remain embedded within a document because the parser has failed to discover it.
Parsers are typically used to identify entities of specific types occurring within documents. An example of such a parser is one that uses some form of rule to extract entities of a given entity type, such as a person or an organisation (wherein “person” is the entity type and “John Smith” is an example of an entity of that type).
For example, a rule that specifies “at least a first and second word beginning with a capital letter, followed by ‘lives in a house’” can be used to identify a name associated with a person entity in a particular document, e.g. “John Smith lives in a house”. Note that simply searching for keywords (e.g. “John Smith”) is not satisfactory, since the keywords will not find any other person entity in the document (e.g. “Joe Bloggs”).
However, the same rule will not identify the same name in another context, whether elsewhere in the same document or in another document, for example where the context is “is a scientist”, i.e. “John Smith is a scientist”. Thus, another rule is required to identify the entity in this context.
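The context dependence of such a rule can be illustrated with a minimal sketch. The regular expressions below are only one hypothetical encoding of the example rules, and the names NAME, LIVES_RULE, SCIENTIST_RULE and find_persons are assumptions made for illustration rather than part of any particular parser.

```python
import re

# Hypothetical encoding of "at least a first and second word beginning
# with a capital letter" (illustrative only; real names are more varied).
NAME = r"(?:[A-Z][a-z]+\s+)+[A-Z][a-z]+"

# Rule for the context "lives in a house".
LIVES_RULE = re.compile(rf"({NAME})\s+lives in a house")
# A separate rule is needed for the context "is a scientist".
SCIENTIST_RULE = re.compile(rf"({NAME})\s+is a scientist")


def find_persons(text, rules):
    """Return every name captured by any of the given rules, in rule order."""
    found = []
    for rule in rules:
        found.extend(match.group(1) for match in rule.finditer(text))
    return found


doc_a = "John Smith lives in a house. Joe Bloggs lives in a house."
doc_b = "John Smith is a scientist."

# The first rule finds both person entities in doc_a ...
in_context = find_persons(doc_a, [LIVES_RULE])
# ... but misses the same name in the other context in doc_b,
missed = find_persons(doc_b, [LIVES_RULE])
# until a second rule is added to the rule set.
recovered = find_persons(doc_b, [LIVES_RULE, SCIENTIST_RULE])
```

In this sketch the rule generalises across names (it finds “Joe Bloggs” as well as “John Smith”) but not across contexts, which is why each new context requires a further rule.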
One prior art solution to this problem is to provide for manual identification of missed entities and then to update the rule set with a rule that can find the missed entity. Disadvantageously, this is an overhead; moreover, a user undertaking manual identification can make errors by creating a rule that incorrectly identifies an entity (e.g. an entity of entity type person is identified as being of entity type organisation).
Another prior art solution is to use a database of known entities (e.g. a database of known names) to identify an entity in a document. Disadvantageously, however, creating a database of known entities is a very laborious task, and the database needs continual maintenance (e.g. for updates, deletions etc.).
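The database approach can likewise be sketched in a few lines. The set KNOWN_PERSONS and the function lookup_persons are hypothetical names introduced here for illustration, and plain substring matching is assumed purely to keep the example short.

```python
# Hypothetical database of known person names (illustrative only).
KNOWN_PERSONS = {"John Smith", "Joe Bloggs"}


def lookup_persons(text, known):
    """Return the known names that occur in the text, in sorted order."""
    # Naive substring test; a practical system would respect word boundaries.
    return sorted(name for name in known if name in text)


doc = "John Smith is a scientist. Jane Doe is a scientist."

# A known name is found regardless of its context ...
before_update = lookup_persons(doc, KNOWN_PERSONS)

# ... but "Jane Doe" is only found once the database has been
# maintained to include that name.
after_update = lookup_persons(doc, KNOWN_PERSONS | {"Jane Doe"})
```

The lookup is independent of context, but it can only ever find entities that someone has already entered into the database, which is the maintenance burden noted above.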