Written works such as works of fiction often contain a large number of characters. While some written works include a character list to help the reader remember the identity and significance of these characters, many do not. For such works, remembering all the characters becomes difficult for the reader, especially when the work uses various names for the same character (e.g., Tom, Tommy, and Thomas). This difficulty may result in confusion and reduced comprehension on the part of the reader, rendering the reading experience less enjoyable.
Automated methods for recognizing “named entities” (e.g., a person or place) in a body of text are known. These methods include the ability to determine whether different strings (e.g., “John Smith”, “Mr. Smith”, and “John”) refer to the same named entity. Further, existing methods and systems can determine a relative significance of a named entity based on the quantity of references to that named entity in the text.
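As an illustration only, the two capabilities described above (resolving different strings to one named entity, and ranking entities by the quantity of references) can be sketched as follows. The alias table, the function name `rank_entities`, and the sample text are hypothetical and are not taken from any existing system; a real recognizer would derive the aliases from the text rather than from a fixed table.

```python
import re
from collections import Counter

# Hypothetical alias table: each surface string maps to one canonical entity.
ALIASES = {
    "John Smith": "John Smith",
    "Mr. Smith": "John Smith",
    "John": "John Smith",
    "Mary Jones": "Mary Jones",
    "Mary": "Mary Jones",
}

def rank_entities(text, aliases=ALIASES):
    """Count references per canonical entity, most referenced first."""
    counts = Counter()
    remaining = text
    # Match longer aliases first so "John Smith" is not also counted as "John".
    for alias in sorted(aliases, key=len, reverse=True):
        pattern = r"\b" + re.escape(alias) + r"\b"
        # Blank out each match so shorter aliases cannot match it again.
        remaining, n = re.subn(pattern, " ", remaining)
        if n:
            counts[aliases[alias]] += n
    return counts.most_common()

SAMPLE = "John Smith arrived. Mr. Smith greeted Mary. John and Mary Jones left."
print(rank_entities(SAMPLE))
```

In this sketch the reference count stands in for relative significance, so the entity with the most references is listed first.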
Existing methods have been applied primarily to relatively short works, such as news reports, and to highly specialized scientific works, such as biomedical texts. Further, these methods involve a compromise between accuracy and completeness (i.e., the number of named entities identified). Configuring a named entity recognition system to return a greater number of named entities (i.e., higher completeness) necessarily results in an increased error rate (i.e., lower accuracy), while configuring for high accuracy dictates that some named entities will be omitted (i.e., lower completeness). In existing systems, the results are manually corrected, which is labor intensive and thus expensive.
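The compromise described above can be illustrated with a minimal sketch, assuming a recognizer that assigns each candidate string a confidence score and keeps only candidates at or above a configurable threshold. The candidate list, the scores, and the function name `precision_recall` are hypothetical. Here "accuracy" is modeled as precision (the fraction of kept candidates that are real entities) and "completeness" as recall (the fraction of real entities that are kept), which is one common reading of those terms rather than a definition given in the text.

```python
# Hypothetical scored candidates: (string, confidence, is_real_entity).
CANDIDATES = [
    ("John Smith", 0.95, True),
    ("London", 0.90, True),
    ("Mary", 0.70, True),
    ("Monday", 0.60, False),
    ("Tommy", 0.40, True),
    ("However", 0.30, False),
]

def precision_recall(threshold, candidates=CANDIDATES):
    """Return (precision, recall) when keeping candidates scored >= threshold."""
    kept = [c for c in candidates if c[1] >= threshold]
    true_kept = sum(1 for c in kept if c[2])
    total_true = sum(1 for c in candidates if c[2])
    # With nothing kept there are no errors, so precision is taken as 1.0.
    precision = true_kept / len(kept) if kept else 1.0
    recall = true_kept / total_true
    return precision, recall

# A strict threshold keeps few candidates; a lenient one keeps them all.
for t in (0.8, 0.5, 0.25):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

With this made-up data, lowering the threshold raises recall while lowering precision, which mirrors the compromise between completeness and accuracy.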
Corresponding reference characters indicate corresponding parts throughout the drawings.