In recent years, with the rapid development of the Internet and the widespread use of various electronic products (e.g., mobile phones, personal digital assistants (PDAs), notebook computers and electronic books (E-books)), more and more manufacturers and users provide the general public with various electronic information and electronic reading services. Accordingly, electronic information has become a primary source of information for people, and reading electronic information has become an indispensable part of people's daily lives.
Generally speaking, when a user of an electronic reading service (e.g., a user viewing an E-book or a website) encounters a new or interesting word, the user may wish to know the meaning, basic information or other related derivative information of the word. To meet this demand, services such as named entity marking and automatic link searching for electronic information have emerged.
Conventionally, most automatic named entity marking technologies filter specific word strings (e.g., a name of a person, a geographical name, or a proper noun) in an electronic document according to the frequencies of those word strings, and then mark the filtered word strings by labeling them with the category, description, explanation, or other related information of the named entity. For example, some conventional technologies adopt word strings that are frequently used in Internet search engines as a basis for marking a named entity. Other technologies employ a tokenization technology or a tokenizer in conjunction with a word library containing parts of speech and a syntax tree to tokenize a sentence according to frequency, thereby generating a tokenization result (e.g., one or more extracted named entities annotated with parts of speech) for named entity marking. However, these conventional named entity marking technologies are usually based only on the frequency of appearance and do not take the category of the named entity into account. As a result, they fail to determine the named entity to be marked according to the contents of the document to be marked, and they fail to mark a new word with a low frequency of appearance. These conventional technologies therefore often make wrong markings, mark unrelated words, or fail to mark new words, which leads to poor named entity marking results. In an attempt to reduce marking errors and improve marking accuracy, the conventional named entity marking technologies often resort to manual correction ex post facto, which consumes considerable human labor and time and makes it impossible to fully automate named entity marking.
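By way of illustration only, the frequency-based filtering described above may be sketched as follows. This is a minimal sketch and not the implementation of any particular conventional technology; the function name mark_by_frequency, the threshold min_count, the candidate list, and the sample text (including the made-up word "Zentrix") are all assumptions introduced for the example.

```python
import re
from collections import Counter


def mark_by_frequency(text, candidates, min_count=2):
    """Return the candidate word strings that appear at least min_count times.

    Hypothetical helper: it mimics marking driven purely by frequency of
    appearance, ignoring the category of the named entity and the topic
    of the document.
    """
    counts = Counter()
    for cand in candidates:
        # Count literal, non-overlapping occurrences of the candidate string.
        counts[cand] = len(re.findall(re.escape(cand), text))
    return [cand for cand in candidates if counts[cand] >= min_count]


# Illustrative document: "Taipei" is frequent, "Zentrix" is a new word.
doc = (
    "Taipei is the capital of Taiwan. Taipei hosts many technology firms. "
    "A startup named Zentrix opened an office in Taipei last year."
)
candidates = ["Taipei", "Zentrix"]

marked = mark_by_frequency(doc, candidates)
print(marked)
```

In this sketch the frequently appearing string "Taipei" is marked, whereas the new word "Zentrix", which appears only once, falls below the threshold and is left unmarked, which mirrors the deficiency noted above for new words with a low frequency of appearance.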
Accordingly, a need exists in the art for a named entity marking method that can mark new words and determine the named entity to be marked according to the document to be marked, so that named entities can be marked in the to-be-marked document fully automatically and with extremely high accuracy, without the need for manual correction ex post facto.