When reading an electronic or conventional book, a reader often encounters interesting or unfamiliar terms about which he or she wants to learn more than the book itself presents. Most likely, that knowledge is readily available on the Internet. For example, online encyclopedia databases, such as Wikipedia, are popular resources that contain a very large amount of well-organized information covering almost every conceivable subject matter. Conventionally, the reader must find a computing device connected to the Internet, open an Internet browser to visit Wikipedia, and then submit a search term to retrieve the relevant information on the book term. The reader may find this process cumbersome and disruptive, and may therefore abandon the attempt to explore the term in depth.
“Wikification” refers to the task of automatically linking text-based content to Wikipedia entries corresponding to terms mentioned in the text. Common terms of interest include people, places, organizations, and similar categories. Typically, a wikification process involves two primary steps: (1) detection of suitable candidate terms that are potentially interesting to a user, and (2) disambiguation of candidate terms that may match several Wikipedia entries. For instance, depending on the context, the term “Chicago” can refer to the city, the musical film, or any of the roughly 80 additional meanings currently listed on the Wikipedia disambiguation page for “Chicago.” Conventionally, most systems solve the disambiguation problem by analyzing the raw context surrounding the candidate term to determine which of the matching titles is most relevant to that context and therefore, presumably, to the term itself. This approach may not be efficient in locating the correct match.
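The two conventional steps described above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions and does not represent any particular existing system: the `TITLE_INDEX` dictionary, its entry labels and keywords, and the function names `detect_candidates`, `disambiguate`, and `wikify` are all hypothetical. The sketch assumes that candidate detection is a simple dictionary lookup and that disambiguation counts the keywords each matching entry shares with the words surrounding the term.

```python
import re

# Hypothetical miniature index: surface term -> {entry label: descriptive keywords}.
# A real system would derive this from Wikipedia titles and disambiguation pages.
TITLE_INDEX = {
    "chicago": {
        "Chicago (city)": {"city", "illinois", "lake", "michigan", "mayor"},
        "Chicago (film)": {"film", "musical", "movie", "cast", "director"},
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def detect_candidates(tokens):
    """Step 1: return (position, term) pairs for tokens found in the index."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok in TITLE_INDEX]


def disambiguate(tokens, position, term, window=10):
    """Step 2: pick the entry whose keywords overlap most with the raw context."""
    before = tokens[max(0, position - window):position]
    after = tokens[position + 1:position + 1 + window]
    context = set(before + after)
    entries = TITLE_INDEX[term]
    # Ties resolve to the first entry listed for the term.
    return max(entries, key=lambda label: len(entries[label] & context))


def wikify(text):
    """Run detection, then disambiguation, over a piece of text."""
    tokens = tokenize(text)
    return [(term, disambiguate(tokens, pos, term))
            for pos, term in detect_candidates(tokens)]


if __name__ == "__main__":
    print(wikify("She watched Chicago, the musical film, last night."))
    print(wikify("Chicago is a city on Lake Michigan in Illinois."))
```

Because the score depends only on the words near the term, a passage whose local context mentions none of the stored keywords falls back to the first listed entry, which illustrates how a purely context-based approach may fail to locate the correct match.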
In addition, most existing wikification efforts are directed to analyzing and tagging raw text in websites, scientific articles, and other relatively short text excerpts. The application of wikification to large text corpora, such as books, has been limited.