1. Field of Art
The invention generally relates to the field of textual analysis and in particular to determining characteristics of a book by analyzing text of the book on a section-by-section basis.
2. Background Information
A temporal language model can calculate the probability that a sequence of m words (w1, . . . , wm) was written during a particular time period. This probability can be represented as P(timePeriod|text), where text is the sequence of m words (w1, . . . , wm) and timePeriod is the particular time period (e.g., the 1950s). An example application of temporal language models is the dating of texts. Given a date-tagged reference corpus (consisting of documents from a particular time span) and a document X whose date is unknown but falls within that same time span, a text-dating system can classify X according to time partitions of predefined granularity (e.g., decades). Temporal language models derived from the corpus capture characteristics of the vocabulary used within particular time periods. A language model is computed from the undated document X and is compared to the temporal language models built from the reference corpus; the document is assigned to the time partition whose model it most closely matches.
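By way of illustration only, the comparison described above can be sketched as follows. The sketch is not taken from any particular text-dating system. It assumes a unigram model per time partition, add-one (Laplace) smoothing, whitespace tokenization, and an equal prior for every partition, so that ranking partitions by P(text|timePeriod) gives the same order as ranking them by P(timePeriod|text). The function names, the toy corpus, and the sample text are hypothetical.

```python
import math
from collections import Counter


def build_unigram_model(documents):
    """Count word occurrences across all documents of one time partition."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def log_likelihood(text, counts, vocab_size):
    """Log P(text | timePeriod) under a unigram model with add-one smoothing."""
    total = sum(counts.values())
    score = 0.0
    for word in text.lower().split():
        score += math.log((counts[word] + 1) / (total + vocab_size))
    return score


def date_text(text, corpus_by_period):
    """Return the best-scoring time partition and the scores of all partitions."""
    models = {
        period: build_unigram_model(docs)
        for period, docs in corpus_by_period.items()
    }
    # Shared vocabulary (assumption): every word seen in any partition or in the text.
    vocab = set(text.lower().split())
    for counts in models.values():
        vocab.update(counts)
    scores = {
        period: log_likelihood(text, counts, len(vocab))
        for period, counts in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical date-tagged reference corpus, partitioned by decade.
reference_corpus = {
    "1950s": [
        "the wireless set played a swell tune",
        "she bought a new phonograph record",
    ],
    "2000s": [
        "he sent an email from his laptop",
        "the website posted a blog update",
    ],
}

best_period, period_scores = date_text(
    "she posted the blog update from a laptop", reference_corpus
)
print(best_period, period_scores)
```

A practical system would likely use a far larger corpus, a more careful tokenizer, and possibly a different comparison measure (for example, a divergence between word distributions); the unigram likelihood is used here only because it is simple to follow.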
Text-dating systems are often used to analyze short documents, such as newspaper articles and web pages. These types of documents usually contain homogeneous language (i.e., language that was written during the same time period). Since the language is homogeneous, the choice of which portion of the document to analyze is usually irrelevant. Long documents, such as entire books, might contain language that is less homogeneous. If a text-dating system is used to analyze such a document, then text from different portions of the document might yield different results, and the choice of which portion of the document to analyze becomes important.