This specification relates to digital data processing, and in particular, to computer-implemented category-sensitive ranking for text.
Automatic summarization is the generation of a summary of a text by a computer process, e.g., a text summarization service. Text summarization services rank words or sentences of textual data, e.g., text on a webpage, to identify portions of the textual data that can be extracted and included in a summary of the textual data. In some situations, textual data can be associated with a topic. A particular word in the textual data can be ranked according to the expression P(z|x)P(x|z), where z is a topic and x is a word. The expression is the product of two probabilities: P(z|x), the probability that topic z is associated with the textual data given that word x occurs in the textual data, and P(x|z), the probability that word x occurs in the textual data given that topic z is associated with the textual data.
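The ranking expression P(z|x)P(x|z) can be illustrated with a minimal sketch. The sketch assumes, purely for illustration, that both probabilities are estimated from raw counts of how often each word occurs in text associated with each topic; the specification does not state how the probabilities are obtained. The function name `rank_words`, the table `TOY_COUNTS`, and all counts in it are hypothetical.

```python
from collections import Counter


def rank_words(topic_word_counts, topic):
    """Score each word x for topic z by P(z|x) * P(x|z), highest score first.

    Assumption: both probabilities are estimated from raw co-occurrence
    counts, which is only one of several possible estimators.
    """
    counts_for_topic = topic_word_counts[topic]
    # Total number of word occurrences observed under topic z.
    topic_total = sum(counts_for_topic.values())

    # Total occurrences of each word across all topics.
    word_totals = Counter()
    for counts in topic_word_counts.values():
        word_totals.update(counts)

    scores = {}
    for word, n in counts_for_topic.items():
        if n == 0:
            continue
        p_word_given_topic = n / topic_total        # P(x|z)
        p_topic_given_word = n / word_totals[word]  # P(z|x)
        scores[word] = p_topic_given_word * p_word_given_topic
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts: occurrences of each word in text labeled with each topic.
TOY_COUNTS = {
    "sports": {"goal": 6, "team": 3, "the": 6},
    "finance": {"stock": 5, "team": 1, "the": 14},
}

ranking = rank_words(TOY_COUNTS, "sports")
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

With these toy counts, "goal" and "the" are equally frequent under the "sports" topic, yet "the" ranks last because it also occurs often under "finance", which lowers P(z|x). The product therefore favors words that are both frequent within a topic and specific to it.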
Some text summarization services use topics that are not human-readable, e.g., topics consisting of a combination of words or characters that do not have semantic meaning in natural human language. Such topics may not provide insight into the semantic meanings of the words and sentences in the textual data, even though those meanings can be relevant to generating a summary of the textual data.