1. Field of Invention
This invention relates to segmentation and topic identification of a portion of text, or one or more documents that include text.
2. Description of Related Art
Long text documents, such as news articles and magazine articles, often discuss multiple topics and contain few, if any, headers. The ability to segment a document and identify its topics has various applications, such as high-precision retrieval. Different approaches have been taken. For example, methods for determining the topical content of a document based upon lexical content are described in U.S. Pat. Nos. 5,659,766 and 5,687,364 to Saund et al. Also, for example, methods for accessing relevant documents using global word co-occurrence patterns are described in U.S. Pat. No. 5,675,819 to Schuetze.
One approach to automated document indexing is Probabilistic Latent Semantic Analysis (PLSA), also called Probabilistic Latent Semantic Indexing (PLSI). This approach is described by Hofmann in “Probabilistic Latent Semantic Indexing”, Proceedings of SIGIR '99, pp. 50–57, August 1999, Berkeley, Calif., which is incorporated herein by reference in its entirety.
Another technique for subdividing texts into multi-paragraph units representing subtopics is TextTiling. This technique is described by Hearst in “TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages”, Computational Linguistics, Vol. 23, No. 1, pp. 33–64, 1997, which is incorporated herein by reference in its entirety.
A known method for determining a text's topic structure uses a statistical learning approach. In particular, topics are represented using word clusters, and a finite mixture model, called a Stochastic Topic Model (STM), is used to represent the word distribution within a text. In this known method, a text is segmented by detecting significant differences between Stochastic Topic Models, and topics are identified using estimations of Stochastic Topic Models. This approach is described in “Topic Analysis Using a Finite Mixture Model”, Li et al., Proceedings of Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 35–44, 2000 and “Topic Analysis Using a Finite Mixture Model”, Li et al., IPSJ SIGNotes Natural Language (NL), 139(009), 2000, each of which is incorporated herein by reference in its entirety.
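By way of illustration only, the general idea of representing a word distribution with a finite mixture over topics may be sketched as follows. This sketch is not the method of the cited references; their estimation procedure and their test for significant differences between models are not reproduced here. The function names, the toy topics, and the probability values are hypothetical and chosen only for this example.

```python
import math

def mixture_word_prob(word, topic_weights, topic_word_probs):
    """Generic finite mixture: P(word) = sum over topics k of P(k) * P(word | k)."""
    return sum(
        weight * topic_word_probs[topic].get(word, 0.0)
        for topic, weight in topic_weights.items()
    )

def log_likelihood(words, topic_weights, topic_word_probs, floor=1e-12):
    """Log-likelihood of a word sequence under the mixture.

    The floor is an assumption of this sketch; it only avoids log(0)
    for words that no topic covers.
    """
    return sum(
        math.log(max(mixture_word_prob(w, topic_weights, topic_word_probs), floor))
        for w in words
    )

# Hypothetical toy topics, each a distribution over a small word cluster.
topic_word_probs = {
    "trade": {"tariff": 0.5, "export": 0.5},
    "sports": {"goal": 0.5, "match": 0.5},
}

# Hypothetical mixture weights for two blocks of text.
block_a_weights = {"trade": 0.9, "sports": 0.1}
block_b_weights = {"trade": 0.1, "sports": 0.9}

words = ["goal", "match"]
fits_b_better = (
    log_likelihood(words, block_b_weights, topic_word_probs)
    > log_likelihood(words, block_a_weights, topic_word_probs)
)
```

In this toy example, a block whose words fit one set of mixture weights much better than another suggests that the two blocks are governed by different topic mixtures, which is the kind of difference a segmentation method of this type looks for.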
A related work on segmentation is described in “Latent Semantic Analysis for Text Segmentation”, Choi et al., Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pp. 109–117, 2001, which is incorporated herein by reference in its entirety. In that work, Latent Semantic Analysis is used in the computation of inter-sentence similarity, and segmentation points are identified using divisive clustering.
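By way of illustration only, a single divisive split driven by a similarity measure may be sketched as follows. This sketch substitutes plain bag-of-words cosine similarity for Latent Semantic Analysis and performs only one split, so it is not the method of the cited reference. The function names and the example sentences are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def best_split(sentences):
    """Return the sentence index at which splitting gives the least similar halves.

    Returns None when there are fewer than two sentences.
    """
    vectors = [Counter(s.lower().split()) for s in sentences]
    best_gap, best_score = None, float("inf")
    for gap in range(1, len(vectors)):
        left = sum(vectors[:gap], Counter())   # pooled counts before the gap
        right = sum(vectors[gap:], Counter())  # pooled counts after the gap
        score = cosine(left, right)
        if score < best_score:
            best_gap, best_score = gap, score
    return best_gap

# Hypothetical example text with a topic change after the second sentence.
sentences = [
    "the tariff raised export costs",
    "export tariff talks continued",
    "the team scored a late goal",
    "the goal won the match",
]
split_index = best_split(sentences)
```

A full divisive clustering would apply such a split recursively to each resulting segment until a stopping criterion is met; that criterion is left out of this sketch.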
Another related work on segmentation is described in “Statistical Models for Text Segmentation”, Beeferman et al., Machine Learning, 34, pp. 177–210, 1999, which is incorporated herein by reference in its entirety. In that work, a rich variety of cue phrases is utilized to segment a stream of data from an audio source, which may be transcribed, into topically coherent stories. That work is part of the Topic Detection and Tracking (TDT) program, which is part of the DARPA TIDES program.