Field
The disclosed embodiments relate to topic mining. More specifically, the disclosed embodiments relate to techniques for performing topic extraction using clause segmentation and high-frequency words.
Related Art
Topic mining techniques may be used to discover abstract topics or themes in a collection of otherwise unstructured documents. The discovered topics or themes may be used to identify concepts or ideas expressed in the documents, group the documents by topic or theme, determine sentiments and/or attitudes associated with the documents, and/or generate summaries associated with the topics or themes. In other words, topic mining may facilitate the understanding and use of information in large sets of unstructured data without requiring manual review of the data.
Sentiment analysis may also be applied to documents to determine the overall sentiment, attitude, and/or polarity of the documents' creators. For example, individual words or sentences of a document may be analyzed to determine if the opinions expressed in the document are positive, negative, or neutral. Sentiment scores associated with the words or sentences may then be combined to label the overall sentiment of the document as positive, negative, or neutral.
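The combination of sentence-level scores into a document-level label may be sketched as follows. This is a minimal illustration only, not a method taken from the disclosed embodiments: the function name `label_document`, the score range of -1 to 1, the use of a simple average, and the `threshold` value are all assumptions made for the example.

```python
def label_document(sentence_scores, threshold=0.1):
    """Average per-sentence sentiment scores and map the mean to a label.

    Assumes each score lies in [-1, 1]; the threshold is a hypothetical
    cutoff separating neutral from positive or negative.
    """
    if not sentence_scores:
        # No scored sentences: treat the document as neutral.
        return "neutral"
    mean = sum(sentence_scores) / len(sentence_scores)
    if mean > threshold:
        return "positive"
    if mean < -threshold:
        return "negative"
    return "neutral"


print(label_document([0.8, 0.6, -0.1]))
print(label_document([0.9, -0.9]))
```

Under these assumptions, a document containing one strongly positive sentence and one strongly negative sentence averages to zero and is labeled neutral, so the separate opinions are no longer visible in the document-level label.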
Topic mining techniques typically utilize metrics and/or statistical models to group document collections into distinct topics and themes. For example, topics may be generated from a set of documents using metrics such as term frequency-inverse document frequency (tf-idf), co-occurrence, and/or mutual information. Alternatively, statistical topic models, such as probabilistic latent semantic indexing (PLSI), latent Dirichlet allocation (LDA), and/or correlated topic models (CTMs), may be used to discover topics from a document collection and assign the topics to documents in the document collection.
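As an illustration of a metric-based approach, tf-idf may be computed roughly as shown below. This is a sketch under stated assumptions rather than a definitive implementation: documents are assumed to be pre-tokenized lists of words, term frequency is taken as the relative frequency within a document, inverse document frequency is taken as the unsmoothed natural logarithm of the number of documents divided by the number of documents containing the term, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Each document is a list of tokens. A term scores higher when it is
    frequent in its document and rare across the collection.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["the", "battery", "life", "is", "great"],
    ["the", "battery", "drains", "fast"],
    ["the", "screen", "is", "great"],
]
for doc_scores in tf_idf(docs):
    # Print the highest-scoring terms of each document first.
    print(sorted(doc_scores.items(), key=lambda kv: -kv[1])[:2])
```

With this weighting, a term that appears in every document (here, "the") receives a score of zero, while terms confined to a single document receive the highest scores and may serve as candidate topics.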
However, existing topic mining and sentiment analysis techniques have a number of drawbacks. First, the use of metrics such as tf-idf to identify potential topics may be computationally efficient but may produce a large number of topics with significant overlap. Conversely, the use of statistical topic models may require significant amounts of training data and/or computational overhead to extract topics from a set of documents.
Second, conventional sentiment analysis techniques may assign a single overall sentiment to a document, or to a topic in the document, even when the document contains multiple sentiments and/or topics. Moreover, sentiment analysis systems may rely on structured data sets such as product reviews and typically do not adapt well to new domains and/or noisy data sets such as social media.
Consequently, processing of large sets of unstructured data may be facilitated by mechanisms for improving the efficiency and/or accuracy of techniques for mining topics and/or identifying sentiments in the unstructured data.
In the figures, like reference numerals refer to the same figure elements.