Aspects of the exemplary embodiment relate to extracting information from text and find particular application in connection with a system and method for recognizing personality traits based on short text sequences.
Exploring the relationship between word use and psychometric traits has provided significant insight into aspects of human behavior (Pennebaker, et al., “Psychological aspects of natural language use: Our words, our selves,” Annual Rev. Psychol., 54:547-577, 2003). Different levels of representation of language have been used, such as syntactic, semantic, and higher-order representations, for example the psychologically-derived lexica of the Linguistic Inquiry and Word Count tool (Pennebaker, et al., “The development and psychometric properties of LIWC2015,” LIWC2015 Development Manual, Austin, Tex.: University of Texas at Austin, pp. 1-25, 2015, hereinafter, Pennebaker 2015). The study of personality traits based on analysis of text has often used a bag-of-words (BOW) type of approach in which each word in a vocabulary is associated with a respective feature-based word representation. A representation of a sequence of text can then be built by aggregating the word-level representations. An example of this method is the Open Vocabulary approach (Schwartz, et al., “Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach,” PLOS ONE, 8(9), pp. 1-16, 2013, hereinafter, Schwartz 2013). Given a set of personality traits, such as Extraversion, Emotional Stability, Agreeableness, Conscientiousness, and Openness, the sequence representation can be used to predict the personality traits of the author of the text.
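By way of illustration only, the aggregation of word-level representations into a sequence representation may be sketched as follows. The function name `sequence_representation`, the toy feature table, and the choice of averaging as the aggregation are assumptions made for this sketch; they are not taken from Schwartz 2013 or any other cited work.

```python
from typing import Dict, List


def sequence_representation(
    tokens: List[str], word_features: Dict[str, List[float]]
) -> List[float]:
    """Average the feature vectors of the in-vocabulary tokens.

    Tokens absent from `word_features` are skipped, since they have
    no word representation to contribute.
    """
    dim = len(next(iter(word_features.values())))
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vector = word_features.get(token)
        if vector is None:
            continue  # out-of-vocabulary token: nothing to aggregate
        for i, value in enumerate(vector):
            total[i] += value
        count += 1
    if count == 0:
        return total  # all-zero vector when no token is in the vocabulary
    return [value / count for value in total]


# Hypothetical two-dimensional word features, for illustration only.
toy_features = {"happy": [1.0, 0.0], "party": [0.0, 1.0]}
representation = sequence_representation(["happy", "party", "gr8"], toy_features)
```

In this sketch the made-up token "gr8" is silently dropped, so the resulting representation reflects only the two words that are in the vocabulary.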
One drawback of this bag-of-linguistic-features approach is that considerable effort can be spent on feature engineering. Another is that it relies on an assumption that these features, like the traits to which they relate, are similarly stable: the same language features always indicate the same traits. However, the relationship between language and personality is not consistent across all forms of communication (Nowson, et al., “Look! Who's Talking? Projection of Extraversion Across Different Social Contexts,” Proc. WCPR14, Workshop on Computational Personality Recognition at ACMM (22nd ACM Int'l Conf. on Multimedia), 2014, hereinafter, Nowson 2014). As an example, the use of negative emotion language (in particular, words relating to ‘anger’) is a strong indicator of Extraversion in conversational data, but this is not the case in the medium of video blogging.
Classification of short sequences of text, such as tweets and text messages, tends to be particularly challenging because users adapt their language to the message length limits of the platforms, e.g., 140 characters on Twitter. For example, abbreviated and made-up words tend to be used, which are not in the vocabulary and therefore lack a word representation.
Early work on computational personality recognition used SVM-based approaches and manipulated lexical and grammatical feature sets (Argamon, et al., “Lexical predictors of personality type,” Proc. 2005 Joint Annual Meeting of the Interface and the Classification Society of North America, pp. 1-16, 2005; Nowson, et al., “The Identity of Bloggers: Openness and gender in personal weblogs,” AAAI Spring Symp., Computational Approaches to Analyzing Weblogs, pp. 163-167, 2006). Data labelled with personality information is sparse (Nowson 2014), and there has been more interest in reporting novel feature sets. For example, surface forms, syntactic features, such as POS tags and dependency relations, analysis of punctuation and emoticon use, use of latent semantic analysis for topic modeling, and the use of external resources such as Linguistic Inquiry and Word Count (LIWC) (Pennebaker 2015) have been investigated. When applied to tweets, however, LIWC requires further cleaning of the data.
Deep-learning based approaches to personality trait recognition have also been investigated. A neural network based approach to personality prediction of users is described in Kalghatgi, et al., “A neural network approach to personality prediction based on the big-five model,” Int'l J. Innovative Research in Advanced Engineering (IJIRAE), 2(8):56-63, 2015. In this model, a Multilayer Perceptron (MLP) takes as input a collection of hand-crafted grammatical and social behavioral features from each user and assigns a label to each of the five personality traits. A Recurrent Neural Network (RNN) based system, exploiting the turn-taking of conversation for personality trait prediction, is described in Su, et al., “Exploiting turn-taking temporal evolution for personality trait perception in dyadic conversations,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, 24(4):733-744, 2016. In this approach, RNNs are employed to model the temporal evolution of dialog, taking as input LIWC-based and grammatical features. The output of the RNNs is then used for the prediction of personality trait scores of the participants of the conversations. Both of these approaches utilize hand-crafted features which rely heavily on domain expertise. They also focus on the prediction of trait scores at the user level, given all the available text from a user.
For applying deep learning models to NLP problems, word lookup tables may be used, where each word is represented by a dense real-valued vector in a low-dimensional space (Socher, et al., “Recursive deep models for semantic compositionality over a sentiment Treebank,” Proc. EMNLP, pp. 1631-1642, 2013; Kalchbrenner, et al., “A convolutional neural network for modelling sentences,” Proc. ACL, pp. 655-665, 2014; Yoon Kim, “Convolutional neural networks for sentence classification,” Proc. EMNLP, pp. 1746-1751, 2014). In order to obtain a sensible set of embeddings, a large corpus may be used for training the model in an unsupervised fashion, e.g., using Word2Vec representations (Mikolov, et al., “Efficient estimation of word representations in vector space,” Int'l Conf. on Learning Representations (ICLR 2013), 2013; Mikolov, et al., “Distributed representations of words and phrases and their compositionality,” Proc. 27th Annual Conf. on Neural Information Processing Systems (NIPS 2013), pp. 3111-3119, 2013) and GloVe (Pennington, et al., “GloVe: Global vectors for word representation,” Proc. 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP 2014), pp. 1532-1543, 2014).
Despite the success in capturing syntactic and semantic information with such word vectors, there are two practical problems with such an approach (Ling, et al., “Finding function in form: Compositional character models for open vocabulary word representation,” Proc. 2015 Conf. on Empirical Methods in Natural Language Processing, pp. 1520-1530, 2015, hereinafter, Ling 2015). First, due to the flexibility of language, previously unseen words are bound to occur regardless of how large the unsupervised training corpus is. The problem is particularly serious for text extracted from social media platforms, such as Twitter and Facebook, due to the noisy nature of user-generated text, e.g., typos, ad hoc acronyms and abbreviations, phonetic substitutions, and even meaningless strings (Han, et al., “Lexical normalisation of short text messages: Makn sens a #twitter,” Proc. 49th Annual Meeting of the ACL: Human Language Technologies, pp. 368-378, 2011). Second, the number of parameters for a model to learn is overwhelmingly large. Assuming that each word is represented by a vector of d dimensions, the total size of the word lookup table is d×|V|, where |V| is the size of the vocabulary, which tends to scale to the order of hundreds of thousands. Again, this problem is even more pronounced in noisier domains, such as short text generated by online users.
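The two problems noted above may be made concrete with a minimal sketch. The function names `lookup_table_size` and `unseen_tokens`, the example dimension of 300, the example vocabulary size of 100,000, and the toy vocabulary are all hypothetical values chosen for illustration; none of them is taken from Ling 2015 or the other cited works.

```python
from typing import List, Set


def lookup_table_size(d: int, vocabulary_size: int) -> int:
    """Number of real-valued parameters in a word lookup table: d x |V|."""
    return d * vocabulary_size


def unseen_tokens(tokens: List[str], vocabulary: Set[str]) -> List[str]:
    """Return the tokens that have no entry in the lookup table."""
    return [token for token in tokens if token not in vocabulary]


# Assumed example values: d = 300 dimensions, |V| = 100,000 words.
parameter_count = lookup_table_size(300, 100_000)

# A toy vocabulary of standard word forms, for illustration only.
toy_vocabulary = {"making", "sense", "a", "of"}
missing = unseen_tokens(["makn", "sens", "a", "#twitter"], toy_vocabulary)
```

Under these assumed values the lookup table alone holds 30 million parameters, and three of the four tokens in the noisy example have no word vector at all, even though a reader can easily recover their intended meaning.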
A compositional character-to-sequence model is described herein which addresses some of these problems.