### Research Publications

#### Developing Age and Gender Predictive Lexica over Social Media

Demographic lexica have potential for widespread use in social science, economic, and business applications. We derive predictive lexica (words and weights) for age and gender using regression and classification models from word usage in Facebook, blog, and Twitter data with associated demographic labels. The lexica, made publicly available, achieved state-of-the-art accuracy in language-based age and gender prediction over Facebook and Twitter, and were evaluated for generalization across social media genres as well as in limited-message situations.

$usage_{lex} = \sum_{word \in lex} w_{lex}(word) \cdot \frac{freq(word, doc)}{freq(*, doc)}$

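
The formula above can be illustrated with a minimal Python sketch. It covers only the weighted sum of relative frequencies as written; the function name `lexicon_usage`, the toy weights, and the whitespace tokenization are assumptions made for illustration and are not the published lexica or the authors' code.

```python
from collections import Counter


def lexicon_usage(tokens, lexicon):
    """Sum of weight * relative frequency over the words in the lexicon.

    tokens:  list of word tokens from one document (or one user's messages)
    lexicon: dict mapping word -> weight, i.e. w_lex(word)
    """
    counts = Counter(tokens)
    total = sum(counts.values())  # freq(*, doc): all tokens in the document
    if total == 0:
        return 0.0  # an empty document has no usage
    # counts[word] is 0 for lexicon words that never occur in the document
    return sum(weight * counts[word] / total for word, weight in lexicon.items())


# Hypothetical toy weights, NOT values from the released lexica.
toy_lexicon = {"homework": -2.0, "mortgage": 3.0}
doc = "i have homework and more homework tonight".split()
score = lexicon_usage(doc, toy_lexicon)
print(score)  # -4/7, about -0.571
```

Here "homework" occurs twice among seven tokens, so its relative frequency is 2/7 and it contributes -2.0 * 2/7 to the score; "mortgage" does not occur and contributes nothing.
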
CITATION: Sap, M., Park, G., Eichstaedt, J. C., Kern, M. L., Stillwell, D. J., Kosinski, M., Ungar, L. H., & Schwartz, H. A. (2014). Developing Age and Gender Predictive Lexica over Social Media. In EMNLP.
bibtex
@inproceedings{sap2014developing,
author={Sap, M. and Park, G. and Eichstaedt, J. C. and Kern, M. L. and Stillwell, D. J. and Kosinski, M. and Ungar, L. H. and Schwartz, H. A.},
title={Developing Age and Gender Predictive Lexica over Social Media},
booktitle={EMNLP},
year={2014},

}

#### The Online Social Self: An Open Vocabulary Approach to Personality

Objective: We present a new open language analysis approach that identifies and visually summarizes the dominant naturally occurring words and phrases that most distinguish each Big Five personality trait. Method: Using millions of posts from 69,792 Facebook users, we examined the correlation of personality traits with online word usage. Our analysis method consists of feature extraction, correlational analysis, and visualization. Results: The distinguishing words and phrases were face valid and provide insight into processes that underlie the Big Five traits. Conclusion: Open-ended, data-driven exploration of large datasets combined with established psychological theory and measures offers new tools to further understand the human psyche.
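
The feature extraction and correlational analysis steps described above might be sketched as follows. This is a simplified illustration, not the authors' pipeline: it uses single words only, a plain Pearson correlation with no covariates or significance correction, and hypothetical helper names (`relative_freqs`, `pearson`, `word_trait_correlations`) and toy data.

```python
from collections import Counter
from math import sqrt


def relative_freqs(tokens):
    """Feature extraction: relative frequency of each word for one user."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def word_trait_correlations(user_tokens, trait_scores):
    """Correlate each word's relative frequency with a trait score across users."""
    freqs = [relative_freqs(tokens) for tokens in user_tokens]
    vocab = set().union(*freqs)  # every word used by at least one user
    return {w: pearson([f.get(w, 0.0) for f in freqs], trait_scores) for w in vocab}


# Hypothetical toy data: three users and made-up extraversion scores.
users = [
    "party party tonight".split(),
    "party with friends".split(),
    "reading a book".split(),
]
extraversion = [3.0, 2.0, 1.0]
corr = word_trait_correlations(users, extraversion)
top = sorted(corr, key=corr.get, reverse=True)[:3]  # most positively correlated words
```

In this toy data the relative frequency of "party" falls in step with the trait score, so it is among the most positively correlated words, while "book" is negatively correlated. A ranked list like `top` is the kind of input a visual summary could be built from.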

CITATION: Kern, M. L., Eichstaedt, J. C., Schwartz, H. A., Dziurzynski, L., Ungar, L. H., Stillwell, D. J., Kosinski, M., Ramones, S. M., & Seligman, M. E. P. (2013). The Online Social Self: An Open Vocabulary Approach to Personality. In Assessment.
bibtex
@article{kern2013the,
author={Kern, M. L. and Eichstaedt, J. C. and Schwartz, H. A. and Dziurzynski, L. and Ungar, L. H. and Stillwell, D. J. and Kosinski, M. and Ramones, S. M. and Seligman, M. E. P.},
title={The Online Social Self: An Open Vocabulary Approach to Personality},
journal={Assessment},
year={2013},

}

#### Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach

We analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests, and found striking variations in language with personality, gender, and age. In our open-vocabulary technique, the data itself drives a comprehensive exploration of language that distinguishes people, finding connections that are not captured with traditional closed-vocabulary word-category analyses. Our analyses shed new light on psychosocial processes, yielding results that are face valid (e.g., subjects living in high elevations talk about the mountains), tie in with other research (e.g., neurotic people disproportionately use the phrase "sick of" and the word "depressed"), suggest new hypotheses (e.g., an active life implies emotional stability), and give detailed insights (males use the possessive "my" when mentioning their "wife" or "girlfriend" more often than females use "my" with "husband" or "boyfriend"). To date, this represents the largest study, by an order of magnitude, of language and personality.

CITATION: Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M. E. P., & Ungar, L. H. (2013). Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. In PLOS ONE 8(9).
bibtex
@article{schwartz2013personality,
author={Schwartz, H Andrew and Eichstaedt, Johannes C and Kern, Margaret L and Dziurzynski, Lukasz and Ramones, Stephanie M and Agrawal, Megha and Shah, Achal and Kosinski, Michal and Stillwell, David and Seligman, Martin E P and Ungar, Lyle H},
title={Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach},
journal={PLOS ONE},
volume={8},
number={9},
year={2013},

}

#### From "sooo excited!!!" to "so proud": Using language to study development

We introduce a new method, differential language analysis (DLA), for studying human development that uses computational linguistics to analyze the big data available through online social media in light of psychological theory. Our open vocabulary DLA approach finds words, phrases, and topics that distinguish groups of people based on one or more characteristics. Using a dataset of over 70,000 Facebook users, we identify how word and topic use vary as a function of age, and compile cohort-specific words and phrases into visual summaries that are face valid and intuitively meaningful. We demonstrate how this methodology can be used to test developmental hypotheses, using the aging positivity effect (Carstensen & Mikels, 2005) as an example. While this study focuses primarily on common trends across age-related cohorts, the same methodology can be used to explore heterogeneity within developmental stages or to explore other characteristics that differentiate groups of people. Our comprehensive list of words and topics is available on our website for deeper exploration by the research community.

CITATION: Kern, M. L., Eichstaedt, J. C., Schwartz, H. A., Park, G., Ungar, L. H., Stillwell, D. J., Kosinski, M., Dziurzynski, L., & Seligman, M. E. P. (2013). From "sooo excited!!!" to "so proud": Using language to study development. In Developmental Psychology.
bibtex
@article{kern2013from,
author={Kern, M. L. and Eichstaedt, J. C. and Schwartz, H. A. and Park, G. and Ungar, L. H. and Stillwell, D. J. and Kosinski, M. and Dziurzynski, L. and Seligman, M. E. P.},
title={From "sooo excited!!!" to "so proud": Using language to study development},
journal={Developmental Psychology},
year={2013},

}

#### Characterizing Geographic Variation in Well-Being using Tweets

The language used in tweets from 1,300 different US counties was found to be predictive of the subjective well-being of people living in those counties as measured by representative surveys. Topics, sets of co-occurring words derived from the tweets using LDA, improved accuracy in predicting life satisfaction over and above standard demographic and socio-economic controls (age, gender, ethnicity, income, and education). The LDA topics provide a greater behavioural and conceptual resolution into life satisfaction than the broad socio-economic and demographic variables. For example, tied in with the psychological literature, words relating to outdoor activities, spiritual meaning, exercise, and good jobs correlate with increased life satisfaction, while words signifying disengagement like ‘bored’ and ‘tired’ show a negative association.

CITATION: Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Lucas, R. E., Agrawal, M., Park, G. J., Lakshmikanth, S. K., Jha, S., Seligman, M. E. P., & Ungar, L. H. (2013). Characterizing Geographic Variation in Well-Being using Tweets. In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM). Boston, MA.
bibtex
@inproceedings{schwartz2013characterizing,
author={Schwartz, H Andrew and Eichstaedt, Johannes C and Kern, Margaret L and Dziurzynski, Lukasz and Lucas, Richard E and Agrawal, Megha and Park, Gregory J and Lakshmikanth, Shrinidhi K and Jha, Sneha and Seligman, Martin E P and Ungar, Lyle H},
title={Characterizing Geographic Variation in Well-Being using Tweets},
booktitle={Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM)},
year={2013},
location={Boston, MA},

}

#### Toward Personality Insights from Language Exploration in Social Media

Language in social media reveals a lot about people’s personality and mood as they discuss the activities and relationships that constitute their everyday lives. Although social media are widely studied, researchers in computational linguistics have mostly focused on prediction tasks such as sentiment analysis and authorship attribution. In this paper, we show how social media can also be used to gain psychological insights. We demonstrate an exploration of language use as a function of age, gender, and personality from a dataset of Facebook posts from 75,000 people who have also taken personality tests, and we suggest how more sophisticated tools could be brought to bear on such data.

CITATION: Schwartz, H. A., Eichstaedt, J. C., Dziurzynski, L., Kern, M. L., Blanco, E., Kosinski, M., Stillwell, D., Seligman, M. E. P., & Ungar, L. H. (2013). Toward Personality Insights from Language Exploration in Social Media. In Proceedings of the AAAI Spring Symposium Series. Stanford, CA.
bibtex
@inproceedings{schwartz2013toward,
author={Schwartz, H Andrew and Eichstaedt, Johannes C and Dziurzynski, Lukasz and Kern, Margaret L and Blanco, Eduardo and Kosinski, Michal and Stillwell, David and Seligman, Martin E P and Ungar, Lyle H},
title={Toward Personality Insights from Language Exploration in Social Media},
booktitle={Proceedings of the AAAI Spring Symposium Series},
year={2013},
location={Stanford, CA},

}

#### Choosing the Right Words: Characterizing and Reducing Error of the Word Count Approach

Social scientists are increasingly using the vast amount of text available on social media to measure variation in happiness and other psychological states. Such studies count words deemed to be indicators of happiness and track how the word frequencies change across locations or time. This word count approach is simple and scalable, yet often picks up false signals, as words can appear in different contexts and take on different meanings. We characterize the types of errors that occur using the word count approach, and find lexical ambiguity to be the most prevalent. We then show that one can reduce error with a simple refinement to such lexica by automatically eliminating highly ambiguous words. The resulting refined lexica improve precision as measured by human judgments of word occurrences in Facebook posts.
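
The word count approach and the refinement described above can be sketched roughly as follows. The per-word ambiguity score, the threshold, and the function names are all hypothetical; the abstract does not state how ambiguity is measured, so this shows only the general idea of dropping highly ambiguous words before counting.

```python
def count_lexicon_hits(tokens, lexicon):
    """Word count approach: number of tokens that appear in the lexicon."""
    return sum(1 for token in tokens if token in lexicon)


def refine_lexicon(lexicon, ambiguity, threshold):
    """Keep only words whose ambiguity score is at or below the threshold.

    ambiguity is a hypothetical dict of word -> score in [0, 1]; words with
    no score are kept, which is a design choice of this sketch.
    """
    return {word for word in lexicon if ambiguity.get(word, 0.0) <= threshold}


# Hypothetical toy lexicon and ambiguity scores, for illustration only.
happiness_lexicon = {"happy", "great", "like"}
ambiguity = {"happy": 0.1, "great": 0.4, "like": 0.9}
refined = refine_lexicon(happiness_lexicon, ambiguity, threshold=0.5)

post = "i feel like going home".split()
raw_hits = count_lexicon_hits(post, happiness_lexicon)  # counts "like"
refined_hits = count_lexicon_hits(post, refined)        # "like" was removed
```

In the toy post, "like" is used as a comparison rather than as an expression of positive feeling, so the unrefined lexicon registers a false signal that the refined lexicon avoids.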

CITATION: Schwartz, H. A., Eichstaedt, J. C., Dziurzynski, L., Kern, M. L., Blanco, E., Ramones, S., Seligman, M. E. P., & Ungar, L. H. (2013). Choosing the Right Words: Characterizing and Reducing Error of the Word Count Approach. In Proceedings of *SEM-2013: Second Joint Conference on Lexical and Computational Semantics.
bibtex
@inproceedings{schwartz2013choosing,
author={Schwartz, H Andrew and Eichstaedt, Johannes C and Dziurzynski, Lukasz and Kern, Margaret L and Blanco, Eduardo and Ramones, Stephanie and Seligman, Martin E P and Ungar, Lyle H},
title={Choosing the Right Words: Characterizing and Reducing Error of the Word Count Approach},
booktitle={Proceedings of *SEM-2013: Second Joint Conference on Lexical and Computational Semantics},
year={2013},

}