RESOURCES
An online community space dedicated to supporting the field of language analysis for social science.
Explore thousands of topics, derived via LDA, from over 14 million Facebook status updates.
What words and phrases are characteristic of age, gender, and the Big Five personality traits?
A lexicon is a listing of words centered upon a particular topic of interest. Click here to download available lexica.
Software (written in Python) available for free download under a Creative Commons license.
Our end-to-end text analysis package specifically suited for social media and social scientific research.
Additional data and information useful for research purposes.
Facebook Topics

You might also use...
  • Conditional probabilities [.csv] (sparse matrix format)

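The goal is to get the probability of a topic given the document:

    $$p(\text{topic} \mid \text{document}) = \sum_{\text{word} \in \text{topic}} p(\text{topic} \mid \text{word}) \, p(\text{word} \mid \text{document})$$

where p(word | document) is the word's relative frequency: its count in the document divided by the document's total word count.

For example, let's say we have topics with the following conditional probabilities for words: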

    topic 1: a: 0.01, b: 0.02, c: 0.001
    topic 2: c: 0.02, d: 0.005

and two documents with the following frequencies of words:

    document 1: a: 2, b: 10, c: 3, d: 0, e: 6, f: 4
    document 2: a: 5, b: 3, c: 8, d: 4, e: 0, f: 10

The total word counts of the documents are therefore:

    document 1: 2 + 10 + 3 + 0 + 6 + 4 = 25
    document 2: 5 + 3 + 8 + 4 + 0 + 10 = 30

Document 1's use of topics is given by summing the weighted relative frequencies:

    p(topic1|document1): (2/25)*0.01 + (10/25)*0.02 + (3/25)*0.001 = 0.00892
    p(topic2|document1): (3/25)*0.02 + (0/25)*0.005 = 0.0024

while document 2's use of topics is:

    p(topic1|document2): (5/30)*0.01 + (3/30)*0.02 + (8/30)*0.001 = 0.00393
    p(topic2|document2): (8/30)*0.02 + (4/30)*0.005 = 0.006
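
The walk-through above can be reproduced with a short Python sketch. The function and variable names here are illustrative and not part of any released package; the numbers are the ones from the example.

```python
# Conditional probabilities p(topic | word), taken from the example above.
topic_word_probs = {
    "topic 1": {"a": 0.01, "b": 0.02, "c": 0.001},
    "topic 2": {"c": 0.02, "d": 0.005},
}

# Word frequencies per document, taken from the example above.
documents = {
    "document 1": {"a": 2, "b": 10, "c": 3, "d": 0, "e": 6, "f": 4},
    "document 2": {"a": 5, "b": 3, "c": 8, "d": 4, "e": 0, "f": 10},
}


def topic_usage(word_counts, word_probs):
    """Sum p(topic | word) weighted by each word's relative frequency."""
    total = sum(word_counts.values())
    if total == 0:
        return 0.0  # an empty document uses no topic
    return sum(
        prob * word_counts.get(word, 0) / total
        for word, prob in word_probs.items()
    )


# usage[document][topic] holds p(topic | document).
usage = {
    doc: {topic: topic_usage(counts, probs)
          for topic, probs in topic_word_probs.items()}
    for doc, counts in documents.items()
}

for doc, topics in usage.items():
    for topic, value in topics.items():
        print(f"p({topic} | {doc}) = {value:.5f}")
```

Words that appear in a document but not in a topic (e and f here) still count toward the document's total, which is why the denominators are 25 and 30.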

Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M. E., & Ungar, L. H. (2013). Personality, gender, and age in the language of social media: The Open-Vocabulary Approach. PLOS ONE, 8(9), e73791.
@article{schwartz2013personality,
author={Schwartz, H Andrew and Eichstaedt, Johannes C and Kern, Margaret L and Dziurzynski, Lukasz and Ramones, Stephanie M and Agrawal, Megha and Shah, Achal and Kosinski, Michal and Stillwell, David and Seligman, Martin EP and Ungar, Lyle H},
title={Personality, gender, and age in the language of social media: The Open-Vocabulary Approach},
journal={PLOS ONE},
volume={8},
number={9},
year={2013},
pages={e73791}
}
Word and Phrase Correlations

Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M. E., & Ungar, L. H. (2013). Personality, gender, and age in the language of social media: The Open-Vocabulary Approach. PLOS ONE, 8(9), e73791.
Software

  • Our slightly improved version of Christopher Potts' Happy Fun Tokenizer (shared with his permission). [happierfuntokenizing.zip]

  • Our Differential Language Analysis Toolkit (DLATK). It interacts with MySQL data and performs feature extraction and statistical analyses. Available on GitHub. More information can be found at dlatk.wwbp.org.

These works are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License and a GNU General Public License v3 (GPLv3).
Lexica

  • Age and Gender Lexica
    Our data-driven age and gender lexica were generated from about 97,000 Facebook, Blogger and Twitter users. [.zip]
    Sap, M., Park, G., Eichstaedt, J. C., Kern, M. L., Stillwell, D. J., Kosinski, M., Ungar, L. H., & Schwartz, H. A. (2014). Developing age and gender predictive lexica over social media. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
    @inproceedings{sap2014developing,
    author={Sap, Maarten and Park, Greg and Eichstaedt, Johannes C and Kern, Margaret L and Stillwell, David J and Kosinski, Michal and Ungar, Lyle H and Schwartz, H Andrew},
    title={Developing age and gender predictive lexica over social media},
    booktitle={Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2014},
    }

  • Refined Lexica
    Please email Andrew Schwartz to request refined lexica, such as LIWC. These may be posted shortly.
    Schwartz, H. A., Eichstaedt, J. C., Dziurzynski, L., Kern, M. L., Blanco, E., Ramones, S., Seligman, M. E. P., & Ungar, L. H. (2013). Choosing the right words: Characterizing and reducing error of the Word Count Approach. Proceedings of *SEM-2013: Second Joint Conference on Lexical and Computational Semantics, Atlanta, GA, USA, 296-305.
    @inproceedings{schwartz2013choosing,
    author={Schwartz, H Andrew and Eichstaedt, Johannes C and Dziurzynski, Lukasz and Kern, Margaret L and Blanco, Eduardo and Ramones, Stephanie and Seligman, Martin E P and Ungar, Lyle H},
    title={Choosing the right words: Characterizing and reducing error of the Word Count Approach.},
    booktitle={Proceedings of *SEM-2013: Second Joint Conference on Lexical and Computational Semantics, Atlanta, Georgia, USA},
    year={2013},
    location={Atlanta, GA, USA},
    pages={296-305}
    }

  • PERMA Lexicon
    Our lexicon to predict well-being as measured through PERMA scales. [.zip] [Usage license]
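    Schwartz, H. A., Sap, M., Kern, M. L., Eichstaedt, J. C., Kapelner, A., Agrawal, M., Blanco, E., Dziurzynski, L., Park, G., Stillwell, D., Kosinski, M., Seligman, M. E. P., & Ungar, L. H. (2016). Predicting individual well-being through the language of social media. Pacific Symposium on Biocomputing, 21, 516-527.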
    @inproceedings{schwartz2016predicting,
    author={Schwartz, H. Andrew and Sap, Maarten and Kern, Margaret L. and Eichstaedt, Johannes C. and Kapelner, Adam and Agrawal, Megha and Blanco, Eduardo and Dziurzynski, Lukasz and Park, Gregory and Stillwell, David and Kosinski, Michal and Seligman, Martin E. P. and Ungar, Lyle H.},
    title={Predicting Individual Well-Being Through the Language of Social Media},
    booktitle={Pacific Symposium on Biocomputing},
    volume={21},
    year={2016},
    pages={516-527}
    }

  • Spanish PERMA Lexicon
    Our lexicon to measure PERMA in Spanish, derived from Spanish tweets annotated with PERMA. [.zip]
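    Schwartz, H. A., Sap, M., Kern, M. L., Eichstaedt, J. C., Kapelner, A., Agrawal, M., Blanco, E., Dziurzynski, L., Park, G., Stillwell, D., Kosinski, M., Seligman, M. E. P., & Ungar, L. H. (2016). Predicting individual well-being through the language of social media. Pacific Symposium on Biocomputing, 21, 516-527.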

  • Prospection Lexicon: Temporal Orientation

  • Affect and Intensity Lexicon

  • Optimism Lexicon

  • Additional Lexica
    Please visit LexHub for more resources.

  • When Applying the Lexica

    To calculate lexicon usage, multiply each word's weight in the lexicon by that word's relative frequency, sum over all words, and then add the intercept value to correct for model bias (found under '_intercept' in the lexicon CSVs).

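    A weighted lexicon is often applied as the sum of all weighted word relative frequencies over a document:

    $$\text{usage}_{lex}(doc) = \sum_{word \in lex} w_{lex}(word) \cdot \frac{freq(word, doc)}{freq(*, doc)}$$

    where $w_{lex}(word)$ is the lexicon weight for the word, $freq(word, doc)$ is the frequency of the word in the document (or for a given user), and $freq(*, doc)$ is the total word count for that document (or user).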

    For example, given a lexicon with weights for words a, b, and c, and two documents with known word frequencies, first compute each document's total word count. Each document's lexicon usage is then the sum of its weighted relative frequencies.

    Once the usages have been computed, the intercept of the lexicon needs to be added to each of them.

    If the lexicon represents age, the resulting values are the predicted ages for the documents. If it represents gender, take the sign of the result: if it is positive, the document is predicted to be female; otherwise it is predicted to be male.
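
As a rough illustration of the procedure described above, the following Python sketch loads a weighted lexicon and applies it to two documents. Everything concrete in it is an assumption made for the example: the two-column `term,weight` CSV layout, the weights, the intercept value, the document counts, and the helper names `load_lexicon`, `lexicon_usage`, and `gender_from_score`. Only the `_intercept` row name and the sign rule for gender come from the text; check the downloaded lexicon files for their actual layout.

```python
import csv
import io

# Hypothetical lexicon: layout and numbers are invented for illustration
# and are not taken from any published lexicon.
LEXICON_CSV = """term,weight
_intercept,23.5
a,120.0
b,-40.0
c,15.0
"""


def load_lexicon(csv_text):
    """Split a lexicon CSV into per-word weights and the intercept."""
    weights, intercept = {}, 0.0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["term"] == "_intercept":
            intercept = float(row["weight"])
        else:
            weights[row["term"]] = float(row["weight"])
    return weights, intercept


def lexicon_usage(word_counts, weights, intercept):
    """Sum weight * relative frequency over the words, then add the intercept."""
    total = sum(word_counts.values())
    if total == 0:
        return intercept  # no words: only the intercept remains
    weighted = sum(
        weights.get(word, 0.0) * count / total
        for word, count in word_counts.items()
    )
    return weighted + intercept


def gender_from_score(score):
    """Sign rule from the text: positive means female, otherwise male."""
    return "female" if score > 0 else "male"


weights, intercept = load_lexicon(LEXICON_CSV)

# Hypothetical word frequencies; "d" is not in the lexicon but still
# counts toward each document's total word count.
doc1 = {"a": 2, "b": 10, "c": 3, "d": 10}
doc2 = {"a": 5, "b": 3, "c": 8, "d": 14}

score1 = lexicon_usage(doc1, weights, intercept)
score2 = lexicon_usage(doc2, weights, intercept)
print(score1, score2)
```

With an age lexicon, `score1` and `score2` would be read directly as predicted ages; with a gender lexicon, they would be passed to `gender_from_score`.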

Additional Data Sets

  • [County HIV Prevalence Topic Tagcloud Data]
    (controlled for density and ethnicity).
    Ireland, M. E., Schwartz, H. A., Chen, Q., Ungar, L. H., & Albarracín, D. (2015). Future-oriented tweets predict lower county-level HIV prevalence in the United States. Health Psychology, 34(S), 1252.

  • [County Heart Health Data]
    Eichstaedt, J. C., Schwartz, H. A., Kern, M. L., Park, G., Labarthe, D. R., Merchant, R. M., Jha, S., Agrawal, M., Dziurzynski, L. A., Sap, M., Weeg, C., Larson, E. E., Ungar, L. H., & Seligman, M. E. (2015). Psychological Language on Twitter Predicts County-Level Heart Disease Mortality. Psychological Science 26(2), 159-169.

  • [Temporal Orientation Result Sets]
    Schwartz, H. Andrew, Park, G., Sap, M., Weingarten, E., Eichstaedt, J., Kern, M., Stillwell, D., Kosinski, M., Berger, J., Seligman, M., & Ungar, L. (2015). Extracting Human Temporal Orientation from Facebook Language. NAACL-2015: Conference of the North American Chapter of the Association for Computational Linguistics.

  • [Valence and Arousal Facebook Posts]
    Preotiuc-Pietro, D., Schwartz, H.A., Park, G., Eichstaedt, J., Kern, M., Ungar, L., Shulman, E.P. (2016). Modelling Valence and Arousal in Facebook Posts. Proceedings of the Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA), NAACL.

  • Please visit LexHub for more available Data Sets.