# Zodiac sign stereotypes in Twitter

In experiments on word usage on Twitter, I’ve repeatedly noticed some very coherent groups of hashtags and words: those belonging to astrology. Apparently, many users post horoscope information, statements or comments and tag them with the name of the zodiac sign. So I wondered (since I have pretty much tried to ignore astrology all my life) which traits people most characteristically use to describe each sign.

#Taurus is extremely kind and sweet..until you betray them; then death is better.

To uncover this, I planned to use a combination of Twitter data and one of my favourite statistical measures – Pointwise Mutual Information (PMI) [1,2].

PMI measures how often two words co-occur in the same context, compared with how often they would co-occur by chance if they were independent:

$\text{PMI}(x,y)=\log \frac{P(x,y)}{P(x)P(y)}$

This measure is particularly popular in NLP (Natural Language Processing), where it is useful in a range of tasks: collocation extraction [3], lexical semantics [4] or sentiment analysis [5]. PMI’s main drawback (in my opinion) is the lack of fixed bounds. Without bounds, PMI values are biased towards rare words: two words that always occur together score $-\log P(x)$, which grows as the words get rarer. To address this issue, Bouma [6] introduced NPMI, whose values lie between -1 and 1:

$\text{NPMI}(x,y)= \frac{\log \frac{P(x,y)}{P(x)P(y)}}{-\log P(x,y)}$

In practice, the probabilities are estimated as ratios: the number of tweets in which a word occurs (or in which two words co-occur) divided by the total number of tweets. This does not fully solve the problem of rare words, but at least NPMI is now directly comparable across different words.
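As a minimal sketch of the definitions above, the two measures can be computed directly from tweet counts. The function and parameter names (`pmi`, `npmi`, `n_xy`, `n_x`, `n_y`, `n_total`) are my own for illustration, and the handling of the two edge cases is an assumption rather than something fixed by the definitions.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """PMI from tweet counts; each probability is a count over all tweets."""
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log(p_xy / (p_x * p_y))

def npmi(n_xy, n_x, n_y, n_total):
    """NPMI: PMI divided by -log P(x,y), which bounds it to [-1, 1]."""
    if n_xy == 0:
        return -1.0  # assumed convention: limiting value for no co-occurrence
    if n_xy == n_total:
        return 1.0   # assumed convention: degenerate case, -log P(x,y) is 0
    p_xy = n_xy / n_total
    return pmi(n_xy, n_x, n_y, n_total) / -math.log(p_xy)

# Two pairs of words that always occur together, in one million tweets:
# a rare pair (2 tweets) and a frequent pair (1000 tweets).
rare = (2, 2, 2, 1_000_000)
frequent = (1000, 1000, 1000, 1_000_000)
print(pmi(*rare), pmi(*frequent))    # PMI is larger for the rare pair
print(npmi(*rare), npmi(*frequent))  # NPMI is (up to rounding) 1 for both
```

The toy counts illustrate the point about bounds: under PMI the rare pair scores higher simply because it is rare, while NPMI puts both perfectly co-occurring pairs at the same upper bound.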

Switching to practical examples, I downloaded 2 months of Twitter data from early 2011 (a 10% sample) and computed the NPMI values for all pairs of words, with the entire tweet as context. For example, given the word ‘police’, its top NPMI co-occurring words are: officers (0.4299), officer (0.4153), investigating (0.4000), arrest (0.3826), brutality (0.3685), arrested (0.3458), investigate (0.3175), burglary (0.3148). You can see that we get both words that are parts of collocations (‘police officer(s)’) and semantically related words (‘arrest’, ‘brutality’).
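A rough sketch of this kind of ranking, on a handful of made-up tweets, might look as follows. This is not the code used for the numbers above: the function name `top_cooccurring`, the whitespace tokenisation and the `min_count` cut-off (which drops words seen too rarely with the pivot) are all assumptions for illustration.

```python
import math
from collections import Counter

def top_cooccurring(tweets, pivot, k=5, min_count=2):
    """Rank words by NPMI with `pivot`, using each whole tweet as context."""
    n_total = len(tweets)
    word_counts = Counter()  # number of tweets containing each word
    pair_counts = Counter()  # number of tweets containing the word and the pivot
    for tweet in tweets:
        words = set(tweet.lower().split())  # naive whitespace tokenisation
        word_counts.update(words)
        if pivot in words:
            pair_counts.update(words - {pivot})
    scores = {}
    for word, n_xy in pair_counts.items():
        # Skip rare pairs, and pairs present in every tweet (NPMI undefined).
        if n_xy < min_count or n_xy == n_total:
            continue
        pmi = math.log(n_xy * n_total / (word_counts[word] * word_counts[pivot]))
        scores[word] = pmi / -math.log(n_xy / n_total)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Made-up toy tweets, only to show the call.
tweets = [
    "police officers arrest suspect",
    "police officers investigating burglary",
    "police arrest made downtown",
    "traffic is bad downtown",
    "great coffee this morning",
    "coffee with the officers",
]
for word, score in top_cooccurring(tweets, "police"):
    print(word, round(score, 4))
```

On this toy data ‘arrest’ ranks above ‘officers’, because ‘officers’ also appears in a tweet without ‘police’. With real data the tokenisation and the frequency cut-off would need more care.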

Returning to our original zodiac sign example, the problem of identifying the attributes specific to each sign is relatively simple. We compute NPMI between each zodiac hashtag and the words occurring in the same tweet. Then we take the top co-occurring words and keep only the adjectives. The top words for each sign are (size of words ~ NPMI; colour ~ frequency):

Hashtags are particularly good as ‘pivot’ words because they are generally unambiguous. However, you might have noticed that one sign is missing: ‘#cancer’. This is because most of the tweets with that hashtag mention the disease.
What do you think, does your zodiac’s word cloud describe you? Do you have any other ideas you want to check with NPMI? Please leave a comment below.

### References

1. Fano, R. M. (1961). Transmission of Information: A Statistical Theory of Communications. MIT Press.

2. Church, K. W. & Hanks, P. (1990, March). Word association norms, mutual information, and lexicography. In Computational Linguistics 16 (1), pp. 22–29.

3. Pecina, P. & Schlesinger, P. (2006). Combining association measures for collocation extraction. In Proceedings of the COLING/ACL, pp. 651–658.

4. Turney, P. D. (2001). Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning, ECML, pp. 491–502.

5. Turney, P. D. (2002, July). Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, ACL, pp. 417–424.

6. Bouma, G. (2009). Normalized (Pointwise) Mutual Information in Collocation Extraction. In Proceedings of the Biennial GSCL Conference.

## About Daniel Preotiuc

Daniel is a Postdoctoral researcher at the University of Pennsylvania. His research is situated at the intersection of Natural Language Processing, Machine Learning and Social Science. His current interests include spatial and temporal learning models for text, user attribute prediction from text and Gaussian Processes, using large user-generated data coming from Social Media. Prior to joining UPenn, Daniel completed his PhD in Natural Language Processing and Machine Learning at the University of Sheffield, UK and was a researcher on the Trendminer EU FP7 project.

## One thought on “Zodiac sign stereotypes in Twitter”

1. Alex says:

A similar study was conducted in 2006 at a clinical research institute. This time it was based on public health patient records. The intent was, of course, to illustrate that correlation does not equal causation.

However, once the study was published in the Journal of Clinical Epidemiology, the media caught on like wildfire with several ‘reputable’ outlets reporting that astrology had been redeemed by science.

Needless to say, the institute had a little PR crisis on its hands, but managed to turn it into an opportunity to educate the public, as well as some media outlets.

Here’s a link to the paper: http://cver.upei.ca/files/cver/04_Astrological%20associations%20and%20illness_jce.pdf

Thank you for a very nice post. Looking forward to more.