In experiments on word usage in Twitter, I've repeatedly noticed some very coherent groups of hashtags and words: those belonging to astrology. Apparently, many users post horoscope information, statements or comments and tag them with the name of the zodiac sign. So I wondered (having pretty much tried to ignore astrology all my life): which traits do people most distinctively use to describe each sign?
#Taurus is extremely kind and sweet..until you betray them; then death is better.
To uncover this, I planned to use a combination of Twitter data and one of my favourite statistical measures – Pointwise Mutual Information (PMI) [1,2].
The PMI measures how often two words co-occur in the same context compared to what we would expect by chance.
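In the usual formulation [1,2], with p(x) and p(y) the probabilities of the two words and p(x, y) their co-occurrence probability:

```latex
\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}
```

A positive value means the words co-occur more often than independent words would, zero means independence, and negative values mean they tend to avoid each other.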
This measure is particularly popular in NLP (Natural Language Processing), where it is useful in a range of tasks: collocation extraction [3], lexical semantics [4] and sentiment analysis [5]. PMI's main drawback (in my opinion) is its lack of fixed bounds: without them, PMI values are biased towards overemphasising rare words. To address this issue, Bouma [6] introduced NPMI, whose values lie between -1 and 1.
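The normalisation divides PMI by the negative log of the joint probability:

```latex
\mathrm{NPMI}(x, y) = \frac{\mathrm{PMI}(x, y)}{-\log p(x, y)}
```

NPMI reaches 1 when the two words only ever occur together, 0 when they are independent, and tends to -1 when they never co-occur.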
In practice, the probabilities are ratios between the number of tweets a word occurs in (or a pair of words co-occurs in) and the total number of tweets. This does not fully solve the problem of rare words, but at least NPMI values are now directly comparable across different word pairs.
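As a minimal sketch of this computation, assuming tweets are already tokenised into lists of words (the function and variable names here are my own, not from the original pipeline):

```python
import math
from collections import Counter
from itertools import combinations

def npmi_scores(tweets):
    """NPMI for every pair of words, with the whole tweet as context.

    Probabilities are ratios of tweet counts: p(x) is the number of
    tweets containing x over the total n; p(x, y) likewise for pairs."""
    n = 0
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in tweets:
        n += 1
        words = set(tokens)  # each word counts once per tweet
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    scores = {}
    for pair, c_xy in pair_counts.items():
        x, y = tuple(pair)
        p_xy = c_xy / n
        pmi = math.log(p_xy / ((word_counts[x] / n) * (word_counts[y] / n)))
        # perfectly correlated pair: -log p(x, y) is 0, so NPMI is 1 by definition
        scores[pair] = 1.0 if c_xy == n else pmi / -math.log(p_xy)
    return scores
```

On real data you would of course stream the counts rather than hold all tweets in memory, and drop pairs below a minimum count to tame the rare-word noise mentioned above.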
Switching to practical examples, I downloaded two months of Twitter data from early 2011 (a 10% sample) and computed the NPMI values for all pairs of words, with the entire tweet as context. For example, given the word 'police', its top co-occurring words by NPMI are: officers (0.4299), officer (0.4153), investigating (0.4000), arrest (0.3826), brutality (0.3685), arrested (0.3458), investigate (0.3175), burglary (0.3148). You can see that we get both words forming collocations ('police officer(s)') and semantically related words ('arrest', 'brutality').
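Retrieving a word's top neighbours is then just a lookup over the pair scores; a small sketch, assuming the scores are keyed by frozenset pairs as above (the demo numbers are the 'police' values from this post):

```python
def top_neighbours(scores, pivot, k=8):
    """Words ranked by NPMI with `pivot`; `scores` maps frozenset pairs to NPMI."""
    ranked = [(next(iter(pair - {pivot})), s)
              for pair, s in scores.items()
              if pivot in pair]
    return sorted(ranked, key=lambda ws: ws[1], reverse=True)[:k]

# A few of the 'police' scores above as demo data:
scores = {frozenset(("police", "officers")): 0.4299,
          frozenset(("police", "arrest")): 0.3826,
          frozenset(("police", "burglary")): 0.3148}
```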
Returning to our original zodiac sign example, the problem of identifying sign-specific attributes is relatively simple. We compute NPMI between each zodiac hashtag and the words occurring in the same tweets, then take the top co-occurring words and keep only the adjectives. The top words for each sign are shown below (word size ~ NPMI; colour ~ frequency):
Hashtags are particularly good as 'pivot' words because they are generally unambiguous. However, you might have noticed that one sign is missing: '#cancer'. This is because most tweets containing it mention the disease instead.
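The adjective-filtering step above can be sketched with any part-of-speech tagger; the post doesn't say which tool was used, so here the tagger is a parameter (e.g. nltk.pos_tag would fit, as it returns Penn Treebank tags such as 'JJ' for adjectives):

```python
def top_adjectives(ranked_words, pos_tag, k=20):
    """Keep only the adjectives among NPMI-ranked (word, score) pairs.

    `pos_tag` is any function mapping a list of words to (word, tag)
    pairs with Penn Treebank tags, e.g. nltk.pos_tag."""
    tags = dict(pos_tag([w for w, _ in ranked_words]))
    return [(w, s) for w, s in ranked_words
            if tags[w].startswith("JJ")][:k]  # JJ, JJR, JJS
```

Tagging isolated words is noisy; in practice one would rather tag the full tweets and aggregate the tags per word.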
What do you think: does your zodiac sign's word cloud describe you? Do you have any other ideas you would like to check with NPMI? Please leave a comment below.
[1] Fano, R. M. (1961). Transmission of Information: A Statistical Theory of Communications. MIT Press.
[2] Church, K. W. & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), pp. 22–29.
[3] Pecina, P. & Schlesinger, P. (2006). Combining association measures for collocation extraction. In Proceedings of COLING/ACL, pp. 651–658.
[4] Turney, P. D. (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning (ECML), pp. 491–502.
[5] Turney, P. D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 417–424.
[6] Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. In Proceedings of the Biennial GSCL Conference.