Differential Language Analysis for Construct Validation: Do People Differ by Astrological Sign?

Many Americans find astrology quite convincing. In fact, approximately 25% of Americans believe in astrology, 55% of 18 to 24 year olds think astrology is at least “sort of scientific”, and a recently released Huffington Post article on the Zodiac signs of world leaders has already accrued thousands of Facebook likes. My colleague Daniel Preotiuc-Pietro recently examined the stereotypical words that accompany these beliefs by analyzing the content of tweets containing astrological sign hashtags. For example, my star sign, #leo, was most distinguished by words like “loyal”, “dynamic”, “stubborn”, “generous”, and “affectionate”. However, do Leos actually differ in this way? Do people differ by their star sign at all?

I started investigating this question back in 2012 while working on what eventually became our first PLOS ONE paper: Personality, Gender, and Age in the Language of Social Media. My collaborators and I were thinking of applications for differential language analysis (DLA), our method which finds language features (e.g. words or phrases) that distinguish psychological and other human attributes. The suggestion was made that DLA could be used for psychological construct validation, which asks two questions: does the language emerging from DLA fit the theory of the construct, and how many language features, in total, emerge as significantly correlated? For example, do Leos use words indicating they are more stubborn or generous? How many words correlate with being a Leo?

Astrological (Zodiac) signs are a great way to prototype DLA for construct validation. Such signs seem to correspond with enduring traits that distinguish people. To believers, such signs are akin to a non-evidence-based Big 5 Personality Model, the most widely used model in Psychology. The Universal Psychic Guild represents this position:

The signs of the Zodiac can give us great insights into our day to day living as well as the many talents and special qualities we posses.

Daniel showed that differences clearly exist in descriptions of star signs. Here, we investigate whether differences clearly exist between people according to their star signs.

DLA on the Big Five and the Zodiac Signs

The gist of our method is to find the number of words and phrases that distinguish members of each of the 12 astrological signs and compare this number to what we get for the Big 5. First, we had to make some adjustments to make it a fair comparison: the Big 5 are continuous variables (ranging on a spectrum from very low to very high), whereas the Zodiac star signs are binary traits (you are either a Leo or not). So we turned the Big 5 variables into binary values: those scoring at least 1.33 standard deviations above the mean were labeled as “high trait” (e.g. high extraversion) and the rest as not. Our sample includes 34,000 Facebook users for whom we had birth dates (in order to label their star sign) and Big 5 personality scores. We ran the test (age and gender adjusted; controlled for multiple comparisons) over the 31,969 features that were mentioned at least once by at least 1% (340) of our Facebook users. More details on our method and data can be found in “The Nitty Gritty” at the end of the post.
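
To make the two kinds of binary labels concrete, here is a minimal Python sketch, not the code used for the analysis. The function names are hypothetical, and the sign cutoff dates follow one commonly cited Zodiac calendar (sources differ by a day here and there), so treat them as an assumption.

```python
from datetime import date
from statistics import mean, pstdev

# Start (month, day) of each sign under one commonly cited Zodiac calendar.
# Assumption: exact cutoffs vary by about a day across sources.
SIGN_STARTS = [
    ((1, 20), "aquarius"), ((2, 19), "pisces"), ((3, 21), "aries"),
    ((4, 20), "taurus"), ((5, 21), "gemini"), ((6, 21), "cancer"),
    ((7, 23), "leo"), ((8, 23), "virgo"), ((9, 23), "libra"),
    ((10, 23), "scorpio"), ((11, 22), "sagittarius"), ((12, 22), "capricorn"),
]

def star_sign(birth_date):
    """Return the sign whose start date is the latest one on or before the birthday."""
    key = (birth_date.month, birth_date.day)
    sign = "capricorn"  # Jan 1-19 belongs to the sign that began on Dec 22
    for start, name in SIGN_STARTS:
        if key >= start:
            sign = name
    return sign

def is_sign(birth_date, sign):
    """Binary Zodiac label: 1 if the person has the given sign, else 0."""
    return 1 if star_sign(birth_date) == sign else 0

def binarize_high_trait(scores, threshold_sd=1.33):
    """Binary Big 5 label: 1 if a score is at least threshold_sd SDs above the mean."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0] * len(scores)  # no variation, so nobody counts as "high"
    return [1 if (s - mu) / sigma >= threshold_sd else 0 for s in scores]
```

With labels in this form, "is a Leo" and "is high in extraversion" can be run through exactly the same test.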


Number of features significantly correlated with each of the Big Five personality traits (left, green) and each of the Zodiac star signs (right, blue). The y-axis is log scaled.

Results. The plot above shows the total number of words and phrases, out of a possible 31,969, that were significantly correlated (p<.05) with each of the Big 5 traits (left) and each of the 12 Zodiac signs (right). On the Big 5 side, high openness had the largest total number of correlates with 409. On the Zodiac side, Sagittarius had the largest number of correlates with 4. The y-axis is log scaled; otherwise one would not see the Zodiac bars.

The number of significantly correlated features differs by orders of magnitude between the astrological signs and the Big 5. The differential word clouds below put this in greater perspective. They show the most correlated words and phrases (up to 100) for each of (a) the 2 most distinguished personality traits (high openness and high extraversion) and (b) the 2 most distinguished star signs (Sagittarius and Leo). While the clouds for the five factor traits show a diverse set of language trends (e.g. extraverts talk much more about socializing: `party’, `boys’, `girls’, `call me’, `miss you’), it is difficult to find prevailing trends in the astrological results (e.g. Leos are more likely to mention that they are Leos, that they had lunch, and what something they just said means, as in “…that means I have to work” or “…that means a hell of a lot”).


Differential word clouds for the Big Five traits openness and extraversion (left) and the Zodiac star signs Sagittarius and Leo (right). The size of the word indicates its correlation strength, while the color indicates overall frequency of the word (grey is low, blue is medium, and red is high frequency). All results significant at p<.05.

For the Zodiac signs, we don’t see any patterns generalized across multiple linguistic features like we do with artistic interests and openness or socializing and extraversion (to name just a couple of examples). Still, that anything besides `leo’ came out is intriguing. Perhaps `had lunch’ and `to think about’ are spurious results. At the least, I had expected some results due to some social selections in life being dependent on one’s birth month (e.g. professional NHL players are much more likely to have been born in January through March). A future investigation?

Conclusion. Although no obvious language patterns emerge for the astrological signs, these results don’t prove with certainty that the signs are incapable of generalizing who we are. There may be psychological and behavioral factors not accounted for in language usage patterns. Our statistical significance tests also aren’t perfect for dichotomous data, and they can only provide evidence that a relationship exists, not evidence that it is absent. Regardless, the difference between 409 and 4 presents quite a convincing case that astrological signs do not effectively distinguish people by personality, a case for which this Leo will remain `loyal’!



The Nitty Gritty. We used differential language analysis to find the language most correlated with a given outcome (e.g. a personality trait), controlling for gender and age. Specifically, we looked at 31,969 words and phrases (sequences of 1 to 3 words) that were mentioned by at least 1% of our sample and, for multi-word phrases, whose words occurred together more often than one would expect from chance (i.e. having a point-wise mutual information value >= 3.0). We used the wonderful dataset of our collaborators at the MyPersonality project, from which we had birth dates for 46,092 users; these were easily translated into star signs according to the Zodiac calendar. These users had also written at least 1000 words in status updates on Facebook from July 2009 through June 2011 (so that we had a good sample of their language use), had scores for all five personality traits, and provided their age and gender (so that we could control our results for trivial gender and age distinctions). We controlled for gender by randomly selecting 17,000 males and 17,000 females, and we controlled for age by including it as a covariate in a standardized regression. We discretized the five factor traits at a threshold of 1.33 standard deviations, which yielded an average of 3,014 positives per trait, in line with the roughly 2,833 users (34,000 / 12) who are positive for each of the 12 star signs. Significance values were adjusted for false discovery using the Benjamini-Hochberg procedure.
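
For readers unfamiliar with the Benjamini-Hochberg procedure, the following is a minimal, dependency-free Python sketch of the standard step-up rule. It illustrates the technique rather than reproducing our analysis code, and the function name and `alpha` default are choices made for this example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the test is significant at FDR alpha."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Every test ranked at or below max_k is declared significant.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected
```

Applied to one p-value per feature, the number of `True` entries corresponds to the kind of count of significant features reported for each trait or sign. With tens of thousands of features, the threshold for the smallest p-values becomes very strict, which is why so few features survive when the underlying signal is weak.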

My first run of the analysis yielded an intriguing result: a few of the star signs correlated with phrases like “thanks for” and “the birthday wishes”. Why would certain signs be more likely to thank others for their birthday wishes? After entertaining a theory that people are more likely to thank others for birthday wishes in months with fewer holidays, I realized that Facebook was growing rapidly over the course of 2009 to 2011, so those born in months closer to June (the end of our dataset) would benefit from more activity around their birthdays. I tested this by controlling for the activity of Facebook on one’s birthday (i.e. the total number of posts made by anyone in the dataset on each month/day). After this, the birthday wishes result went away. This reminds me why many data scientists preach to always look at the data and ask if it really makes sense before drawing conclusions. It’s also an example of where the transparency of an open-vocabulary approach allows one to avoid spurious results (had we just used, say, a gratitude lexicon, it would not have been clear we were just capturing gratitude toward birthday wishes, which was clearly confounded with variance we didn’t want in the analysis).
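
As a rough illustration of the birthday-activity control, here is a minimal Python sketch of how such a covariate could be built. It assumes posts are available as `date` objects, and the function names are hypothetical rather than taken from our code.

```python
from collections import Counter
from datetime import date

def activity_by_day(post_dates):
    """Count posts made by anyone in the dataset on each (month, day)."""
    return Counter((d.month, d.day) for d in post_dates)

def birthday_activity(birth_date, activity):
    """Total posting volume on a user's birthday (0 if nobody posted that day)."""
    return activity[(birth_date.month, birth_date.day)]

# Example with made-up dates: two posts fall on June 1, one on January 15.
posts = [date(2010, 6, 1), date(2011, 6, 1), date(2010, 1, 15)]
activity = activity_by_day(posts)
june_birthday_volume = birthday_activity(date(1985, 6, 1), activity)
```

Each user's value would then enter the regression as one more covariate alongside age, so that a phrase only counts as distinguishing a sign if it does so beyond what overall posting volume on that day explains.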

I’m now going to take evasive action to avoid this blog post being too long: for further details, please see our PLOS ONE paper; besides the data sample and outcomes being different, we followed the same steps. Comments and questions are welcome below.

Thank you to David B. Yaden, Daniel Preotiuc-Pietro, Patrick Crutchley, Laura Smith, Gregory Park, Lyle Ungar, and Margaret Kern for their valuable feedback and help creating this post.


About Andy Schwartz

Andy is a Visiting Assistant Professor of Computer & Information Science at the University of Pennsylvania and will begin as an Assistant Professor at Stony Brook University (SUNY) in the Fall of 2015. His interdisciplinary research, often utilizing natural language processing and machine learning techniques, focuses on large-scale analyses to discover new behavioral and psychological factors of health and well-being as manifest through social media. He received his PhD in Computer Science from the University of Central Florida in 2011. His recent work has been featured in The Atlantic and The Washington Post. Follow @HAndySchwartz
