Medical conditions are predictable from social media posts
200 topics and their loadings across 37 medical conditions.
Top 10 words in each topic, ordered by descending conditional probability
Citation and relevant information for data files.
Facebook Topics

You might also use...
  • Conditional probabilities [.csv] (sparse matrix format)

Walk-through example

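The goal is to get the probability of a topic given the document:

    p(topic|document) = sum over words in the topic of p(topic|word) * p(word|document)

where p(word|document) is the word's count divided by the total word count of the document.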

For example, let's say we have topics with the following conditional probabilities for words:

    topic 1: a: 0.01, b: 0.02, c: 0.001
    topic 2: c: 0.02, d: 0.005

and two documents with the following frequencies of words:

    document 1: a: 2, b: 10, c: 3, d: 0, e: 6, f: 4
    document 2: a: 5, b: 3, c: 8, d: 4, e: 0, f: 10

The total word counts of the documents are therefore:

    document 1: 2 + 10 + 3 + 0 + 6 + 4 = 25
    document 2: 5 + 3 + 8 + 4 + 0 + 10 = 30
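
The totals above, and the relative frequencies derived from them, can be sketched in a few lines of Python. This is only an illustration using the example counts; the names `doc_counts` and `relative_frequencies` are hypothetical and not part of the released data files.

```python
# Word counts copied from the two example documents.
doc_counts = {
    "document 1": {"a": 2, "b": 10, "c": 3, "d": 0, "e": 6, "f": 4},
    "document 2": {"a": 5, "b": 3, "c": 8, "d": 4, "e": 0, "f": 10},
}


def relative_frequencies(counts):
    """Return each word's count divided by the document's total word count."""
    total = sum(counts.values())
    if total == 0:
        # An empty document has no usable frequencies.
        return {word: 0.0 for word in counts}
    return {word: n / total for word, n in counts.items()}


# Total word count per document (hypothetical helper structures).
totals = {name: sum(counts.values()) for name, counts in doc_counts.items()}
rel_freqs = {name: relative_frequencies(c) for name, c in doc_counts.items()}

for name in doc_counts:
    print(name, totals[name], rel_freqs[name])
```

Dividing by the total makes documents of different lengths comparable before any topic weights are applied.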

Document 1's use of each topic is given by summing the relative frequencies of the topic's words, each weighted by the word's conditional probability:

    p(topic1|document1): (2/25)*0.01 + (10/25)*0.02 + (3/25)*0.001 = 0.00892
    p(topic2|document1): (3/25)*0.02 + (0/25)*0.005 = 0.0024

while document 2's use of topics is:

    p(topic1|document2): (5/30)*0.01 + (3/30)*0.02 + (8/30)*0.001 ≈ 0.00393
    p(topic2|document2): (8/30)*0.02 + (4/30)*0.005 = 0.006
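
The same calculation can be reproduced programmatically. The following is a minimal sketch using only the numbers from this example; `topic_weights`, `doc_counts`, and `topic_usage` are hypothetical names chosen for illustration, and reading the actual data files is not shown because their column layout is not described here.

```python
# Conditional probabilities per topic, copied from the example.
topic_weights = {
    "topic 1": {"a": 0.01, "b": 0.02, "c": 0.001},
    "topic 2": {"c": 0.02, "d": 0.005},
}

# Word counts per document, copied from the example.
doc_counts = {
    "document 1": {"a": 2, "b": 10, "c": 3, "d": 0, "e": 6, "f": 4},
    "document 2": {"a": 5, "b": 3, "c": 8, "d": 4, "e": 0, "f": 10},
}


def topic_usage(counts, weights):
    """Sum of (relative word frequency * conditional probability) over the topic's words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Words missing from the document contribute a count of zero.
    return sum(counts.get(word, 0) / total * w for word, w in weights.items())


usage = {
    doc: {topic: topic_usage(counts, weights)
          for topic, weights in topic_weights.items()}
    for doc, counts in doc_counts.items()
}

for doc, scores in usage.items():
    for topic, score in scores.items():
        print(f"p({topic}|{doc}) = {score:.5f}")
```

Words that appear in a document but in no topic (here, e and f) still count toward the document's total, so they lower every topic's score without contributing to any of them.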

Link to Publication
APA Citation Bibtex Citation