Topic Modeling the Sarah Palin Emails

27 Jun

[Image: LDA-based Email Browser]

Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I’ve been working on some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.

I threw up a simple demo app to view the organized documents here.

What is Latent Dirichlet Allocation?

Briefly, given a set of documents, LDA tries to learn the latent topics underlying the set. It represents each document as a mixture of topics (generated from a Dirichlet distribution), each of which emits words with a certain probability.

For example, given the sentence “I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car”, an LDA model might represent this sentence as 75% about music (a topic which, say, emits the words Bieber with 10% probability, Gaga with 5% probability, radio with 1% probability, and so on) and 25% about cars (which might emit driving with 15% probability and cars with 10% probability).
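To make the "mixture of topics" idea concrete, here is a minimal sketch of the generative story for the example sentence. The topic names, the `<other>` bucket that absorbs the rest of each topic's vocabulary, and the function names are all made up for illustration. The sketch also skips the step where the mixture itself is drawn from a Dirichlet distribution and simply fixes it at 75% music and 25% cars.

```python
import random

# Hypothetical topics: each maps words to emission probabilities.
# The numbers for the named words come from the example sentence above;
# "<other>" is a stand-in for the rest of each topic's vocabulary.
TOPICS = {
    "music": {"bieber": 0.10, "gaga": 0.05, "radio": 0.01, "<other>": 0.84},
    "cars": {"driving": 0.15, "cars": 0.10, "<other>": 0.75},
}

# Fixed topic mixture for the example sentence (in full LDA this would be
# sampled from a Dirichlet distribution rather than written down by hand).
DOC_MIXTURE = {"music": 0.75, "cars": 0.25}


def generate_words(mixture, topics, n_words, seed=0):
    """For each word slot, pick a topic from the document's mixture,
    then pick a word from that topic's word distribution."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(mixture), weights=list(mixture.values()))[0]
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append((topic, word))
    return words


def word_probability(word, mixture, topics):
    """Probability of seeing `word` in a slot, summed over the topics
    that could have emitted it."""
    return sum(
        weight * topics[topic].get(word, 0.0)
        for topic, weight in mixture.items()
    )
```

Under these made-up numbers, `word_probability("bieber", DOC_MIXTURE, TOPICS)` is 0.75 × 0.10 = 0.075. Fitting an LDA model runs this story in reverse: given only the words, it tries to infer the mixtures and the topics.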

If you’re familiar with latent semantic analysis, you can think of LDA as a generative version.

(For a more in-depth explanation, I wrote an introduction to LDA here.)

Sarah Palin Email Topics

Here’s a sample of the topics learnt by the model, as well as the top words for each topic. (Names, of course, are based on my own interpretation.)

  • Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, …
  • Energy/Fuel/Oil/Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …
  • Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …
  • Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …
  • Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …
  • Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …

Here’s a sample email from the wildlife topic:

[Image: Wildlife Email]

I also thought the classification for this email was really neat: the LDA model labeled it as 10% in the Presidential Campaign/Elections topic and 90% in the Wildlife topic, and it’s precisely a wildlife-based protest against Palin as a choice for VP:

[Image: Wildlife-VP Protest]

Future Analysis

In a future post, I may look for interesting patterns in the email topics. As a quick example for now, if we graph the percentage of emails in the Trig/Family/Inspiration topic over time, we see a spike in April 2008, exactly (and unsurprisingly) the month in which Trig was born.

[Image: Trig]
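The monthly percentages behind a graph like this could be computed roughly as follows. This is only a sketch: the records are made up, the topic names are placeholders, and counting an email as "in" a topic when its weight is at least `threshold` is my assumption for illustration, not necessarily how the graph was produced.

```python
from collections import defaultdict
from datetime import date


def monthly_topic_share(emails, topic, threshold=0.5):
    """Return {(year, month): fraction of that month's emails whose
    weight for `topic` is at least `threshold`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for sent, weights in emails:
        key = (sent.year, sent.month)
        totals[key] += 1
        if weights.get(topic, 0.0) >= threshold:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in sorted(totals)}


# Made-up (date, topic weights) records, purely for illustration.
emails = [
    (date(2008, 3, 10), {"family": 0.1, "gas": 0.9}),
    (date(2008, 3, 22), {"family": 0.7, "gas": 0.3}),
    (date(2008, 4, 18), {"family": 0.9, "gas": 0.1}),
    (date(2008, 4, 20), {"family": 0.8, "gas": 0.2}),
    (date(2008, 4, 25), {"family": 0.2, "gas": 0.8}),
    (date(2008, 5, 2), {"family": 0.0, "gas": 1.0}),
]

shares = monthly_topic_share(emails, "family")
```

An alternative would be to average each month's topic weights instead of thresholding them, which avoids having to pick a cutoff.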


18 Responses to “Topic Modeling the Sarah Palin Emails”

  1. Leon Palafox June 28, 2011 at 12:07 am #

    Did you specify the number of topics ahead, or did you use some kind of Dirichlet process to let the topics create themselves?

    • Edwin Chen June 28, 2011 at 4:02 am #

      I specified the number of topics ahead of time. I haven’t played around with HDP-LDA before, though, so that would be fun to try!

  2. Arnim Bleier June 28, 2011 at 8:44 am #

    Really nice

    so it would be nice to see some (shared) work on UX 4 the browser

    see
    http://www.sccs.swarthmore.edu/users/08/ajb/tmve/wiki100k/browse/topic-list.html

    and http://code.google.com/p/tmve/updates/list

    and David's post
    https://lists.cs.princeton.edu/pipermail/topic-models/2011-March/001242.html

    Talking about nonparametrics… it just optimizes the perplexity given the beta by choosing the number of topics 4 you

    • Edwin Chen June 28, 2011 at 11:38 am #

      Cool, thanks for the links, the wikipedia browser is awesome! Yeah, I’d love to improve on the UX once I get the time (or if someone else wants to help me =)).

  3. Andy June 28, 2011 at 11:56 am #

    Looks cool. I’d like to play with the data myself. Where were you able to download the OCR’ed data from? The best I could find is a searchable archive on Crivella West’s site; no link to download the actual email text.

  4. Shreyas Karnik June 29, 2011 at 5:12 am #

    Are you planning on sharing the code for public use?

  5. Anonymous July 6, 2011 at 7:21 pm #

    How many topics did you use? From your browser page it looks like you only used 17 (surprisingly few!) — or are those just a selected subset of the topics? Did you write your own LDA software or use an existing package? Also, how did you preprocess your data?

    • Edwin Chen July 6, 2011 at 9:04 pm #

      Yep, those are just a selected subset (17 would have been an odd number to choose O:-)). I originally used 30 topics, but some either weren’t terribly interesting from a browsing point of view (e.g., one topic seemed to cover the Yahoo ads at the bottom of a bunch of emails) or were kind of random and hard to categorize.

      I initially rolled my own LDA package, but I switched later to an LDA package from Stanford (http://nlp.stanford.edu/software/tmt).

      Preprocessing was fairly standard. I removed about 100 common stopwords, filtered out super rare and common terms, lowercased everything, and removed punctuation. In particular, I didn’t really do any special processing of the email structure — at first, I tried selecting only the latest email in each document (e.g., if the email was a reply to another email or was part of a thread, I tried to select only the most recent email) and stripped out things like headers, but in the end I just kept the whole document.
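A rough sketch of that kind of preprocessing is below. The stopword list is a tiny placeholder (the actual list of about 100 words isn't given), and the `min_docs` and `max_doc_fraction` cutoffs for "super rare" and "super common" terms are hypothetical values chosen only to make the example run.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would have about 100 entries.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def preprocess(documents, min_docs=2, max_doc_fraction=0.9):
    """Tokenize every document, then keep only terms that appear in at
    least `min_docs` documents and in at most `max_doc_fraction` of them."""
    tokenized = [tokenize(doc) for doc in documents]

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    keep = {
        word
        for word, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_doc_fraction
    }
    return [[w for w in tokens if w in keep] for tokens in tokenized]
```

Note that the simple regular expression splits contractions such as "don't" into two tokens, which a more careful tokenizer would handle differently.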

  6. mat kelcey July 11, 2011 at 8:15 am #

    Nice work!

    The time aspect of these sorts of datasets has always fascinated me. Your “Trig/Family/Inspiration” topic example is a classic case of it, and I wonder how different the topics would be if you ran only the data before April 8th vs. the data after April 8th.

    I read a great paper once on the idea of running a sliding time window across the data, building a topic model each time and following the lineage of topics as they rise and fall over time; I wish I could remember which paper it was…

    • Edwin Chen July 11, 2011 at 12:54 pm #

      Great idea! I’m curious how the “Presidential Campaign/Elections” topic changes as well (e.g., maybe from rumors before the VP announcement, to support/criticism post-announcement, to comments about Tina Fey, to condolences post-election?).

      And it’d be awesome to run a temporal LDA on a book series. I tried something related a while ago, using a naive Bayes model to track how Harry Potter characters evolved over time, but didn’t get good results.

      [The paper Arnim linked to (http://www.cs.cmu.edu/~epxing/papers/2011/ahmed_etal_AISTAT11.pdf) sounds related, and googling for "temporal LDA" turns up a paper -- I haven't read either one, but maybe it was one of those?]

