LDA-based Email Browser
Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I’ve been working on some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.
I threw up a simple demo app to view the organized documents here.
What is Latent Dirichlet Allocation?
Briefly, given a set of documents, LDA tries to learn the latent topics underlying the set. It represents each document as a mixture of topics (generated from a Dirichlet distribution), each of which emits words with a certain probability.
For example, given the sentence “I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car”, an LDA model might represent this sentence as 75% about music (a topic which, say, emits the word Bieber with 10% probability, Gaga with 5% probability, radio with 1% probability, and so on) and 25% about cars (a topic which might emit driving with 15% probability and car with 10% probability).
If you’re familiar with latent semantic analysis, you can think of LDA as a generative, probabilistic version of it.
(For a more in-depth explanation, I wrote an introduction to LDA here.)
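To make the generative story concrete, here is a minimal Python sketch of how LDA imagines a document being written: draw a topic mixture from a Dirichlet, then for each word pick a topic from that mixture and emit a word from the topic. The topic names, vocabularies, probabilities, and the `alpha` value are all made up for illustration, and this only simulates the generative direction; it is not the inference code used for the Palin emails.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, alpha, n_words, rng):
    """Generate one document following LDA's generative story.

    topic_word_probs maps topic name -> {word: probability}.
    Returns the document's topic mixture and its list of words.
    """
    topics = list(topic_word_probs)
    # 1. Draw this document's topic mixture from a symmetric Dirichlet prior.
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(topics, weights=theta, k=1)[0]
        # 3. Emit a word from that topic's word distribution.
        vocab = list(topic_word_probs[topic])
        weights = [topic_word_probs[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights, k=1)[0])
    return dict(zip(topics, theta)), words


# Toy topics; every probability here is invented for the example.
TOPICS = {
    "music": {"bieber": 0.4, "gaga": 0.3, "radio": 0.2, "listened": 0.1},
    "cars": {"driving": 0.5, "car": 0.4, "around": 0.1},
}

rng = random.Random(0)
mixture, words = generate_document(TOPICS, alpha=0.5, n_words=10, rng=rng)
print(mixture)
print(" ".join(words))
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the mixtures and the per-topic word probabilities that most plausibly produced them.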
Sarah Palin Email Topics
Here’s a sample of the topics learnt by the model, as well as the top words for each topic. (Names, of course, are based on my own interpretation.)
- Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, …
- Energy/Fuel/Oil/Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …
- Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …
- Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …
- Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …
- Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …
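Lists like the ones above are typically read off a fitted model by sorting each topic's word weights. A small sketch of that step, assuming the model has already been reduced to a plain dictionary of per-topic word weights (the `top_words` helper and the weights below are hypothetical, chosen only so the example runs):

```python
def top_words(topic_word_weights, k=5):
    """Return the k highest-weight words for each topic, best first."""
    result = {}
    for topic, weights in topic_word_weights.items():
        # Sort by descending weight; break ties alphabetically for stable output.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        result[topic] = [word for word, _ in ranked[:k]]
    return result


# Invented weights for a handful of words; not values from the actual model.
weights = {
    "Gas": {"pipeline": 0.03, "gas": 0.08, "agia": 0.02, "oil": 0.05, "tax": 0.01},
    "Wildlife": {"moose": 0.04, "game": 0.07, "fish": 0.06, "bears": 0.02},
}

tops = top_words(weights, k=3)
for topic, words in tops.items():
    print(f"{topic}: {', '.join(words)}")
```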
Here’s a sample email from the wildlife topic:
I also thought the classification for this email was really neat: the LDA model labeled it as 10% in the Presidential Campaign/Elections topic and 90% in the Wildlife topic, and it’s precisely a wildlife-based protest against Palin as a choice for VP:
Future Analysis
In a future post, I’ll perhaps see if we can glean any interesting patterns from the email topics. For example, for a quick graph now, if we look at the percentage of emails in the Trig/Family/Inspiration topic across time, we see that there’s a spike in April 2008 — exactly (and unsurprisingly) the month in which Trig was born.
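One way such a per-month percentage could be computed, sketched in Python: assign each email to its highest-weight topic, then count by month. This is an assumption about the method rather than a description of how the graph was actually made (one could equally average the topic proportions instead), and the dates and mixtures below are invented toy data, not figures from the emails.

```python
from collections import defaultdict
from datetime import date


def monthly_topic_share(emails, topic):
    """Percentage of emails per month whose dominant topic is `topic`.

    emails is an iterable of (sent_date, {topic: proportion}) pairs.
    Returns {(year, month): percentage}, ordered by month.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for sent, mixture in emails:
        month = (sent.year, sent.month)
        totals[month] += 1
        # Treat the highest-weight topic as the email's label.
        if max(mixture, key=mixture.get) == topic:
            hits[month] += 1
    return {m: 100.0 * hits[m] / totals[m] for m in sorted(totals)}


# Toy data, made up purely to exercise the function.
emails = [
    (date(2008, 3, 10), {"family": 0.1, "gas": 0.9}),
    (date(2008, 3, 22), {"family": 0.7, "gas": 0.3}),
    (date(2008, 4, 19), {"family": 0.8, "gas": 0.2}),
    (date(2008, 4, 25), {"family": 0.6, "gas": 0.4}),
    (date(2008, 5, 2), {"family": 0.2, "gas": 0.8}),
]

share = monthly_topic_share(emails, "family")
for (year, month), pct in share.items():
    print(f"{year}-{month:02d}: {pct:.0f}%")
```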
Tags: latent dirichlet allocation, lda, nlp, r, sarah palin, topic models



Did you specify the number of topics ahead, or did you use some kind of Dirichlet process to let the topics create themselves?
I specified the number of topics ahead of time. I haven’t played around with HDP-LDA before, though, so that would be fun to try!
Really nice
It would be nice to see some (shared) work on UX for the browser.
see
http://www.sccs.swarthmore.edu/users/08/ajb/tmve/wiki100k/browse/topic-list.html
and http://code.google.com/p/tmve/updates/list
and David’s post
https://lists.cs.princeton.edu/pipermail/topic-models/2011-March/001242.html
Talking about nonparametrics… it just optimizes the perplexity given the beta by choosing the number of topics for you.
Cool, thanks for the links, the wikipedia browser is awesome! Yeah, I’d love to improve on the UX once I get the time (or if someone else wants to help me =)).
Looks cool. I’d like to play with the data myself. Where were you able to download the OCR’ed data from? The best I could find is a searchable archive on Crivella West’s site; no link to download the actual email text.
http://opani.com/help/sarah-palin-email
Would love to see how a storyline-based approach like
http://www.cs.cmu.edu/~epxing/papers/2011/ahmed_etal_AISTAT11.pdf does on these data. Unfortunately, I don’t know of an implementation of it.
D’oh, I totally forgot to link to the data. Besides the Opani site Arnim linked to, it’s also available from Sunlight Labs: https://github.com/sunlightlabs/sarahs_inbox
Are you planning on sharing the code for public use?
Yep, I put the code here: https://github.com/echen/sarah-palin-lda
Thanks a lot! The code is helpful.
How many topics did you use? From your browser page it looks like you only used 17 (surprisingly few!) — or are those just a selected subset of the topics? Did you write your own LDA software or use an existing package? Also, how did you preprocess your data?
Yep, those are just a selected subset (17 would have been an odd number to choose O:-)). I originally used 30 topics, but some either weren’t terribly interesting from a browsing point of view (e.g., one topic seemed to cover the Yahoo ads at the bottom of a bunch of emails) or were kind of random and hard to categorize.
I initially rolled my own LDA package, but I switched later to an LDA package from Stanford (http://nlp.stanford.edu/software/tmt).
Preprocessing was fairly standard. I removed about 100 common stopwords, filtered out super rare and common terms, lowercased everything, and removed punctuation. In particular, I didn’t really do any special processing of the email structure — at first, I tried selecting only the latest email in each document (e.g., if the email was a reply to another email or was part of a thread, I tried to select only the most recent email) and stripped out things like headers, but in the end I just kept the whole document.
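For anyone wanting to try something similar, here is a rough Python sketch of that kind of preprocessing: lowercase, strip punctuation, drop stopwords, and filter terms by document frequency. It is not the code from the repository; the stopword set is a tiny stand-in for the roughly 100-word list, and the `min_df` and `max_df_ratio` thresholds are hypothetical.

```python
import re
from collections import Counter

# Tiny stand-in for a real stopword list.
STOPWORDS = {"the", "a", "to", "and", "of", "in", "is", "for"}


def tokenize(text):
    """Lowercase, keep runs of letters/digits (dropping punctuation), remove stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def preprocess(docs, min_df=2, max_df_ratio=0.9):
    """Tokenize docs, then drop terms that are too rare or too common.

    min_df: keep terms appearing in at least this many documents.
    max_df_ratio: keep terms appearing in at most this fraction of documents.
    """
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    keep = {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}
    return [[t for t in tokens if t in keep] for tokens in tokenized]


docs = [
    "The gas pipeline project is moving forward.",
    "Gas prices and the pipeline: an update!",
    "Moose hunting season and gas prices.",
]
cleaned = preprocess(docs)
print(cleaned)
```

In this toy run, "gas" is dropped for appearing in every document and the one-off words are dropped as too rare, which leaves only "pipeline" and "prices".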
Nice work!
The time aspect of these sorts of datasets has always fascinated me. Your “Trig/Family/Inspiration” topic example is a classic case of it, and I wonder how different the topics would be if you ran the model only on the data before April 8th vs. the data after April 8th.
I read a great paper once on the idea of running a sliding time window across the data, building a topic model each time and following the lineage of topics as they rise and fall over time; I wish I could remember which paper it was…
Great idea! I’m curious how the “Presidential Campaign/Elections” topic changes as well (e.g., maybe from rumors before the VP announcement, to support/criticism post-announcement, to comments about Tina Fey, to condolences post-election?).
And it’d be awesome to run a temporal LDA on a book series. I tried something related a while ago, using a naive Bayes model to track how Harry Potter characters evolved over time, but didn’t get good results.
[The paper Arnim linked to (http://www.cs.cmu.edu/~epxing/papers/2011/ahmed_etal_AISTAT11.pdf) sounds related, and googling for "temporal LDA" turns up a paper -- I haven't read either one, but maybe it was one of those?]