You are looking at archived content from my "Bookworm" blog, an experiment that ran from 2014-2016. Not all content may work. For current posts, see here.

Posts with tag bookworm

Jan 19 2016

Arranging the novels from the txtLAB

Andrew Piper announced yesterday that the McGill text lab is releasing their corpus of modern novels in three languages. One of my first thoughts with any corpus is: what existing Bookworm methods might add some value here? It only took about ten minutes to write the code to import it into a bookworm; the challenge is figuring out how methods developed for millions of books can be useful on a set of just 450.

Dec 15 2015

Hansard

A first pass at understanding the potential of the Hansard corpus through a Bookworm browser.

Oct 20 2015

Bookworm D3 layouts

There’s no full description of the D3 bookworm package yet, because it’s still something of a moving target.

Sep 18 2015

Bookworm 0.4, with new features and usability improvements

Bookworm 0.4 is now released on GitHub. It contains a number of improvements to the code from over the summer. Drawing on the experience of the many people who have used it so far, it makes the existing code much, much more sensible for anyone wanting to build a bookworm on their own collections of texts. All the stages (installation, configuration, and testing) are now a lot easier. So if you have a collection of texts you wish to explore, I welcome you to test it out. (I’ll explain at more length later, but for the absolute lowest investment of time you can just run a prebuilt bookworm virtual machine using vagrant.)

Sep 14 2015

Genre Classification from Topic models

This post is just kind of playing around in code, rather than any particular argument. It shows the outlines of using the features stored in a Bookworm for all sorts of machine learning, by testing how well a logistic regression classifier can predict IMDB genre based on the subtitles of television episodes.

Jon “Fitz” Fitzgerald was asking me for an example of training a genre classifier on textual data. To reduce the dimensionality of the model, we have been thinking of using the topics from a topic model as the predictors instead of the tokens. The idea is that classifiers with more than several dozen variables tend to get finicky and hard to interpret, and with more than a few hundred they become completely unmanageable. If you want to classify texts based on their vocabularies, you have two choices:

1. Only use some of the words as predictors. This is the normal approach, used from Mosteller and Wallace on the Federalist papers through to Ted Underwood’s work on classifying genre in books.
2. Aggregate the words somehow.¹ The best way, from an information-theoretic point of view, is to use the first several principal components of the term-document matrix as your aggregators. This is hard, though, because principal component vectors are hard to interpret and, on very large corpora (for which the term-document matrix doesn’t fit in memory), somewhat tedious to calculate.
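
As a rough illustration of the second choice, here is a minimal sketch of reducing a term-document matrix to its first two principal components with base R’s prcomp. The matrix is randomly generated toy data, and the names tdm, pca, and reduced are made up for this example; none of it comes from the television corpus.

```r
set.seed(42)

# Toy term-document matrix: 6 documents (rows) by 5 word counts (columns).
# These counts are simulated, not drawn from any real corpus.
tdm <- matrix(rpois(30, lambda = 3), nrow = 6,
              dimnames = list(paste0("doc", 1:6), paste0("word", 1:5)))

# Centre the columns and compute the principal components.
pca <- prcomp(tdm, center = TRUE, scale. = FALSE)

# Keep the first two components as aggregated features: one row per document.
reduced <- pca$x[, 1:2]

# Each component is a weighted mix of every word, which is why the
# resulting features are hard to read.
loadings <- pca$rotation[, 1:2]
```

Looking at loadings shows the interpretability problem directly: every word contributes something to every component.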

Using topics as predictors is somewhat appealing. They should be worse at classification than principal components, but they should also be readable like words, to some degree. I haven’t seen it done that much because there are some obvious problems: topic models are time-consuming to fit, and they usually throw out stopwords, which tend to be extremely successful at classification problems. That’s where a system like Bookworm, which will just add in a one-size-fits-all topic model with a single command, can help; it lets you try loading in a pre-computed model to see what works.

So this post just walks through some of the problems with genre classification in a corpus of 44,000 television episodes and a pre-fit topic model. I don’t compare it directly to existing methods, in large part because it quickly becomes clear that “IMDB genre” is such a flexible thing that it’s all but impossible to assess whether a classifier is working on anything but a subjective level. But I do include all of the code for anyone who wants to try fitting something else.

Code, Descriptions, and charts

Note: all the code below assumes the libraries dplyr, bookworm, and tidyr are loaded.

First we make the data wide (columns as topic labels). That gives us 127 topics across 44,258 episodes of television, each tagged with a genre by IMDB.

wide = movies %>% spread(topic_label,WordsPerMillion,fill=0)
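
To make the reshaping step concrete, here is a small sketch on made-up data. The long data frame and its values are hypothetical stand-ins for the movies table; only the column names topic_label, WordsPerMillion, and primary_genre are taken from the code in this post.

```r
library(tidyr)

# Hypothetical long-format data: one row per (episode, topic) pair.
long <- data.frame(
  episode_id = c("e1", "e1", "e2"),
  primary_genre = c("Comedy", "Comedy", "Drama"),
  topic_label = c("police", "family", "police"),
  WordsPerMillion = c(1200, 300, 4500),
  stringsAsFactors = FALSE
)

# One row per episode, one column per topic; topics that never appear
# in an episode are filled with 0 rather than NA.
wide_toy <- spread(long, topic_label, WordsPerMillion, fill = 0)
```

In newer versions of tidyr, pivot_wider(names_from = topic_label, values_from = WordsPerMillion, values_fill = 0) does the same job.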

Now we’ll train a model. We’re going to do logistic regression (in R, a glm with family=binomial), but I’ll define a more general function that can take an svm or something more exotic for testing.

# Our feature set is a matrix without the categorical variables and a junk variable getting introduced somehow.
modeling_matrix = wide %>% select(-TV_show,-primary_genre,-season,-episode,-`0`) %>% as.matrix
training = sample(c(TRUE,FALSE),nrow(modeling_matrix),replace=TRUE)

dim(modeling_matrix)

## [1] 44258   127

training_frame = data.frame(modeling_matrix[training,])
training_frame$match = NA
build_model = function(genre,model_function=glm,...) {
  # genre is a string naming one of the primary_genre values;
  # model_function is a fitting function such as glm or svm;
  # ... holds any further arguments passed on to that function.
  # This assignment changes only a local copy of training_frame.
  training_frame$match = as.numeric(wide$primary_genre == genre)[training]
  # The predictors are the topic columns; the categorical columns were dropped above.
  model = model_function(match ~ ., training_frame,...)
  # Return the fitted model visibly.
  model
}
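
For anyone who wants to see the whole loop end to end, here is a self-contained sketch of the same pattern on simulated data: fit a logistic regression for one genre on the training half, then score the held-out half. Everything in it (the two fake topic columns, the genre labels, the helper fit_genre, and the 0.5 cutoff) is an assumption for illustration, not the code or data behind this post.

```r
set.seed(1)
n <- 400

# Simulated stand-ins for topic columns; not the real subtitle data.
features <- data.frame(topic_a = rnorm(n), topic_b = rnorm(n))
# The fake genre depends on topic_a plus noise, so it is learnable but not perfectly.
genre <- ifelse(features$topic_a + rnorm(n, sd = 0.5) > 0, "Comedy", "Drama")

# Random split into training and held-out rows, as in the post.
training <- sample(c(TRUE, FALSE), n, replace = TRUE)

# Hypothetical helper mirroring build_model: a one-vs-rest model for one genre.
fit_genre <- function(genre_name, model_function = glm, ...) {
  frame <- features[training, , drop = FALSE]
  frame$match <- as.numeric(genre[training] == genre_name)
  model_function(match ~ ., data = frame, ...)
}

model <- fit_genre("Comedy", family = binomial)

# Predicted probabilities for the episodes the model never saw.
held_out <- features[!training, , drop = FALSE]
predicted <- predict(model, newdata = held_out, type = "response")

# Share of held-out episodes classified correctly at a 0.5 cutoff.
accuracy <- mean((predicted > 0.5) == (genre[!training] == "Comedy"))
```

Passing family = binomial through the dots is what turns the default glm into a logistic regression; swapping in another fitting function only requires changing model_function and its extra arguments.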

Here’s a plot of the top genres. I’ll model on the first ten, because there’s a nice break before game show, reality, and fantasy.

library(ggplot2)

wide %>%
  filter(training) %>%
  group_by(primary_genre) %>%
  summarize(episodes=n()) %>%
  mutate(rank=rank(-episodes)) %>%
  arrange(rank) %>%
  ggplot() +
  geom_bar(aes(y=episodes,x=reorder(primary_genre,episodes),fill=rank<=7),stat="identity") +
  coord_flip() +
  labs(title="most common genres, by number of episodes in training set")

Jul 02 2015

Movie Geographies

I just saw Matt Wilkens’ talk at the Digital Humanities conference on places mentioned in books; I wanted to put up, mostly for him, a quick stab at some of the raw data running the equivalents on my movie bookworm.

May 22 2015

Pace of Change replications

This is a quick post to share some ideas for interacting with the data underlying the recent article by Ted Underwood and Jordan Sellers on the pace of change in literary standards for poetry.

May 11 2015

Story Time.

Here are some interactives I’ve made in preparation for my talk at the Literary Lab at Stanford on Tuesday on plot arcs in television shows based on underlying language.

Apr 20 2015

Writing up text analysis for immediate interaction and long-term persistence.

Though more and more outside groups are starting to adopt Bookworm for their own projects, I haven’t yet written quite as much as I’d like about how it should work. This blog is an attempt to rectify that, and to begin to explain how a combination of blogging software, interactive textual visualizations, and an exploratory data analysis API for bag-of-words models can make it possible to quickly and usefully share texts through a Bookworm installation.