Posts with tag pca
Back to all postsBefore end-of-semester madness, I was looking at how shifts in vocabulary usage occur. In many cases, I found, vocabulary change doesn’t happen evenly across across all authors. Instead, it can happen generationally; older people tend to use words at the rate that was common in their youth, and younger people anticipate future word patterns. An eighty-year-old in 1880 uses a world like “outside” more like a 40-year-old in 1840 than he does like a 40-year-old in 1880. The original post has a more detailed explanation.
I wanted to see how well the vector space model of documents I’ve been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you’re sick of the charts.] While at the genre level the separation looks pretty nice, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab’s Pamphlet One, made me suspect individual books would be sloppier. There are a couple different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. By the first two principal components, for example, we can make all the books in LCC subclasses “BF” (psychology) blue, and use red for “QE” (Geology), overlaying them on a chart of the first two principal components like I’ve been using for the last two posts:
I used principal components analysis at the end of my last post to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, here’s an improved (using all my data on the 10,000 most common words) version of that plot:
One of the most important services a computer can provide for us is a different way of reading. It’s fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.
And though a text can be a book, it can also be something much larger. Take library call numbers. Library of Congress headings classifications are probably the best hierarchical classification of books we’ll ever get. Certainly they’re the best human-done hierarchical classification. It’s literally taken decades for librarians to amass the card catalogs we have now, with their classifications of every book in every university library down to several degrees of specificity. But they’re also a little foreign, at times, and it’s not clear how well they’ll correspond to machine-centric ways of categorizing books. I’ve been playing around with some of the data on LCC headings classes and subclasses with some vague ideas of what it might be useful for and how we can use categorized genre to learn about patterns in intellectual history. This post is the first part of that.
Back to my own stuff. Before the Ngrams stuff came up, I was working on ways of finding books that share similar vocabularies. I said at the end of my second ngrams post that we have hundreds of thousands of dimensions for each book: let me explain what I mean. My regular readers were unconvinced, I think, by my first foray here into principal components, but I’m going to try again. This post is largely a test of whether I can explain principal components analysis to people who don’t know about it so: correct me if you already understand PCA, and let me know me know what’s unclear if you don’t. (Or, it goes without saying, skip it.)
Let me get ahead of myself a little.
For reasons related to my metadata, I had my computer assemble some data on the frequencies of the most common words (I explain why at the end of the post.) But it raises some exciting possibilities using forms of clustering and principal components analysis (PCA); I can’t resist speculating a little bit about what else it can do to help explore ways different languages intersect. With some charts at the bottom.