The power of Ngrams

For me, one sign of a really good book is that I learn things I wasn’t expecting to learn. I had that experience while reading almost every chapter of Uncharted: Big Data as a Lens on Human Culture. The book is written by the creators of Google’s Ngram Viewer, a tool that shows the frequency of any word or phrase (single words are 1-grams, 2-word phrases are 2-grams…) in the massive and continually growing corpus of books in the Google Books database. The most informative feature of Ngram Viewer is that you can compare the frequencies of different phrases with each other and see how their use changes over time (here’s a holiday phrase comparison that I made).


The book includes many ngram comparisons that are much more informative than mine. It tells the story of the Ngram Viewer’s birth and goes into more depth on a variety of uses. Maybe the most surprising use is that ngrams can reflect censorship efforts. If you look at the slope of the change in frequency for different people’s names during the Nazi regime, it becomes clear that some names were being censored (those ngrams have negative slopes for that time period) while others were rising in prominence (those have positive slopes). When compared with historical records, the ngram-based conclusions are strikingly accurate.
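To make the slope idea concrete, here is a minimal Python sketch. It is not the authors’ actual method, just an ordinary least-squares slope fitted to a name’s frequency over a span of years. The function names (`slope`, `trend`) are my own, and the numbers are made up for illustration; they are not real Ngram Viewer data.

```python
def slope(years, freqs):
    """Least-squares slope of frequency against year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(freqs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, freqs))
    var = sum((x - mean_x) ** 2 for x in years)
    return cov / var


def trend(years, freqs):
    """Label a series by the sign of its fitted slope."""
    s = slope(years, freqs)
    if s < 0:
        return "falling"
    if s > 0:
        return "rising"
    return "flat"


# Made-up illustrative numbers, NOT real Ngram Viewer data.
years = [1933, 1935, 1937, 1939, 1941, 1943, 1945]
name_a = [5.0, 4.2, 3.1, 2.5, 1.8, 1.2, 0.9]  # hypothetical suppressed name
name_b = [1.0, 1.6, 2.3, 3.1, 3.8, 4.1, 4.9]  # hypothetical promoted name

print("name_a:", trend(years, name_a))
print("name_b:", trend(years, name_b))
```

A negative slope on its own doesn’t prove censorship, of course; as the book describes, the interesting part is checking these trends against historical records.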

The book shows only a tiny slice of what the Ngram Viewer can be used to learn. It’s the epitome of cognitive science, piecing together wisdom from many disciplines. Ngram Viewer is a great tool, whether you’re at home on the couch wondering when the phrase “Merry Christmas” became popular or doing paid research, and this book was a cool way to learn more about it.

I’m partial to this comparison (found on the About page for Ngram Viewer)

Merry Christmas or Happy New Year?

Here’s a fun graph of the competition between the phrases “Merry Christmas” and “Happy New Year” (at least in published books that have been incorporated into Google Books), courtesy of Google’s Ngram Viewer. More on this enlightening tool to come, inspired by Uncharted: Big Data as a Lens on Human Culture.

The y-axis answers two questions. Of all the bigrams (2-word phrases) contained in Google’s sample of books [written in English and published in the United States], what percentage are “Merry Christmas”? And of all the trigrams (3-word phrases), what percentage are “Happy New Year”?
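If you want to see how that percentage works, here is a rough Python sketch of the calculation on a toy sentence. The helper names (`ngrams`, `ngram_share`) are my own, and I’m assuming simple lowercase, whitespace-based tokenization; the real Ngram Viewer processes its corpus in its own way, so treat this as an illustration of the idea only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_share(text, phrase):
    """Percentage of all n-grams in `text` equal to `phrase`,
    where n is the number of words in `phrase`."""
    tokens = text.lower().split()   # assumption: naive whitespace tokenizer
    target = phrase.lower().split()
    grams = ngrams(tokens, len(target))
    if not grams:
        return 0.0
    counts = Counter(grams)
    return 100.0 * counts[" ".join(target)] / len(grams)


# Toy text, standing in for a (much larger) corpus of books.
sample = "merry christmas and happy new year to all and merry christmas again"

print(ngram_share(sample, "Merry Christmas"))  # share among bigrams
print(ngram_share(sample, "Happy New Year"))   # share among trigrams
```

Note that the two phrases are measured against different totals: “Merry Christmas” against all 2-word phrases and “Happy New Year” against all 3-word phrases, which is exactly why the y-axis needs that explanation.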