By Dennis Baron
People judge you by the words you use. This warning, once the slogan of a vocabulary-building course, is now the mantra of the new science of culturomics.
In “Quantitative Analysis of Culture Using Millions of Digitized Books” (Michel, et al., Science, Dec. 17, 2010), a Harvard-led research team introduces “culturomics” as “the application of high throughput data collection and analysis to the study of human culture.” In plain English, they crunched a database of 500 billion words contained in 5 million books published between 1500 and 2008 in English and several other languages and digitized by Google. The resulting analysis provides insight into the state of these languages, how they change, and how they reflect culture at any given point in time.
In still plainer English, they turned Google Books into a massively multiplayer online game where players track word frequency and guess what writers from 1500 to 2008 were thinking, and why. The words you use tell the culturonomists exactly who you are, and they can even graph the results!
According to the psychologists and mathematicians on the culturomics team, reducing books and their words to numbers and graphs will finally give the fuzzy humanistic interpretation of history, literature, and the arts the rigorous scientific footing it has lacked for so long.
For example, the graph below tracks the frequency of the name Marc Chagall (1887-1985) in English and German books from 1900 to 2000, revealing a sharp dip in German mentions of the modernist Jewish artist from 1933 to 1945. You don’t need a graph to correlate Hitler’s ban on Chagall and his work with the artist’s disappearance from German print (other Jewish artists weren’t just censored by the Nazis, they were murdered), but it is interesting to note that both before and after the Hitler era, Chagall garners significantly more mentions in German books than he does in English ones.
One problem with the culturome data set is that books don’t always reflect the spoken language accurately. When the telephone was invented in 1876, Americans adopted hello as a greeting to use when answering calls. Before that time, hello was an extremely rare word that served as a way of hailing a boat or as an expression of surprise. But as the telephone spread across American cities, hello quickly became the customary greeting, first for telephone and then for face-to-face conversation.
Expanding the data set of written English to include not just books but also newspapers, periodicals, letters, and informal writing, as we find in the smaller, 400-million-word Corpus of Historical American English, gives a better idea of the frequency of words like hello. But crunching numbers doesn’t tell the whole story: we can infer from contemporary published accounts, many of them strong objections to the new term, that hello was much more common in speech than its occurrence in writing indicates.
It’s one thing to read a book and speculate about its meaning—that’s what readers are supposed to do. But culturomics crunches millions of books—more than the most ardent book club groupie could get through in a lifetime. Since mos