Emily Thinks About DigiLit: Adventures in Distant Reading

To tackle this idea of “distant reading,” I read a selection of articles and blog posts on the subject and was then asked to use a couple of commonly accessible distant reading sites to put the idea into practice. Distant reading is the practice of taking an extremely large body of literary work (for example, all plays written in England between 1500 and 1700), feeding it into a computer program, and then analyzing the trends that emerge, such as word frequency or plot types. The idea is that recognizing patterns in an astronomical sample of texts will give us insight into the literary time period. This is a new idea, and I believe that distant reading, because it is so accessible, can become an excellent aid for scholars and theorists of literature.

To put this idea into practice, I used two widely available websites that perform this kind of big data analysis. First, I entered five different articles and blog posts on distant reading (referenced below) into a program called Voyant. Then I entered some data into Google’s Ngram program to gain a different perspective on the texts.

When you dump a collection of texts into Voyant, the program recognizes word repetition and presents it to the user in a variety of ways, such as graphs, clouds, and text highlighting. I was let down to find that the most used words in the readings were “the,” “of,” “to,” “a,” etc. I searched and searched, but couldn’t find a filter that would eliminate those seemingly benign words. (I call them benign because they didn’t align with my interests. However, as one article pointed out, Gothic literature can be defined by the overwhelming use of “the.”) While it is seemingly missing a filter, Voyant does allow the user to manipulate the data. I was able to select as many top words as I wanted and see how their graphs compared. Each word’s graph maps how often it was used in each reading. When comparing the graphs of certain words, we can start noticing trends throughout the readings. As Mae Capozzi noted in her blog “Reading from a Distance,” having visuals like these maps brings literary theory out of the phase of abstraction and gives it a presence that is almost physical (Capozzi). No longer do theorists and critics have to simply talk about their ideas; now they have visualizations.
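The kind of filter I was looking for can be sketched in a few lines of Python. This is only a rough illustration of counting words while skipping a stop list, not a description of how Voyant itself works; the function name `top_words`, the tiny stop list, and the sample sentence are all made up for the example.

```python
import re
from collections import Counter

# A tiny, hand-picked stop list for illustration only (an assumption);
# real tools rely on much longer lists.
STOP_WORDS = {"the", "of", "to", "a", "and", "in", "is", "it", "that"}

def top_words(text, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, skipping stop words."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

# Hypothetical sample sentence, not taken from the readings.
sample = "The practice of distant reading is the reading of a lot of books."
print(top_words(sample, n=3))
```

With the stop list applied, “reading” rises to the top of the sample instead of “the” and “of.”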

When I manipulated my data to show me a graph of how the most important top words were distributed throughout the readings, I got an interesting result. The words I chose to map out were: reading, more, books, topic, digital, humanities, literary, new, moretti, literature, words, and distant. These were the top words excluding articles, pronouns, state-of-being verbs, qualifiers, etc. If I had only seen this list of top words but never read the articles and blogs, I would still be able to infer that distant reading is a new development in the way we read literature that incorporates the digital humanities. If I had a larger sample of texts, I would probably be able to hone my understanding of distant reading into an even more accurate and precise definition.
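A graph like the one described above boils down to a small table: for each chosen word, how many times it appears in each reading. Here is a minimal sketch of that table, assuming each reading is available as plain text. The function name `word_distribution` and the two short stand-in documents are invented for the example; they are not the actual readings, and this is not Voyant’s own code.

```python
import re
from collections import Counter

def word_distribution(documents, words):
    """For each chosen word, list its count in each document, in order."""
    table = {w: [] for w in words}
    for text in documents.values():
        counts = Counter(re.findall(r"[a-z']+", text.lower()))
        for w in words:
            table[w].append(counts[w])  # Counter returns 0 for missing words
    return table

# Hypothetical stand-ins for the readings.
docs = {
    "reading_1": "Distant reading treats books as data.",
    "reading_2": "Close reading and distant reading can work together.",
}

dist = word_distribution(docs, ["reading", "distant", "books"])
for word, per_doc in dist.items():
    print(word, per_doc)
```

Each list holds one number per reading, which is what a line graph of a word across the readings would plot.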

My experience with Google Ngram was a little different from my experience with Voyant. I couldn’t find a way to enter all of the readings into that processor, and if I could have, I assume the data would have looked pretty much the same as it did in Voyant. Instead, I took the key concepts and entered them into the search bar. Initially I chose to examine the use of the terms “digital humanities,” “distant reading,” and “big data” in English from 1950 (the decade of the start of what we know as digital humanities) to 2008 (I tried to expand the search to 2015, but every time I tried it was reset to 2008). Ngram allowed me to see the trends in usage for these terms. All three started to trend upward in the nineties, likely thanks to the internet. The term “digital humanities” wasn’t used at all until the advent of the internet in the nineties. I found it interesting that the term “big data” has been used since the invention of the computer, as the computer gave people the opportunity to archive and catalogue. Usage of “big data” increased dramatically between 1990 and 2000, when it peaked.
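The idea behind a trend line like this can be sketched as follows: for each year, count how often the phrase appears and divide by the total number of phrases of the same length in that year’s texts. This is a simplified illustration and not Google’s actual pipeline. The function names and the toy corpus are made up for the example, and the toy numbers say nothing about real usage.

```python
import re

def ngrams(text, n):
    """Split text into lowercase words and return every run of n words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def phrase_share_by_year(texts_by_year, phrase):
    """Return, per year, the fraction of same-length n-grams equal to phrase."""
    n = len(phrase.split())
    target = phrase.lower()
    shares = {}
    for year, texts in sorted(texts_by_year.items()):
        grams = [g for t in texts for g in ngrams(t, n)]
        shares[year] = grams.count(target) / len(grams) if grams else 0.0
    return shares

# Hypothetical toy corpus keyed by year.
toy = {
    1990: ["We store data on tape.", "Reading is fun."],
    2000: ["Big data is here.", "Big data changes reading."],
}

shares = phrase_share_by_year(toy, "big data")
for year, share in shares.items():
    print(year, round(share, 3))
```

Note that the count only matches the words themselves. It has no way of knowing what a writer meant by them.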

Using Ngram in this way also provided evidence of the pitfalls of distant reading. In the graph I referenced above, the term “distant reading” seems to have been frequently used around 1950. To see this in greater detail, I extended my time frame to examine texts in English between 1900 and 2008. It seems that the term “distant reading” was relatively popular between 1935 and 1955 (this can be seen in the graph below). Surely, the people using “distant reading” back then were not referring to it in the way that we are today. However, the Ngram processor doesn’t know this. Someone analyzing information with Ngram could very easily be tripped up by this kind of ambiguous data.

I think that distant reading tools like Voyant and Ngram could be great tools to use in the classroom, especially when summing up a literary era. Have students make predictions about a certain literary era before reading texts: common themes, obstacles, world views, etc. Then, have students read 5 to 10 texts or excerpts of texts from that era. They would be expected to close read these texts. After their close reading, they would be asked to identify what they believe to be the commonalities among texts of the period. Once the students are finished reading, have them enter a large sample of texts from this period into one of these big data processors. Using the word repetition the tool surfaces, have the students reevaluate the common themes they predicted. Ask them: how does distant reading compare to close reading? From which did they gain the best understanding of the literary era? How do the two complement each other? Then, the students can enter some of the themes together into Google Ngram and see how they have evolved, connected, or opposed each other over time. Both of these tools could be wonderful resources in the classroom.

After reading about distant reading and getting some practice putting it to use, I view it as a nice companion to traditional close reading. As Joshua Rothman stated at the end of his article, “An Attempt to Discover the Laws of Literature,” “We can continue to read the old fashioned way. Moretti, from afar, will tell us what he learns” (Rothman). I think that to analyze literature solely from afar is to strip it of everything that makes it worthy of analysis at all. However, the big data information that we are discovering as we analyze large corpora of texts can be useful in its ability to uncover new trends in past periods of not only literature but also human culture.

Readings Referenced:

Capozzi, Mae. “Reading from a Distance.” Blog. https://readingfromadistance.wordpress.com/

Cohen, Patricia. “Analyzing Literature by Words and Numbers.” The New York Times. 3 Dec. 2010. Web. 27 July 2015.

Rothman, Joshua. “An Attempt to Discover the Laws of Literature.” The New Yorker. 20 March 2014. Web. 27 July 2015.

Schulz, Kathryn. “What is Distant Reading?” The New York Times Sunday Book Review. 24 June 2011. Web. 27 July 2015.

Underwood, Ted. “How Not to Do Things with Words.” Blog. 25 Aug 2012. Web. 27 July 2015.

Monday, July 27, 2015
