Rob Hardy on books



Understanding Literature via Computation



Rob Hardy


Everyone knows that computers are changing everything. You might not expect that they are changing the way professors of English analyze literature, but they are, and the change is leading to unexpected insights. Jockers's book is a sort of manifesto for the use of enormous databases to reveal trends and meanings that could not be perceived before. Jockers says that the emerging field has come to be called "digital humanities" after being known as "humanities computing." Concordance work preceded the use of digital computers, but computers have brought powerful ways to seek answers. "The humanities computing/digital humanities revolution has now begun," writes Jockers, "and big data have been a major catalyst. The questions we may now ask were previously inconceivable, and to answer these questions requires a new methodology, a new way of thinking about our object of study."




It certainly must be new for the English professors who have previously done their work without computers. Jockers gives the example of "close reading," by which a scholar might intensively study a mere chapter of a work, or a few works, examining antecedents, influences, themes, and so on. Such studies remain valuable, of course, but even the most literate professor can read only relatively few texts, and though the conclusions drawn from them might be sound, there is a lot more information out there. "Like it or not, today's literary-historical scholar can no longer risk being just a close reader: the sheer quantity of available data makes the traditional practice of close reading untenable as an exhaustive or definitive method of evidence gathering. Something important will inevitably be missed." There is bound to be reluctance on the part of what I picture to be tweedy professorial types with patches on the elbows of their jackets. Jockers seems well able to bridge the gap between nineteenth-century literature and the programming and statistics needed for the sort of data mining he describes here. Anyone who can write explanations like, "To calculate this relationship, we can use the Pearson correlation coefficient formula, which takes the covariance of two variables and divides by the product of their standard deviations" is not just a specialist in literature.
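
For readers curious what that sentence amounts to in practice, here is a minimal sketch in Python of the calculation Jockers describes: the covariance of two variables divided by the product of their standard deviations. It is my own illustration, not code from the book, and the function name `pearson` and the sample numbers are assumptions made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: the average product of the paired deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Population standard deviations of each variable.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (sd_x * sd_y)


# Hypothetical values, invented for illustration only.
series_a = [1.0, 2.0, 3.0, 4.0]
series_b = [2.0, 4.0, 6.0, 8.0]
print(pearson(series_a, series_b))
```

A result near 1 means the two series rise and fall together, a result near -1 means one rises as the other falls, and a result near 0 means there is no linear relationship.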




One of the surprising findings described here is that little words make a great difference. A writer might have to decide, perhaps repeatedly, whether to describe a sunset as "beautiful" or as "magnificent." Such choices take conscious reflection, but they tell less than does the use of little, automatic words like "the," "of," and "it." These words, outside of our conscious control, are, says Jockers, like the "tell" exhibited by a poker player, and they give clues to authorship. An almost perfect lab for testing such distinctions is The Federalist Papers, which consists of eighty-five different papers, fifteen of which are of unclear authorship, but all of which had to be written by James Madison, Alexander Hamilton, or John Jay. The authors were male contemporaries who wrote in ways typical of the society in which they all took part, and they wrote about related subjects, but they have differences that enable attribution of the papers that carry no name. Hamilton, for instance, uses "a" much more than the other two, and uses "to be" more often than the "is" used by the others. It would be excruciating work to tally up such tiny, seemingly insignificant differences by hand, but that's the sort of work a computer can do with ease, giving insight into authorship that is available no other way.
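
The tallying itself is simple enough to sketch. The following Python fragment is my own illustration, not the method used in the book: it counts how often a handful of small words occur per 1,000 words of a text. The word list, the function name `function_word_rates`, and the sample sentence are all assumptions chosen for the example, and a real attribution study would compare such profiles across many texts with proper statistical tools.

```python
import re
from collections import Counter

# Illustrative list only; an actual study would choose its words with care.
FUNCTION_WORDS = ("the", "of", "it", "a", "is", "to", "be")


def function_word_rates(text, words=FUNCTION_WORDS):
    """Return occurrences per 1,000 tokens for each word in `words`."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: 1000.0 * counts[w] / len(tokens) for w in words}


# Made-up sample sentence, not a quotation from any of the papers.
sample = "The cat sat on the mat. It is a cat."
for word, rate in function_word_rates(sample).items():
    print(f"{word}: {rate:.1f} per 1,000 words")
```

Run over each paper of known authorship, a profile like this gives every author a fingerprint against which an unsigned paper can be compared.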




A computer cannot understand a novel the way we do, but that does not stop it from being able to spot styles. Various styles of novel have been analyzed, like the Bildungsroman (which deals with the formative years of the main character), the Newgate novel (having to do with prisons), or the Gothic novel. Bildungsroman novels, unsurprisingly, show an overrepresentation of words such as "little" and "young." It is less clear why they should use more of the word "like," but Jockers explains that there are frequent comparisons made between the protagonist's childhood experience and his adult world. (From David Copperfield, Jockers can even show that the frequencies of "like" and "little" decrease as the novel progresses.) Newgate novels have disproportionately low use of female pronouns, but also have a high use of the exclamation point (something Jockers notes is relatively absent from another type of work, the evangelical novel). Gothic novels have an abundance of "locative prepositions" ("over," "on," "within," etc.), which he says is a result of the genre's being "place oriented," presumably as the characters make their murky way along staircases and secret passageways within the ruined castles that are the standard setting for such works. Distinctions like these are helping literary scholars define what genres are out there and establish the foundations for classifying novels into them. Computers know nothing of plot or protagonists, but they have proven capable of defining genres by word use, and they can make charts called dendrograms that look like the family trees evolutionary biologists use, showing how one style descended from another.




Computer evaluation can look at differences in decades and in regional usage as well as authorship and style. One of the most peculiar and inexplicable findings here could have been found in no other way. British authors use the word "the" less than American ones; sometimes the difference is obvious in a phrase like "I have to go to hospital" versus "I have to go to the hospital." You can even set your computer to count the frequency of the word "the," and you can thereby get a good prediction about whether the text is British or American. It is a little surprising that the frequency of usage of "the" varies from year to year. What is completely surprising is that though the British use fewer instances of "the" than do Americans, if the rate goes up or down over time, it goes up or down in both British and American usage; the changes are parallel. "The" is not a fad word that comes and goes with colloquialisms, but a trivial word that is used unconsciously. It is as if British and American writers made a collaborative effort to modulate the frequency of "the" together. Jockers has tried hard to explain how the consistent parallel changes have happened, but confesses, "My conclusion regarding this phenomenon is quite literally 'to be determined.'"




Jockers understands that there are limitations to this sort of evaluation of literature, and jokes about how "more controversial and objectionable would be an argument along the lines of 'Moby Dick is God, and I have the numbers to prove it.'" I thought also of the great expression of the difference between understanding literature and enjoying it, in the description in Catch-22 of the character Clevinger: "He knew everything there was to know about literature, except how to enjoy it." Computers with their word counts, endlessly sorted and refined, might help our understanding; if this understanding helps our enjoyment, let them keep crunching.



