Surviving Algorithmic Criticism and Who is Eleanor Hodgman Porter?

What Is Digital Humanities?
Stephen Ramsay’s Reading Machines describes digital humanities as a “scientific method joined to humanistic inquiry” (ix). The field is usually traced back to Roberto Busa in the late 1940s. Busa took the works of Thomas Aquinas and “deformed” the texts using a computer. Algorithmic criticism has come a long way since Busa’s experiment. Technological advancements allow for more sophisticated computerized tools that make comparing a wide corpus of texts a reality. Such tools broaden the way hypotheses and intuitions are envisioned, tested, and realized.

Methodology:
My focus was on twentieth-century American bestsellers by female authors. Most of the texts were available on Project Gutenberg. Although Project Gutenberg has subcategories of “war books” and non-fiction texts, my project focuses only on fiction. Within this range of bestsellers by women, I hoped to discover any precursors or distinct indicators of the first wave of the women’s rights movement. The corpus consists of bestsellers from 1900 to 1923:

[Screenshots: list of the bestsellers in the corpus, 1900 to 1923]

Several bestsellers were unavailable:
The Kingdom of Slender Swords – Hallie Rives
Lord Loveland Discovers America – C.N. and A.M. Williamson
The Montessori Method – Maria Montessori
The Hundredth Chance – Ethel M. Dell
The Dim Lantern – Temple Bailey

Algorithmic analysis entails a lot of preprocessing decisions. For instance, in addition to the unavailable texts left out of the corpus, duplicates were omitted. Even if a text was a bestseller in consecutive years, it was included in the corpus only once.

Lexomics:
Using the Lexomics tool, each text was scrubbed: uppercase letters were converted to lowercase, and punctuation, numbers, and stop words were removed. Using diviText, each text was kept as a single chunk. The year and the author’s name are included in each title so that the texts can be easily identified. Unfortunately, diviText did not allow all 76 texts (chunks) to be merged. Therefore, the dendrograms produced in treeView did not include all of the texts. Instead, the dendrograms are split into two groups: texts from 1900-1915 and texts from 1916-1923.
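A minimal Python sketch of this kind of scrubbing, for readers who want to see the steps spelled out. It is not the Lexomics tool's own code: the function name and the tiny stop-word list are hypothetical, and a real run would use a much longer list.

```python
import re

# Hypothetical, deliberately short stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "it"}

def scrub(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop digits and punctuation, and remove stop words."""
    text = text.lower()
    # Keep only unaccented letters, apostrophes, and whitespace;
    # every other character (digits, punctuation) becomes a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # Trim stray apostrophes left over from quotation marks.
    tokens = [tok.strip("'") for tok in text.split()]
    return [tok for tok in tokens if tok and tok not in stop_words]

sample = "The Year 1913: Pollyanna was GLAD, and it showed!"
print(scrub(sample))  # ['year', 'pollyanna', 'glad', 'showed']
```

Each choice here (which characters to keep, which words count as stop words) is exactly the kind of preprocessing decision that shapes the later results.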

[Figure 1]
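Dendrograms of this kind come from hierarchical clustering of word-frequency profiles. The sketch below shows the general idea in plain Python. The distance metric (Euclidean) and linkage (average) are assumptions chosen for illustration, since the settings used in Lexomics are not recorded here, and the three toy "texts" are invented.

```python
from collections import Counter
from itertools import combinations
import math

def relative_frequencies(tokens):
    """Map each word to its share of the text's total word count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def euclidean(freq_a, freq_b):
    """Euclidean distance between two word-frequency profiles."""
    vocab = set(freq_a) | set(freq_b)
    return math.sqrt(sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in vocab))

def average_linkage(cluster_a, cluster_b, profiles):
    """Mean distance over all pairs of texts drawn from the two clusters."""
    dists = [euclidean(profiles[x], profiles[y]) for x in cluster_a for y in cluster_b]
    return sum(dists) / len(dists)

def cluster(profiles):
    """Agglomerative clustering; returns the merges (a, b, distance) in order."""
    clusters = [(name,) for name in profiles]  # text names must be unique
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters and merge them.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        merges.append((a, b, average_linkage(a, b, profiles)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Invented toy texts, already scrubbed and tokenized.
texts = {
    "text_a": "glad glad game girl aunt".split(),
    "text_b": "glad game girl aunt aunt".split(),
    "text_c": "war ship sea captain sea".split(),
}
profiles = {name: relative_frequencies(tokens) for name, tokens in texts.items()}
for a, b, dist in cluster(profiles):
    print(a, "+", b, "at distance", round(dist, 3))
```

The order of the merges is what a dendrogram draws: texts joined early sit on neighboring branches, while a text that joins last forms an isolated clade.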

Several dendrograms were made in an attempt to compare all texts. Later bestsellers were merged with earlier bestsellers, and bestsellers from 1909 to 1918 were also grouped together. After the Lexomics experiments were complete, one dendrogram revealed an isolated clade made up of texts by several different authors from 1921-1923. In order to learn a little (and learn quickly) about this group of texts, a word cloud was created using Voyant Tools:

[Figure 2: Voyant word cloud of the isolated clade]

I found that focusing on an isolated clade could yield only the obvious result: similarities. Analyzing a group of texts that are, well, already grouped together, while excluding everything else, rules out any conclusion about what makes the isolated clade different in the first place.

Phase II:
In order to compare the isolated clade with the other texts, the topic-modeling tool was used to learn more about the content of the different texts. Unfortunately, my computer would not allow me to input the entire corpus into the topic-modeling jar. With a little assistance, and a higher-performing laptop, topic models were generated. The “output” of the experiment revealed that the scrubbing, or preprocessing, had been insufficient. Project Gutenberg’s own information, which is embedded in each text file, had been included in the data, along with an enormous number of character names. Each text had to be rescrubbed. Creating a “unique word list” makes identifying names simple. Going down the list, each name was scrubbed out of the texts.
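The rescrubbing described here can be partly automated. The following is a hedged sketch, not the procedure actually used in this project: it assumes the Project Gutenberg files carry the usual "*** START OF ..." and "*** END OF ..." marker lines (the exact wording varies between files), and the capitalization heuristic is only a way to shortlist likely names from the unique word list for manual review. Both function names are hypothetical.

```python
import re
from collections import Counter

# Marker lines commonly found in Project Gutenberg plain-text files.
# Assumption: wording differs between files, so check each text by hand.
START = re.compile(r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*", re.IGNORECASE)
END = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*", re.IGNORECASE)

def strip_gutenberg(raw):
    """Return only the text between the start and end marker lines, if present."""
    start = START.search(raw)
    end = END.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()

def name_candidates(text):
    """Count words that appear capitalized but never in all-lowercase form."""
    words = re.findall(r"[A-Za-z]+", text)
    lower_seen = {w for w in words if w.islower()}
    return Counter(w for w in words
                   if w[0].isupper() and w.lower() not in lower_seen)

raw = (
    "Title and license information\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK POLLYANNA ***\n"
    "Pollyanna was glad. The aunt saw Pollyanna and the game.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK POLLYANNA ***\n"
    "More license text\n"
)
body = strip_gutenberg(raw)
print(name_candidates(body).most_common())  # [('Pollyanna', 2)]
```

The heuristic will miss names that double as ordinary words and will flag some words that only ever begin sentences, so the list still needs a human reader.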

Phase III:
With the newly scrubbed texts, I went back to conducting the Lexomics experiments. This time, the texts were separated into 2 chunks. Again, diviText would not allow me to merge the entire corpus; therefore, the dendrograms were divided again. Surprisingly, the dendrogram gave different results from the initial experiment. The first isolated clade was no longer present; however, a new isolated clade was discovered. For reassurance, the texts were rescrubbed, reviewed, and re-dendrogrammed.
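Splitting a text into a fixed number of chunks is simple to express in code. This sketch assumes equal-sized chunks by word count; diviText's actual chunking options are not described here, and the function name is hypothetical.

```python
def chunk(tokens, n_chunks):
    """Split a token list into n_chunks consecutive pieces of near-equal size."""
    size, extra = divmod(len(tokens), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one leftover token.
        end = start + size + (1 if i < extra else 0)
        chunks.append(tokens[start:end])
        start = end
    return chunks

tokens = "mary marie kept her diary faithfully".split()
first_half, second_half = chunk(tokens, 2)
print(first_half)   # ['mary', 'marie', 'kept']
print(second_half)  # ['her', 'diary', 'faithfully']
```

With two chunks per text, a dendrogram can show whether the halves of one novel cluster with each other or with other books.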

[Figure 3]

The isolated clade from the third phase of experiments consisted of two bestsellers by Eleanor Hodgman Porter: Dawn (1919) and Mary-Marie (1920). Out of curiosity, I compiled all of Eleanor Porter’s bestsellers into treeView and created a new dendrogram. Porter’s Mary-Marie was on its own clade. Using Wordle, a word cloud was created for Mary-Marie:

[Figure 4: Wordle word cloud for Mary-Marie]

Again, using the topic-modeling tool, several topic models were created. In topics such as “growth,” “tradition,” and “family,” Eleanor Porter’s works were found to be “low-ranking.”
Perhaps Eleanor Porter’s works were isolated because her writing style was closer to that of male authors than to that of female authors. To test this, I compared Porter’s works with the works of three different male authors:

[Figures 5 and 6]

Porter’s 1916 bestseller, Just David, was grouped together with Charles Dickens’ novel, Great Expectations. On further research, I found that Just David is also about an orphaned boy. Regardless, it is interesting that content and style seem to converge.
Eleanor H. Porter died on May 21, 1920, the same year Mary-Marie was published. I then remembered a class discussion about Agatha Christie. Through algorithmic analysis, professors at the University of Toronto discovered that “Christie’s lexicon decreased with age,” a decline that may be among the “linguistic indicators of the cognitive deficits typical of Alzheimer’s disease” (The New York Times). Could the stylistic differences of Eleanor Porter’s Mary-Marie be attributed to a possible illness?
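One crude way to begin probing that question would be to compare vocabulary richness across Porter's novels. The sketch below computes a type-token ratio on a fixed-size sample. This is an illustrative proxy only, not necessarily the measure the Toronto researchers used, and the function name and sample sentence are hypothetical.

```python
def vocabulary_richness(tokens, sample_size):
    """Share of distinct words among the first sample_size tokens."""
    # A fixed sample size matters: longer texts repeat more words,
    # so raw ratios from texts of different lengths are not comparable.
    sample = tokens[:sample_size]
    if not sample:
        return 0.0
    return len(set(sample)) / len(sample)

tokens = "the glad game made the glad girl glad".split()
print(vocabulary_richness(tokens, 8))  # 0.625
```

A lower ratio in a late novel than in earlier ones would be a prompt for closer reading, not evidence of illness on its own.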

Who is Eleanor Hodgman Porter?

[Figure 7]

Although Eleanor Hodgman Porter published fifteen novels, six of which were bestsellers, and numerous short stories, information on Porter is surprisingly scarce. Most of what is available centers on her most famous novel, Pollyanna. The Dictionary of Literary Biography has a brief section on Eleanor Porter. Porter, using the pseudonym Eleanor Stewart, began her career writing short stories for women’s magazines. The biography, too, dwells on the popularity of Pollyanna and her other popular works. Investigating Eleanor Porter’s death is also difficult; an obituary record of Porter is unavailable. The search for Porter’s pseudonym came back with even bleaker results. So, who is Eleanor Hodgman Porter? My search for stylistic differences or distinct indicators of the feminist movement turned into a search for a prolific author, Eleanor Porter.

If I had more time to conduct my research, I would have loved to continue my search for Eleanor Porter. I would also be interested in doing additional experiments on Porter that include all of her works, rather than only her bestsellers.

Learning Through Progress:
Upon completing the text-based analysis of twentieth-century bestsellers by women, I learned several things about the process. First, preprocessing is very time consuming, but that’s not necessarily a bad thing. Although it was sometimes painful, taking the time to understand the thoughtfulness involved in algorithmic criticism enabled me to invest deeply in the work. There are no shortcuts! Which words should I scrub, or leave in? Are character names valuable for certain experiments? How should I chunk the texts? Each step and phase requires careful attention.

Moving forward allows you to identify and learn from your mistakes. For instance, deficient scrubbing, or preprocessing, became evident only once I progressed into the experimenting stage. The texts were filled with names, as well as Project Gutenberg information. Mistakes are a vital part of learning in the digital humanities, and like everything else, they are also inevitable. All you can do is go back; but by doing so, you progress forward with a keener eye.

The text-based analysis also helped me realize that we must expand the availability of digitized texts. Initially, I hoped to conduct my research on Asian American authors. However, digitized versions of their works were nearly nonexistent. Then I expanded my scope (or so I thought) to a variety of texts by minority women authors. The lack of availability forced me to focus on white female authors instead. And even then, I encountered availability issues. Algorithmic criticism is invaluable; therefore, it is of extreme importance to make a conscious effort to include minority authors and female authors within the literary discourse of the digital humanities. As the algorithmic approach continues to leave an undeniable imprint on literary analysis, we must bring minority and female authors along for the ride.