Google N-Grams

Using Google’s large N-gram dataset, I set out to determine who has loved and hated whom across the whole history of literature.

In 2009, Google made their n-gram data available for public use.

That’s a pretty big deal.

Why? Because it’s a ridiculously awesome endeavor: Google scanned millions of books spanning the past 500 years, roughly 15% of all the books humankind has ever published!

That adds up to approximately 500 billion words for American English books alone!

Combine this with the books in other languages, and the Google N-gram dataset comes to tens (hundreds, even?) of terabytes, at least.

And it’s still growing as they scan more books.

Here’s a brief story of what I did with this data.

Client

Complex Networks, Professor Maximilian Schich

Responsibilities

Data Retrieval, Analysis, Cleanup, Parsing, and Visualization

The N-gram

In my research at UT-Dallas, I always seem to come back to the subjects of semiotics and language.

It’s not surprising that my most recent research project involved the Google N-gram dataset, which deals with words in context.

In case the information at the link above is too dry or complicated, allow me to briefly explain:

An N-gram is simply a contiguous sequence of words pulled from a text, “n” words long (a short code sketch follows the examples below).

For instance…

N-Gram Examples

1-gram

  • “you”
  • “Visualization”
  • “a”
  • “hello”
  • “awesome”

3-gram

  • “I love you”
  • “created data visualizations”
  • “a bee stung”
  • “hello is it”
  • “is ridiculously awesome”

5-gram

  • “I love you and your”
  • “I created data visualizations for”
  • “a bee stung me on”
  • “hello is it me you’re”
  • “This is ridiculously awesome sauce”
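To make this concrete, here’s a minimal Python sketch of n-gram extraction. It assumes plain whitespace tokenization, which is cruder than the tokenization behind the Google corpus, but it shows the idea:

    def ngrams(text, n):
        """Yield every contiguous run of n tokens from the text."""
        tokens = text.split()  # naive whitespace tokenization
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

    # Example: all 3-grams in a short sentence.
    print(list(ngrams("I love you and your dog", 3)))
    # ['I love you', 'love you and', 'you and your', 'and your dog']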

LOVE

“All You Need is Love”

The N-gram dataset piqued my curiosity, so I thought I’d do some sort of visualization experiment as a way of introducing myself to the data.

As a proof of concept, I thought it would be fun to find out who loved whom within the 500-year history of literature that Google has made searchable.

My instincts told me that I would otherwise be left with an enormous dataset, so I limited my parameters to just the following (a rough sketch of this filter appears after the list):

  • 3-grams
  • Proper Nouns
  • limited forms of the verb: “love”, “loved”, and “loves”
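Here is a rough Python sketch of the kind of filter those parameters imply (not my original script): it assumes each line of a Google 3-gram file carries the 3-gram in a tab-separated first column, and it treats a leading capital letter as a crude proxy for a proper noun.

    LOVE_FORMS = {"love", "loved", "loves"}

    def looks_like_proper_noun(word):
        # Crude proxy: capitalized first letter, lowercase remainder.
        return word[:1].isupper() and word[1:].islower()

    def love_triples(lines):
        """Yield (subject, verb, object) triples such as ('Romeo', 'loves', 'Juliet')."""
        for line in lines:
            ngram = line.split("\t")[0]  # assumed layout: the 3-gram first, counts after
            words = ngram.split()
            if len(words) != 3:
                continue
            subj, verb, obj = words
            if (verb.lower() in LOVE_FORMS
                    and looks_like_proper_noun(subj)
                    and looks_like_proper_noun(obj)):
                yield subj, verb, obj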

Process

On the advice of my professor, Dr. Maximilian Schich, I taught myself the Python programming language because of its useful libraries for natural language processing.

In the end, I never used any special libraries, such as NLTK, to do any textual parsing or n-gram creation – I pretty much coded everything myself, mostly for the challenge of it.

Ultimately, I was able to obtain my “love” dataset and visualize a complex network representing “Who loves Whom” in the history of literature.
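As a rough illustration (not the exact code I used), here’s how such triples could be turned into a directed graph with the networkx library and handed off to a graph layout tool:

    import networkx as nx

    def build_relation_graph(triples, relation="love"):
        """Build a directed graph where an edge means 'subject <relation> object'."""
        graph = nx.DiGraph()
        for subj, verb, obj in triples:
            if graph.has_edge(subj, obj):
                graph[subj][obj]["weight"] += 1  # repeated pairings become heavier edges
            else:
                graph.add_edge(subj, obj, weight=1, relation=relation)
        return graph

    # Hypothetical usage, with an example filename:
    # with open("googlebooks-eng-3gram-sample.txt") as f:
    #     love_graph = build_relation_graph(love_triples(f), relation="love")
    # nx.write_gexf(love_graph, "who_loves_whom.gexf")  # e.g. for layout in Gephi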

HATE

“Du hast mich!”

Once I had created the love network, I realized I was making good progress and was ahead of schedule, so I wondered what a network visualization of “hate” would look like.

With some minor changes to my original Python scripts, I was able to retrieve a dataset of 3-grams with “hate” as the link between two words.
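The change really can be that small; as a sketch (again, not the original script), the earlier filter generalizes to any verb by parameterizing the accepted forms:

    VERB_FORMS = {
        "love": {"love", "loved", "loves"},
        "hate": {"hate", "hated", "hates"},
    }

    def verb_triples(lines, verb="hate"):
        """Yield (subject, verb, object) triples for the chosen verb's forms."""
        forms = VERB_FORMS[verb]
        for line in lines:
            words = line.split("\t")[0].split()  # same assumed file layout as before
            if (len(words) == 3
                    and words[1].lower() in forms
                    and words[0][:1].isupper()
                    and words[2][:1].isupper()):
                yield tuple(words)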

Not long after, I created a visualization of the network showing “Who hates Whom” in the history of literature.

I was surprised to find that the “hate” network left me with far fewer connections than anticipated, although it’s nice to see that “hate” isn’t a word thrown around as loosely as “love” in literature.

Love & Hate

“With their powers combined…!”

Despite its small-ish size, the “Hate” network still reveals some rather interesting connections.

It also left me curious about what the “Love” and “Hate” networks would look like if they were put together as one system.

I moved forward and created my final visualization, “Love-Hate”, which combines both datasets and shows the interrelationships between “Love”, “Hate”, and the words connected to them both.
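As a rough sketch of the merge (assuming the two graphs were built with something like the build_relation_graph helper above), networkx can compose them while each edge keeps its “relation” attribute, so love and hate links stay distinguishable:

    import networkx as nx

    def combine_networks(love_graph, hate_graph):
        """Merge the two directed graphs into one 'Love-Hate' system."""
        # Note: for edges present in both graphs, nx.compose keeps the
        # attributes from the second argument (here, hate_graph).
        combined = nx.compose(love_graph, hate_graph)
        # Names appearing in both networks are the interesting overlap.
        shared = set(love_graph) & set(hate_graph)
        print(f"{len(shared)} names appear in both the love and hate networks")
        return combined

    # Hypothetical usage:
    # love_hate = combine_networks(love_graph, hate_graph)
    # nx.write_gexf(love_hate, "love_hate.gexf")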

Conclusion

Overall, this was a really fun little project.

Not to mention, the research methodology involved was a huge learning experience – I’m confident that my next Google N-gram project will take a fraction of the time this first series did.

If you’re interested in reading about the nitty-gritty, feel free to take a look at the paper I wrote on the subject, Visualizing Complex Relationships with Google N-grams.

In the meantime, I’ll still be exploring projects involving words…