Properties of angry speech

Note: This post contains profanity

Sit down if you’re standing: There’s a lot of angry speech on the internet. There’s a lot of regular speech too. For exact meaning, the order and context of words are critical, but for general tone one can get pretty far just by looking at word choice.

I analyzed two text corpora here:

0. The “Blog Authorship Corpus” [1]

1. The text from an internet rant site.

“Rant” sites are an anonymous way for people to vent their anger online, the theory being that it is cathartic to express that anger in a safe setting. They also provide a nice dataset of text which is guaranteed to have much more anger than normal.

Methods: The blog corpus already had its formatting stripped. I lowercased all the text, removed punctuation, and simply counted the occurrences of each unique word. No spelling correction or stemming was done. Words in the Python Natural Language Toolkit (nltk) “stopwords” corpus were removed.
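The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the exact script used for the post: the function name `count_words` and the tiny `STOPWORDS` set are assumptions made so the example runs on its own. In practice the stopword list would come from nltk, e.g. `nltk.corpus.stopwords.words("english")`.

```python
import string
from collections import Counter

# Stand-in stopword list (an assumption for illustration). The post used the
# nltk "stopwords" corpus, e.g. nltk.corpus.stopwords.words("english").
STOPWORDS = {"i", "the", "a", "is", "this", "so", "be"}

def count_words(text, stopwords=STOPWORDS):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords, count."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(word for word in cleaned.split() if word not in stopwords)

counts = count_words("I hate this. I HATE the traffic, so much!")
print(counts.most_common(3))
```

Note that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters would need separate handling. As in the post, there is no stemming or spelling correction, so “hate” and “hates” are counted as different words.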

Wordclouds from each are shown below:

Word Cloud from Blog Corpus

Word Cloud from Rant Site

It’s a bit hard to interpret these, so here are bar charts from the top 20 words:

Top 20 words from Blog Corpus

Top 20 Words from Rant Site Corpus

See the difference? It took me a minute too. For the most part they are very similar. “fuck”, “fucking”, and “hate” show up in the rant corpus and not in the blog corpus, leading to the unsurprising conclusion that people use these words a lot more when they’re angry.

To better illustrate the difference, I ranked each word, took the difference between ranks, and plotted those which had the highest difference. For instance, if “fuck” was the 10th most commonly used word in the rant site, and 200th most commonly used word in the blog corpus, it would have a difference of 190. The words with the highest rank difference are shown below.
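As a rough sketch of that rank comparison, assuming word counts are held in `collections.Counter` objects: the function names `ranks` and `rank_differences`, the toy counts, and the choice to compare only words that appear in both corpora are all assumptions for illustration. The post does not say how words missing from one corpus were handled.

```python
from collections import Counter

def ranks(counts):
    """Map each word to its 1-based frequency rank (1 = most common)."""
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def rank_differences(rant_counts, blog_counts):
    """Blog rank minus rant rank, largest first, for words found in both corpora."""
    rant_rank, blog_rank = ranks(rant_counts), ranks(blog_counts)
    shared = rant_rank.keys() & blog_rank.keys()  # assumption: skip unshared words
    diffs = {word: blog_rank[word] - rant_rank[word] for word in shared}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)

# Toy counts, made up purely to exercise the functions.
rant = Counter({"people": 50, "hate": 40, "time": 30})
blog = Counter({"people": 60, "time": 45, "hate": 5})
result = rank_differences(rant, blog)
print(result)
```

A positive difference means the word ranks higher (is relatively more common) in the rant corpus than in the blog corpus. Tied counts get arbitrary but distinct ranks here, which is one of several reasonable ways to break ties.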

Difference in Rank Between Rant Corpus and Blog Corpus

The top differences are colloquialisms. For example, “be” is stripped out because it is a stopword, but “bee” is not. This could reflect a difference in the userbase of these two sites rather than the emotional state of the users. Both corpora contained “people” in their top twenty words, but “peeps” was much more frequently used in the rant site corpus.

Conclusions? I was surprised at how similar these two corpora were. It could be that bloggers are angrier than I give them credit for, or it could be that the hallmarks of angry speech are subtler than I expected. Or that people don’t bother to find creative ways to say “stupid fucking shit”.

-Jacob

[1] J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging. In Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.

    http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
