Building an English Profanity and Offensive Word List
As part of Google Summer of Code 2019, I built an annotated list of English terms related to the notions of profanity, hatefulness and offensiveness. In this post, I describe the different steps I took to build it up and annotate it.
Related code and data can be found here: GSoC 2019 - English Profanity and Offensive Word List
The first step was to determine which technique to use to generate a list that would reflect the aspects being researched. I chose to use 2 comparable corpora whose main difference would be the presence or absence of offensive and hateful language. A technique called pointwise mutual information (PMI) can then be applied to see which words are more typical of one corpus relative to the other. It is good at ignoring common and (usually) uninteresting words such as “the”, “an”, etc. while singling out the terms typical of a given corpus.
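To make the comparison concrete, here is a minimal sketch of how such a PMI ranking can be computed, assuming both corpora are available as flat lists of tokens (the function and variable names are mine, not from the project’s code, and the exact PMI formulation used there may differ):

```python
import math
from collections import Counter

def pmi_scores(target_tokens, reference_tokens, min_count=10):
    """Score each word by PMI between the word and the "target corpus" label:
    PMI(w, target) = log2(P(w | target) / P(w)).
    High scores mean the word is over-represented in the target corpus."""
    target_counts = Counter(target_tokens)
    reference_counts = Counter(reference_tokens)
    n_target = sum(target_counts.values())
    n_total = n_target + sum(reference_counts.values())

    scores = {}
    for word, count in target_counts.items():
        total = count + reference_counts[word]
        if total < min_count:  # frequency cut-off to keep rare-word noise down
            continue
        scores[word] = math.log2((count / n_target) / (total / n_total))
    return scores

# Keep the words most typical of the target corpus, e.g. the top 2000:
# top_words = sorted(scores, key=scores.get, reverse=True)[:2000]
```

A word that is equally frequent in both corpora gets a score near 0, while a word appearing almost exclusively in the target corpus gets a strongly positive score, which is exactly the filtering behaviour described above.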
To that end, I used textual data collected from the controversial social media platform gab.com. Gab came into the public spotlight in the aftermath of the Tree of Life shooting, when it was reported that the shooter was a Gab user and that the platform might have played a role in his radicalization. Manually going through a couple of posts quickly gives one a hint of why such claims were made, as the platform is filled with openly racist, conspiracist, anti-Semitic and overall hateful and toxic content. It thus seemed like a “good” place to start. I manually selected a few dozen openly racist and hateful users to be scraped, in the hope that they would indeed reflect the toxic language I was looking for. In total, around 250,000 posts were retrieved from approximately 60 users over a span of about two and a half years (from late August 2016, when Gab first came online, until late February 2019). URLs and usernames were then stripped from the data, as they convey no useful information for our task and are not privacy-friendly.
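For illustration, a cleaning step along these lines can be done with two regular expressions; the patterns below are an assumption on my part, as the post does not show the actual cleaning script:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")  # user mentions of the usual @handle form

def clean_post(text):
    """Remove URLs and user mentions, then collapse leftover whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())

print(clean_post("@someuser check this out https://example.com lol"))
# -> "check this out lol"
```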
The second step was to collect a reference corpus against which our toxic-language corpus could be compared. The key point when applying such techniques is to find data that is as close as possible to the target corpus, except for the one dimension being researched, profanity and offensiveness in this case. I thus collected data from another social media platform, namely Reddit. The advantage here is that mere Internet slang is less likely to show up in the comparison of the two corpora, which could have been a problem if the reference corpus had been, e.g., the Brown corpus, which is much too standard for our purpose. A downside, however, is that Reddit, while more mainstream, moderate and moderated than Gab, is not free from toxic content either, which could let some offensive language slip through. Still, the platform has recently been taking action against hateful and toxic content by banning posts, users and even entire subreddits deemed inappropriate, so Reddit still felt like a good reference in contrast to Gab. Reddit posts were simply retrieved from a public archive, and there was more than enough data to match that of Gab.
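The post does not name the archive, but for a Pushshift-style monthly comment dump (bz2-compressed, one JSON object per line) the retrieval could look roughly like this; the file format and field names are assumptions for the sake of illustration:

```python
import bz2
import json

def iter_reddit_comments(dump_path, limit=None):
    """Yield comment texts from a Pushshift-style monthly dump
    (bz2-compressed, one JSON object per line)."""
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump:
        for i, line in enumerate(dump):
            if limit is not None and i >= limit:
                break
            body = json.loads(line).get("body", "")
            if body and body not in ("[deleted]", "[removed]"):
                yield body
```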
Once both corpora had been put together, we ran a PMI analysis with Gab as the target corpus and kept the top 2000 words, ranked by PMI score. It yielded rather intuitive results, with “Jew”, “nigger”, “kike” (an offensive word for “Jew”) and other niceties showing up at the very top. However, a lot of non-offensive and semi-related terms also showed up, such as “America”, “white” or “election”, which would be interesting for topic modeling but did not entirely fit our purpose. Of course, it also output a lot of entirely unrelated words that would need to be cleaned up during the annotation phase. We thus needed another way to enrich the list.
The idea was to use lexical proximity between words represented as embeddings in a high-dimensional vector space. When applied to a sufficient amount of data, this technique can deliver surprisingly intuitive results. Given that words are represented in a mathematical form, they can be added to and subtracted from one another, such that “Merkel” – “Germany” + “France” yields “Macron”. Needless to say, such models are powerful tools to capture all sorts of lexical relationships. For our purpose, we trained a basic word-embedding model on our Gab corpus (a sketch follows below). However, lexical relationships do not jump out of such a model on their own: I needed seed words from which to compute lexical proximity within the embedding space. Those were found heuristically by searching the web for lists of insults and rude language in general. We used 2 lists: a list of insults (thus excluding merely “rude” words such as “fucking”, as it is not an insult) put together collaboratively in “Wiki” format, and an “Offensive/Profane Word List” by Luis von Ahn (creator of the language-learning app Duolingo, among other things).
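As an illustration, training such a model with gensim (4.x) might look like the following; the hyperparameters are placeholders, as the post does not specify the ones actually used:

```python
from gensim.models import Word2Vec

# `gab_sentences` is assumed to be the cleaned Gab corpus as a list of
# token lists, e.g. [["some", "tokenised", "post"], ...].
model = Word2Vec(
    sentences=gab_sentences,
    vector_size=100,  # dimensionality of the embedding space
    window=5,         # context window size
    min_count=5,      # ignore words seen fewer than 5 times
    workers=4,
)

# The classic analogy check mentioned above:
# model.wv.most_similar(positive=["merkel", "france"], negative=["germany"])
```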
Each seed word was itself added to the final list before being compared to the other words in the vector space, using cosine distance as the measure of similarity. The 10 most similar words were kept, and their respective distances to the seed word were added to those of any previous retrievals of the same word. For instance, using “nigger” as a seed word yields “niggar” as a very similar one; if “niggar” had previously been retrieved, the current cosine distance between “nigger” and “niggar” was added to that of the previous occurrence of “niggar”. In the end, we had generated a list of words mapped to accumulated cosine distances that could be sorted to retrieve the words most commonly associated with the insults and other offensive words from our 2 original lists. Adding up the cosine distances of each retrieved word proved useful because the Gab vector space was trained on a rather small amount of data for such a task (250,000 posts), and this cosine-distance-based retrieval technique also generated noise and irrelevant data.
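A minimal sketch of that expansion loop, assuming the gensim model from above; note that gensim’s most_similar returns cosine similarities rather than distances, so the accumulated score below plays the same role as the summed distances described in the text:

```python
from collections import defaultdict

def expand_seeds(model, seed_words, topn=10):
    """Accumulate, for every neighbour of every seed word, the cosine
    similarities returned by most_similar(). Words that sit close to many
    seeds end up with the highest accumulated score."""
    accumulated = defaultdict(float)
    for seed in seed_words:
        if seed not in model.wv:
            continue  # seed word absent from the Gab vocabulary
        for word, similarity in model.wv.most_similar(seed, topn=topn):
            accumulated[word] += similarity
    # highest accumulated score first
    return sorted(accumulated.items(), key=lambda item: item[1], reverse=True)
```

Sorting by the accumulated score pushes words that recur across many seed neighbourhoods to the top, which is what makes the noise from the small training corpus manageable.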
Each word in the list was then annotated along 2 axes/dimensions: one representing the level/degree of offensiveness (from 0 to 4) and another reflecting the nature of, or topic associated with, said word (racial, political, religious, etc.), based on previous work by CLiPS on German and Dutch. Topics were not mutually exclusive, and multiple topics could be associated with one word. Reviewing the words manually, one by one, was also the opportunity to get rid of irrelevant ones. However, it must be noted that the boundary between relevant and irrelevant can sometimes be fuzzy, as sensationalist or controversial words (“refugee”, “supremacist”, etc.) can also prove useful. Thus, when in doubt, the word remained in the list: deleted words cannot be retrieved, while irrelevant words can always be removed later if necessary.
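For illustration, an annotated entry under that scheme might look like this; the field names and the values assigned here are made up for the example, not taken from the released data:

```python
# Hypothetical annotation entries: offensiveness on a 0-4 scale,
# topics as a non-exclusive list.
annotated = [
    {"word": "supremacist", "offensiveness": 2, "topics": ["political", "racial"]},
    {"word": "refugee",     "offensiveness": 0, "topics": ["political"]},
]
```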
I hope this post was enjoyable to read and gave a good overview of how to filter out specific data by comparison. I think the method described above works well for high-resource languages like English, given the quantitative nature of the techniques involved. Should it be transposed to other (and more specific) topics, or to languages less represented online, more precise techniques should be considered.