Pierre Voué's blog

Word embeddings trained from 4chan and 8chan data

In this post, we will explain how we collected data from the imageboard forums 4chan and 8chan and how we trained word embeddings from that data, and we will present an experiment illustrating their academic potential.

Related code and files may be found here: GSoC 2019 - 4chan and 8chan Word Embeddings

4chan and 8chan are either popular or obscure corners of the Internet, depending on your knowledge of the Internet ecosystem. They now and then attract attention from media outlets due to their potential ties with certain shootings and terrorist attacks, most often associated with white supremacism and neo-Nazism. Most recently, 8chan was associated with the Christchurch shooter Brenton Tarrant, who allegedly posted a livestream link and a manifesto prior to his attack. The platform made the headlines again with the El Paso shooting and, as of writing (22/08/2019), has been definitively shut down in the aftermath of the turmoil it sparked. We therefore thought it fit to gather data about these controversial platforms so as to be able to perform sound research on them.

Both platforms are very similar: they allow their users to create “threads” on a “board” (a sub-forum centered around a certain topic) and to post content anonymously, to which other users can respond. Their main difference lies in that 4chan has a set number of boards dedicated to different subject matters, whereas 8chan allows users to create their own boards around topics of their choosing, leading to a much higher number of boards, but also to much sparser content. Because of their structural similarity, however, they share boards centered around the same topics, and we chose to investigate the so-called “/pol/” board, i.e. the board dedicated to discussing (international) politics. On both forums, /pol/ is a popular board known for hosting what we could call “toxic” content: racist and fascist posts along with a wide variety of otherwise doubtful content, the virulence of which is probably fueled by the ‘absolute’ anonymity allowed on the platform.

We thus proceeded to collect the data from the /pol/ board of both platforms. Each platform possesses an API through which thread- or board-related data can be requested programmatically. However, another specificity of those platforms is that they regularly clean up older content and only retain the most recent threads, so that, despite a limited archiving mechanism on the site itself, the content available through that channel is too limited for the purpose of training sufficiently good word embeddings. As a consequence, we looked for other means of gathering the relevant data. Luckily, multiple sites are dedicated to archiving the content posted on 4chan, though we have not managed to find one for 8chan. For 4chan, we found archive.4plebs.org, which reaches back to late 2013; concerning 8chan, we resolved to collect the little information available on the site itself. 4plebs has not set up any API, so we decided to collect the data through scraping.
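
As an illustration, here is what programmatic access can look like in Python: the snippet builds the URL of a single thread in 4chan's public read-only JSON API and pulls the comment bodies out of a decoded thread payload. The endpoint layout and the `posts`/`com` field names follow 4chan's documented API, but treat this as a sketch rather than our exact crawler.

```python
def thread_url(board: str, thread_no: int) -> str:
    """URL of the read-only JSON dump of a single 4chan thread."""
    return f"https://a.4cdn.org/{board}/thread/{thread_no}.json"

def extract_comments(thread_json: dict) -> list:
    """Pull the (HTML) comment bodies out of a decoded thread payload."""
    # each entry under "posts" is one post; "com" is absent for empty posts
    return [post.get("com", "") for post in thread_json.get("posts", [])]

# minimal hand-made payload mimicking the API's shape
sample = {"posts": [{"no": 1, "com": "first post"}, {"no": 2}]}
comments = extract_comments(sample)
```

Fetching `thread_url("pol", <thread number>)` with any HTTP client and feeding the decoded JSON to `extract_comments` yields one string per post.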

The collection of 8chan data was rather quick since, as mentioned above, the site does not host a large amount of data at any given time and seems to attract fewer users than 4chan. Concerning 4chan, however, the quantity of data available for /pol/ alone is rather impressive. We set our crawlers running from the beginning of July until the end of July and collected approximately 30 million entries spanning 6 years (between 2013 and 2019) for 4chan and 8chan combined. However, given our crawling scheme, which let 2 scrapers run in opposite directions (i.e. from past to present and from present to past), some data is missing for the years 2015 and 2017.

Once the data was collected, we cleaned it through some pre-processing steps, including removing URLs and reposts, and tokenized it. We reflected for some time on how best to approach the training of the word embeddings: should we remove or keep stop words (common words, e.g. “the”, “about”, “on”, that might be detrimental to certain tasks), what window size to use, should we use CBOW or skip-gram, etc. We also had to decide which package to work with, and we elected to go with gensim, a popular Python library optimized for the training of dense word embeddings. It appeared that the default parameters were rather well-suited to our task, and we stuck to them.

We then set up a toy experiment to show a potential use of those embeddings: we compared the distance between two given words in the 4/8chan vector space with the distance between the same two words in another vector space trained on Reddit data. We won’t go through the whole description of the training of the Reddit word embeddings again. Let us simply note here that we collected the Reddit data from the archiving site redditsearch.io. The words we chose to put under scrutiny (henceforth ‘target words’) are the first 50 words of the English Profanity and Offensive Words (POW) list, which we generated and annotated as part of another project of this GSoC. The words against which they were compared in each vector space were very basic ones, conditioned on their part-of-speech tag (ADJ, NOUN or VERB), to give us a basic idea of the different representations of the target words. Each target word was compared against 2 words of the same grammatical category, a positive and a negative one: for adjectives, we had ‘good’ and ‘bad’; for nouns, ‘human’ and ‘monster’; and for verbs, ‘love’ and ‘hate’.

Let us take 2 target words to illustrate how this worked: ‘jew’ and ‘communist’. As a noun, ‘jew’ was compared to both ‘human’ and ‘monster’ in both vector spaces (making 4 comparisons in total) using their cosine similarity. The results are as follows:

‘human’ in Reddit: 0.27 VS ‘human’ in 4/8chan: 0.08
‘monster’ in Reddit: 0.32 VS ‘monster’ in 4/8chan: 0.31

For the target word ‘communist’ (ADJ):

‘good’ in Reddit: 0.12 VS ‘good’ in 4/8chan: -0.03
‘bad’ in Reddit: 0.21 VS ‘bad’ in 4/8chan: 0.04
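
These comparisons boil down to computing cosine similarity between word vectors, once per vector space. A self-contained sketch, with tiny made-up 3-dimensional vectors standing in for the real embeddings of each space:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# made-up vectors standing in for two separately trained embedding spaces
reddit_space = {"jew": [0.3, 0.8, 0.1], "human": [0.4, 0.7, 0.2]}
chan_space = {"jew": [0.9, 0.1, 0.0], "human": [0.1, 0.9, 0.3]}

# same word pair, one score per vector space
reddit_score = cosine(reddit_space["jew"], reddit_space["human"])
chan_score = cosine(chan_space["jew"], chan_space["human"])
```

Note that the absolute numbers are not strictly comparable across separately trained spaces (each training run has its own geometry), which is one caveat of this setup.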

Evidently, the toy experiment above has a lot of methodological shortcomings, but it is not intended as academic work, for now. The aim was to show a possible use of those embeddings and trends that could be explored in the language representations of each community. We hope this post gave a clearer view of the process of gathering data from platforms such as 4chan and 8chan, as well as of the use that can be made of the word embeddings resulting from such data.

French Profanity and Offensive Word List Constitution

For the second project we worked on during GSoC 2019, we decided to transpose the work we had done before, i.e. we replicated the generation and annotation of a list of profanity and offensive words (POW) for the French language. We should note that, despite the designation of such lists, they also encompass the potentially polarizing dimension of certain words in addition to the offensive and profane one. Below, we will explain the data and techniques leveraged to generate the list.

Related code and files may be found here: GSoC 2019 - French Profanity and Offensive Word List

The first step to find POW is to find a dataset/corpus containing significant quantities of them. We thus started looking for online sources that fit that description and ended up with 2 paths to explore. The first one was a specific section of the French-speaking videogame website “jeuxvideo.com” (JVC) called “Blabla 18-25 ans” (literally “Chitchat 18-25 years”). Even though it might sound like a benign and perfectly innocent forum, it has gained attention over recent years for being a rather toxic part of the whole forum, one teeming with Pepe the Frog gifs, trolls, and even straight-up right-wing or Islamist radicals. French newspapers have been reporting on this phenomenon for a few years, linking such online activity to political activism and the recent success of the French populist right-wing party “Front National” (now rebranded “Rassemblement National”). Manual exploration of the forum also revealed that the toxicity of 18-25 had carried over to the sub-forum dedicated to discussing politics. Both of those sub-forums seemed like promising places to find POW and words indicative of polarized content.

The second identified source of data was a French-speaking board of the imageboard 8chan. 8chan is similar to the near-eponymous 4chan, with the difference that users are allowed to create their own new boards on the site, much like users can create subreddits on Reddit. As a consequence, 8chan is full of niche boards centered around specific topics, communities or languages. One in particular came to our attention while we were browsing the Internet on the lookout for extremist websites and forums. While exploring the reactionary, identitarian and right-wing extremist website “Démocratie Participative” (“Participative Democracy”), we noted that an 8chan board called “dempart” was featured on the site. After exploring the board for some time, we noticed it was fit for our task both in terms of data quantity and quality.

After identifying those 2 sources, we proceeded to gather data from them by scraping. From JVC, we arbitrarily decided to scrape around 200 pages from the 18-25 sub-forum as well as from the political one; collecting all the data present there would have yielded an absurd amount, since the site has been active for more than 10 years. The 8chan board “dempart” was, for its part, fully scraped, as it did not hold big volumes of content. Out of curiosity, we also tried looking for the hashtag “#dempart” on Twitter; it yielded posts discussing genuine participative democracy, with no ties to the racist group of the same name.

We then used the same technique as for the English POW list, i.e. the statistical metric of pointwise mutual information (PMI), to filter the relevant words from a target corpus compared to a reference corpus. Our reference corpus was one of 88,000 French text messages published in an open-source format to reflect colloquial French. We noticed, however, that using only the dempart data from 8chan yielded more relevant results than also including the JVC 18-25 data, so we ended up not using the latter.

Finally, we also included in the general list smaller lists of less common insults sourced from heuristic Internet searches. We manually reviewed them to ensure they were fit for our purpose before including them. Those sources include cruciverbists’ (crossword-puzzle enthusiasts’) lists as well as one that used to be present in Android phones’ dictionaries with a special flag for words suspected to be offensive.

English Profanity and Offensive Word List Constitution

As part of Google Summer of Code 2019, I undertook the constitution of an annotated list of English terms related to the notions of profanity, hatefulness and offensiveness. In this post, I will describe the different steps taken towards building it up and annotating it.

Related code and data can be found here: GSoC 2019 - English Profanity and Offensive Word List

The first step was to determine what technique to use to generate a list that would reflect the aspects being researched. My choice was to use 2 comparable corpora whose main difference would be the presence or absence of offensive and hateful language. A technique called pointwise mutual information (PMI) can then be applied to see which words are more typical of one corpus relative to the other. It is good at ignoring common and (usually) uninteresting words such as “the”, “an”, etc. while singling out terms typical of a given corpus.
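
One simple way to operationalize such a score, sketched here in Python, is a PMI-style log-ratio of a word's relative frequency in the target corpus to its relative frequency in the combined corpora (not necessarily our exact formulation; the two word lists are toy stand-ins):

```python
import math
from collections import Counter

def pmi_scores(target_tokens, reference_tokens):
    """PMI-style score of each target-corpus word: log2 of p(w | target) / p(w)."""
    target = Counter(target_tokens)
    combined = target + Counter(reference_tokens)
    n_target, n_combined = sum(target.values()), sum(combined.values())
    scores = {
        w: math.log2((c / n_target) / (combined[w] / n_combined))
        for w, c in target.items()
    }
    # highest scores = words most typical of the target corpus
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = pmi_scores(
    ["slur", "slur", "the", "the"],   # stand-in "toxic" target corpus
    ["the", "the", "the", "nice"],    # stand-in reference corpus
)
```

Common words like “the” score near or below zero because their relative frequency is similar in both corpora, while target-specific words rise to the top.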

To that end, I used textual data collected from the controversial social media platform gab.com. Gab came into the public spotlight in the aftermath of the Tree of Life shooting, when it was reported that the shooter was a Gab user and that the platform might have played a role in his radicalization. Manually going through a couple of posts can quickly give one a hint of why such claims were made, as the platform is filled with openly racist, conspiracist, anti-Semitic and overall hateful and toxic content. It thus seemed like a “good” place to start. I manually selected a few dozen users that were openly racist and hateful to be scraped, in the hope that they would indeed reflect the toxic language I was looking for. In total, around 250,000 posts were retrieved from approximately 60 users over a span of 3 years (from late August 2016, when Gab first came online, until late February 2019). The data was cleaned of URLs and usernames, as they don’t convey useful information for our task and are not privacy-friendly.

The second step was to collect a reference corpus against which our toxic-language corpus could be compared. The main point when applying such techniques is to find data that is as close as possible to the target corpus except for the one dimension being researched, profanity and offensiveness in this case. I thus collected data from another social media platform, Reddit. The advantage here is that mere Internet slang would be less likely to show up after the comparison of both corpora, which might have been a problem if the reference corpus had been, e.g., the Brown corpus, which is much too standard for our current purpose. A downside, however, is that Reddit, while more mainstream, moderate and moderated than Gab, is not free from toxic content either, which could lead to some offensive language slipping through. Yet the platform has recently been taking action against hateful and toxic content by banning posts, users and even entire subreddits deemed inappropriate, so Reddit still felt like a good reference in contrast to Gab. Reddit posts were simply retrieved using a public archive, and there was more than enough data to match that of Gab.

Once both corpora had been put together, we applied a PMI analysis with Gab as the target corpus and kept the top 2,000 words (ranked by PMI score). It yielded rather intuitive results, with “Jew”, “nigger”, “kike” (an offensive word for “Jew”) and other niceties showing up at the very top. However, a lot of non-offensive and only semi-related terms also showed up, such as “America”, “white” or “election”, which would be interesting for topic modeling but did not entirely fit our purpose. Of course, the analysis also output a lot of entirely unrelated words that would need to be cleaned up during the annotation phase. We thus needed another way to enrich the list.

The idea was to use lexical proximity between words represented as embeddings in a high-dimensional vector space. When applied to a sufficient amount of data, this technique can deliver surprisingly intuitive results. Given that words are represented in a mathematical form, they can be added and subtracted to and from one another, such that “Merkel” – “Germany” + “France” yields “Macron”. Needless to say, such models are powerful tools to capture all sorts of lexical relationships. For our purpose, we trained a basic word-embedding model on our Gab corpus. However, lexical relationships don’t jump out on their own, and I needed seed words with which to compute lexical proximity within the embedding space. Those were found heuristically by searching the web for lists of insults and rude language in general. We used 2 lists: a list of insults (thus excluding “rude” words such as “fucking”, which is not an insult) put together collaboratively in “Wiki” format, and an “Offensive/Profane Word List” by Luis von Ahn (creator of the language-learning app Duolingo, among other things).
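
The “Merkel” – “Germany” + “France” example can be reproduced with nothing but vector arithmetic and cosine similarity. The 2-dimensional toy vectors below are hand-made so that the analogy holds by construction; a real model learns such geometry from data:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# hand-made 2-d vectors in which the analogy holds by construction
vectors = {
    "merkel": [1.0, 1.0],
    "germany": [1.0, 0.0],
    "france": [0.0, 1.0],
    "macron": [0.1, 1.0],
    "berlin": [0.9, 0.1],
}

def analogy(a, b, c):
    """Answer 'a - b + c = ?' by nearest cosine neighbour, excluding the query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))
```

Here `analogy("merkel", "germany", "france")` returns `"macron"` by construction; in a real model the same arithmetic works only insofar as the training data supports it.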

Each seed word was itself added to the final list before being compared to the other words in the vector space, using cosine distance as the means of comparison. The 10 most similar words were kept, and their respective distances to the seed word were added to those of previous retrievals. For instance, we used “nigger” as a seed word, yielding “niggar” as a very similar one; if “niggar” had previously been retrieved, the current cosine distance between “nigger” and “niggar” was added to that of the previous occurrence of “niggar”. In the end, we had generated a list of words mapped to accumulated cosine distances that could be sorted to retrieve the words most commonly associated with insults and other offensive words from our 2 original lists. Adding up the cosine distances of each retrieved word proved useful because the Gab vector space was trained on a rather small amount of data for such a task (250,000 posts), and this cosine-distance-based retrieval technique also generated noise and irrelevant data.
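
The retrieval loop can be sketched as follows. This is a simplified reimplementation of the idea rather than our exact code, and it accumulates cosine similarities instead of distances, which ranks neighbours equivalently:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def expand_seeds(seeds, vectors, topn=10):
    """Keep each seed's topn nearest words and accumulate their similarity scores."""
    accumulated = {}
    for seed in seeds:
        if seed not in vectors:
            continue  # a seed from the web lists may be absent from the corpus vocabulary
        neighbours = sorted(
            ((w, cosine(vectors[seed], v)) for w, v in vectors.items() if w != seed),
            key=lambda kv: kv[1],
            reverse=True,
        )[:topn]
        for word, score in neighbours:
            accumulated[word] = accumulated.get(word, 0.0) + score
    # words retrieved by several seeds accumulate a higher total
    return sorted(accumulated.items(), key=lambda kv: kv[1], reverse=True)
```

Words retrieved by several seed words bubble up to the top of the final ranking, which helps suppress the one-off noise a small training corpus produces.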

Each word in the list was then annotated along 2 axes/dimensions: one representing the level/degree of offensiveness (from 0 to 4) and another reflecting the nature or topic associated with the word (racial, political, religious, etc.), based on previous work by CLiPS on German and Dutch. Topics were not mutually exclusive, and multiple topics could be associated with one word. Reviewing the words manually, one by one, is also the opportunity to get rid of irrelevant ones. However, it must be noted that the boundary between relevant and irrelevant can sometimes be fuzzy, as sensationalist or controversial words (“refugee”, “supremacist”, etc.) can also prove useful. Thus, when in doubt, the word remained in the list: deleted words cannot be retrieved, while irrelevant words can always be removed later if necessary.
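
A record in the annotated list can be pictured as follows; the field names and example values are hypothetical, chosen only to mirror the two axes described above:

```python
# hypothetical layout of one annotated entry: a 0-4 offensiveness level
# plus a non-exclusive list of topic labels
annotations = {
    "supremacist": {"offensiveness": 2, "topics": ["political", "racial"]},
    "refugee": {"offensiveness": 1, "topics": ["political"]},
}

# topics are not mutually exclusive, so one word can carry several,
# and the list can be filtered per topic
political_words = [w for w, a in annotations.items() if "political" in a["topics"]]
```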

I hope this post was enjoyable to read and gave a good overview of how to filter out specific data by comparison. I think the method described above works well for high-resource languages like English, given the quantitative nature of the techniques involved. Should it be transposed to other (and more specific) topics, or to languages less represented online, more precise techniques should be considered.