Sexism detection in the Russian language
Considering that this report is quite long, I would recommend reading it here, where it is split into chapters, which should make the reading process easier.
The repository of the project itself can be found here, and a description of the contents of the repository is given further down.
N.B.: I decided to avoid the 4-page-long arXiv publication template, for the sake of completeness and full documentation of the process. The references are given at the end of each subtopic.
In the course of work on the issue of "hate speech", two compilations were made, both of which may be useful for further research in this area. The first one is an attempt to systematize research on hate speech. This file will be updated.
The second one is a compilation of known open-source corpora on hate speech. The list includes more than ten of them. This file is not planned to be updated.
1. The definition of "hate speech" and the difficulties of the task.
Hate speech is difficult to define, and even after agreeing on some sort of definition, it still proves complicated to attribute something (a tweet or a message) to the hate speech category.
In the article "Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis" the authors noted that the agreement between the annotators of their hate speech corpus (measured with Krippendorff's alpha coefficient) ranged from 0.18 to 0.29, much lower than the recommended value of 0.8.
Even more important is that after the annotators were acquainted with a definition of hate speech, no significant improvement of the coefficient followed. Similar thoughts are found in the paper "Are You a Racist or Am I Seeing Things?", whose author faced a similar problem. In that article an attempt was made to compare the decisions of expert and amateur annotators. The mutual influence on attribution decisions was examined, and special attention was paid to the annotation of tweets in two different corpora (one of them was made specifically for the purposes of the article, the other for the article "Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter"). From how different the labels turned out to be, one can draw additional conclusions about the subjectivity of attribution attempts. The author claims that annotations made by amateur annotators should be accepted only if there is complete agreement among them. They also support the previous authors in the idea that the attribution process is difficult without intimate knowledge of the topic.
Similar ideas can be found in the article "Abusive Language Detection in Online User Content", where the authors conducted a so-called "Amazon Turk experiment": they hired several amateur annotators (not allowing each of them to annotate more than 50 text entities) and acquainted them with the guidelines used by the expert annotators. The agreement rate was acceptable only for the binary classification (a coefficient of 0.867), but it dropped significantly for the categorical task. This suggests that either more extensive training is needed, or that amateurs are much less suited to the annotation task.
Some of the problems in attribution can arise because of a focus on the word level instead of on the tweet/message as a whole. This focus on words can be problematic in two ways.
First, the problem of attributing something to the "hate speech" category is not limited to the question of what hate speech is. The article "Automated Hate Speech Detection and the Problem of Offensive Language" demonstrates how easily hate speech and offensive language are confused with each other.
Another problem is connected with sarcasm or with quoting an opponent's arguments; it will be discussed later.
In this work we hoped that, because such a specific area of the problem was picked (sexism detection) and because the annotator is familiar with the problem on a more personal level, it would be easier to distinguish between the categories. Nevertheless, several problems arose. It is also worth noting that while we were still in the process of research and first thought about simply picking "hate speech" as the topic, we made a list of many publicly or semi-publicly available corpora. They are mainly in English, and the annotation approaches differ, but the list could be helpful for further hate-speech research. You can find the whole list here.
References:
All references are provided with links to the articles.
Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. 11th International Conference on Web and Social Media (ICWSM), pages 512–515, Montreal, Quebec, Canada.
Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive Language Detection in Online User Content. 25th International World Wide Web Conference (WWW 2016), pages 145–153.
Björn Ross, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. 2016. Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis. Bochum, Germany, September.
Zeerak Waseem. 2016. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter. 1st Workshop on Natural Language Processing and Computational Social Science, pages 138–142.
Zeerak Waseem and Dirk Hovy. 2016. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. NAACL Student Research Workshop, San Diego, California, June. Association for Computational Linguistics.
1.2. The process of collecting corpora.
1.2.1. Media sources.
The first source was the Russian social network VKontakte (available at https://vk.com/). It is well equipped for the needs of developers: it provides an API with detailed documentation.
We chose three popular groups on VK representing three different Russian media outlets. They all have more than 500,000 subscribers, and ideologically they represent different directions. The media we chose are:
"Lentach" (https://vk.com/oldlentach) is a highly oppositional media in the past, now with a slightly less obvious focus, especially interesting because of the huge number of subscribers (over 2 million). Comments are cleaned by the bot: comments shorter than five words are prohibited, which members of the community are fighting with the help of padding - for example, writing five words comments like: "bot blow [me] three four five". Also in the comments obscene words are prohibited, but but not too ingenious, which gives rise to a lot of obscene vocabulary, where words are spelled in the opposite direction or letters are omitted. The rest of the restrictions seem insignificant to us.
"Medusa" (https://vk.com/meduzaproject) is a liberal, oppositional media (more than 500,000 subscribers); the only media that promotes feminist views (at least on words). However, there may also be very long sexist debates in the comments. (Comments are open to all)
"RT News in Russian" (https://vk.com/rt_russian) - part of Russia Today, basically just automatically post articles published on the portal. Despite the impressive list of rules for commentators, comments are not particularly moderated. Since there are so many posts, the comments are not so numerous, despite the large number of subscribers (over a million).
We approached all these media in the same way, trying to find the comments we were interested in. The following function was responsible for collecting the corpus (it can be found in this file):
def make_corpus(name, community, query_list, service_token, vk_api_vers):
We used it to find posts containing any of the words in the query list, because we assumed that news related to these topics would cause the most discussion. Here we also give a translation of the query list:
query_list = {'sexism', 'meToo', 'sexual harassment', 'decriminalization of domestic violence', 'rape', 'feminism', 'Shurygina', 'harassment'}
For each post we took only the first hundred comments: quite often the news posts were compilations (they included several different news items), and then a whole dialogue could be devoted to something else. Often this also turned out to be a type of hate speech; for example, our corpus contains a rather long racist dialogue related to Yakutia. In the resulting corpus we wrote the post id, the comment id and the label (by default it was simply set to "sexist", and then I manually checked the text of the comment). When I checked each comment, I marked the label as one of three possible ones: "sexist", "not sexist" and "sexist in context". The latter category was used for messages that were not sexist in isolation from the context, but worked that way in conjunction with the post. I did not include them in my train/test datasets, but they also present an interesting potential for analysis. (Read more about this in the last part of our report.)
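For illustration, here is a minimal sketch of how a function with this signature could collect the data through the VK API. The real make_corpus in the repository may differ in details; the CSV column names and the use of the wall.search and wall.getComments methods reflect the description above, not the exact implementation.

```python
import csv
import requests

API_URL = "https://api.vk.com/method/"

def make_corpus(name, community, query_list, service_token, vk_api_vers):
    """Sketch: search a community wall for posts matching the queries and save
    the first hundred comments of each post to a CSV file."""
    with open(name, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["post_id", "comment_id", "label", "text"])
        for query in query_list:
            # wall.search returns the community's posts matching the query
            posts = requests.get(API_URL + "wall.search", params={
                "domain": community, "query": query, "count": 100,
                "access_token": service_token, "v": vk_api_vers,
            }).json()["response"]["items"]
            for post in posts:
                # only the first hundred comments of each post are taken
                comments = requests.get(API_URL + "wall.getComments", params={
                    "owner_id": post["owner_id"], "post_id": post["id"],
                    "count": 100,
                    "access_token": service_token, "v": vk_api_vers,
                }).json()["response"]["items"]
                for comment in comments:
                    # the label defaults to "sexist" and is corrected manually later
                    writer.writerow([post["id"], comment["id"], "sexist",
                                     comment.get("text", "")])
```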
1.2.2. Forum sources.
The main source of sexist comments was the Antibab website (available at https://antiwomen.ru/index.php), whose name can be literally translated as "Anti-Women". There, a group of users (presumably mostly male) discusses women and shares ultra-patriarchal views on life.
A lot of material was collected, more than 10,000 posts, but I decided to reduce the sample of messages taken from this forum: people there have very specific vocabulary, slang and manner of speech (plus the forum is mostly for people over thirty), so there was always a danger that instead of detecting sexism, the model would simply detect the users of this forum. To balance this, I found a couple of topics there with mostly non-sexist comments and used them to get enough non-sexist material with a similar manner of speech. Data collection was the responsibility of the method:
def make_corpus_ant_forum(name, link_to_topic):
written by me using Beautiful Soup. The method takes the name of the new file where the corpus will be saved and a link to the topic to scrape.
The main source of non-sexist speech was the Holywar forum (available at https://holywarsoo.net/index.php), from which I extracted a large topic devoted to family relations (problems with parents, close relatives, etc.). This was done, once again, in the hope that it would be possible to balance the sexist and non-sexist material already at the level of the data; we would not want the model to consider any mention of women as sexist, for example. The method used to extract this data worked similarly to the previous one.
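As a rough illustration of both scraping methods, the sketch below shows how such a topic scraper could be written with requests and Beautiful Soup. The HTML class used to locate the posts is an assumption and has to be adapted to each forum's actual markup, and pagination across topic pages is left out.

```python
import csv
import requests
from bs4 import BeautifulSoup

def make_corpus_ant_forum(name, link_to_topic):
    """Sketch: scrape the posts of a single forum topic page into a CSV file."""
    page = requests.get(link_to_topic)
    page.encoding = "utf-8"
    soup = BeautifulSoup(page.text, "html.parser")
    with open(name, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "text"])
        # each message is assumed to sit in a container with this class name
        for post in soup.find_all("div", class_="post_text"):
            text = post.get_text(" ", strip=True)
            if text:
                # default label, corrected later during manual annotation
                writer.writerow(["sexist", text])
```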
1.2.3. Description of the resulting corpora.
All of them can be found in the folder gsoc2019_crosslang/russian sexist corpora/annotated/
Corpus file | Description |
---|---|
ant_1.csv | Non-sexist comments (with some exceptions) from the Antibab forum |
ant_2.csv | Sexist comments (with some exceptions) from the Antibab forum |
media_1.csv | Non-sexist and sexist comments from Lentach |
media_2.csv | Non-sexist and sexist comments from Medusa |
media_3.csv | Non-sexist and sexist comments from Russia Today |
ns_1.csv | Very large, purely non-sexist corpus from the Holywar forum |
In the end we have 2,577 annotated sexist comments and 21,526 non-sexist comments, which makes our corpus imbalanced to some extent.
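For anyone who wants to work with these files, here is a minimal sketch of loading them; the column names are assumptions and should be checked against the actual CSV headers.

```python
import glob
import pandas as pd

# Load every annotated CSV file into one dataframe
paths = glob.glob("gsoc2019_crosslang/russian sexist corpora/annotated/*.csv")
corpus = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

# "sexist in context" entries are not used for train/test, as described above
corpus = corpus[corpus["label"] != "sexist in context"]
print(corpus["label"].value_counts())  # roughly 2,577 sexist vs 21,526 non-sexist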
1.3. Problems encountered during collection and annotation.
1.3.1. Sarcasm.
One of the main problems was distinguishing between sarcasm and seriously meant sexist speech. Example:
«Aren, такую хрень несешь. Еще скажи, что мужик полигамен и должен осеменить больше самок, потому что так заложено природой.»
Translation:
"Aren, you’re saying some bullshit. What next, are you going to say that all men are polygamous and should inseminate more females, because it is in the nature?”
The comment is not sexist by itself, but it contains sexist vocabulary. Here the problem could be solved with word co-occurrence: the word "bullshit" could work as a signal not to take the message literally.
But another example proves to be even more complicated. This message is a reaction to the news that fewer female football fans will be shown on TV during matches in order to avoid objectification:
“Извините, но это какой пиздц. Может вообще женщинам запретим матчи посещать? Ну чтоб прям наверняка? Можно и в паранджу всех закутать. Это вообще 100% вариант.”
Translation:
"I'm sorry, but that's fucked up. Maybe we should ban women from attending matches at all then? Just to be sure? You can also dress them all in burqas. That should work 100%.”
1.3.2. Special slang.
The forum which was used as the main source of sexist speech has very specific slang, not typical of the Russian language overall. This could be problematic, because the system should not be limited to just a specific small group of people. We attempted to compensate for this with the use of non-sexist topics (discussions about phones and attempts to agree on a date for a real-life meeting) from the same forum, to ensure that there would be enough non-sexist material with the same manner of speech.
1.3.3. Single annotator.
Of course, considering that we had a single annotator, the labeling process was far from perfect and quite subjective. Even though the definition and guidelines were discussed in detail with my mentor, some cases still posed a challenge, and possibly imperfect solutions were picked in the end. (More about this in the guidelines part.) That is also the reason why I cannot recommend the data for immediate use right now. It still needs to be reviewed by several other annotators, preferably ones familiar with sexism at the expert level.
1.4. The guidelines.
1. Any generalizations that demean or degrade the female sex were considered sexist.
1.1. Sexist statements and generalizations made by women were also considered sexist.
2. Sexism in dialogues with women.
2.1. The use of unmotivated obscene vocabulary specifically emphasizing the female sex of the interlocutor was considered sexist (words such as "whore").
2.1.1. Racist curses or swear words implying the low intellect of the interlocutor were not considered sexist (except when they neighbored or were derived from some generalization).
2.2. Another sign of a sexist comment was when the interlocutor, in an attempt to insult a woman, attacked exclusively her appearance. This was considered hidden sexism.
2.3. Victim blaming (aimed at the victim or occurring in a discussion about the victim) was considered a case of sexism.
3. Sexism and politics.
3.1. Criticism of feminism, or simply aggressive statements about feminism, was not considered sexist, because it could be politically motivated.
3.2. Criticism of female politicians was not considered sexism unless it was reduced to criticism of appearance or to generalizations about women in politics.
3.3. A special case: the infamous story of Diana Shurygina's rape, which became particularly well known through television. Every piece of news about Ms. Shurygina caused a lot of comments, and many people speculated that she was not a victim.
This always sparked a lot of discussion, including a lot of sexist rhetoric. The news portals raised this topic so often (even a few years after the incident) that every news item mentioning Diana Shurygina attracted a large number of comments reacting negatively to the fact that news about her continued to be published. We did not consider these comments sexist, nor did we think it was sexist to use her name as a common noun as long as it referred to "hyped-up news"[1]. If her name was used as a common noun for a rape victim, the comment was considered sexist.
In retrospect, this may not have been the best solution. This is one of those cases where the perspective of another annotator would have been very useful.
[1] - in the sense that the news outlets were exploiting the story and trying to spark more and more debates about whether she was guilty or not, even after the court's decision.
4. Sexism in dialogues with men.
4.1. We did not mark every comment that contained sexist expressions as sexist. (See Part 1 on the difference between offensive speech and hate speech.)
4.1.1. Therefore, for example, among the comments of the Antibab forum, a lot of comments are marked as non-sexist, unless the purpose of the comment was some kind of offensive generalization.
2.1. Preprocessing
The preprocessing.py file was responsible for preprocessing the text. At first I planned to divide the data into test and train sets myself and wrote a separate function for this purpose. Later this started to seem like not the best way to deal with it, which is why in the end I used the function from the sklearn library to split the data.
Nevertheless, this way of handling the distribution of comments proved to be useful when preparing to save the processed corpora (it helped to process the data in small portions).
In the end, a number of functions were written to preprocess the text, mostly using the Russian-language support of nltk: functions to strip punctuation, extract quotes and references using regular expressions, remove stopwords, and perform lemmatization and stemming. (However, it seems to me that the latter two are not very useful in the case of Russian hate speech detection.) All the resulting texts were saved in the folder "TemporalCorpora" to facilitate the preprocessing step for those who wish to use this corpus in their research.
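The sketch below illustrates what such helpers can look like; the regular expressions and the exact chaining of the steps are assumptions, not a copy of preprocessing.py.

```python
import re
import string

from nltk.corpus import stopwords            # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

russian_stopwords = set(stopwords.words("russian"))
stemmer = SnowballStemmer("russian")

def strip_quotes_and_references(text):
    """Remove quoted fragments and VK-style [id123|Name] references."""
    text = re.sub(r'"[^"]*"|«[^»]*»', " ", text)
    return re.sub(r"\[id\d+\|[^\]]*\]", " ", text)

def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation + "«»…"))

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in russian_stopwords]

def stem(tokens):
    return [stemmer.stem(t) for t in tokens]

# Example of chaining the steps, roughly matching the rows of the tables below
raw = "«Какой-то комментарий», написал пользователь [id123|Имя]."
tokens = strip_punctuation(strip_quotes_and_references(raw)).split()
print(stem(remove_stopwords(tokens)))
```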
2.2. Attempted models and their results.
I ambitiously planned three different types of embeddings and several models of different complexity. I started with simple tf-idf as an embedding and combined it with a naive Bayes classifier.
Since our corpus is very imbalanced, instead of calculating the f1-score I used the balanced accuracy score as a metric, which is recommended in such cases. Just in case, we compared the results at different stages of text preprocessing and then compared the use of logistic regression instead of the naive Bayes classifier. The results can be found in the table below.
The difference in results between the types of preprocessing was not significant. The only interesting thing was that the result is, surprisingly, worse when both punctuation and stop words disappear from the text. This probably deserves a separate analysis in the future.
Obviously, all the results below are not particularly stable or set in stone: the accuracy in this case behaved more like an interval. I dealt with this in an admittedly crude way: I simply obtained the results five times and took the mean, in an attempt to catch the logic behind the changes.
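A compact sketch of this baseline setup, assuming the comments and labels have already been loaded into two lists (the helper name mean_balanced_accuracy is mine, not from the repository):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def mean_balanced_accuracy(texts, labels, classifier, n_runs=5):
    """Average the balanced accuracy over several random train/test splits."""
    scores = []
    for seed in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.2, random_state=seed, stratify=labels)
        model = make_pipeline(TfidfVectorizer(), classifier)
        model.fit(X_train, y_train)
        scores.append(balanced_accuracy_score(y_test, model.predict(X_test)))
    return np.mean(scores)

# texts / labels are lists built from the annotated corpora
# print(mean_balanced_accuracy(texts, labels, MultinomialNB()))
# print(mean_balanced_accuracy(texts, labels, LogisticRegression(max_iter=1000)))
```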
Preprocessing | tf-idf + NB | tf-idf + Logistic Regression |
---|---|---|
no preprocessing | 63% | 62% |
minus quotes and references | 64% | 64% |
minus punctuation | 61% | 64% |
minus all above | 62% | 62% |
lemmatization + all above | 66% | 63% |
In the last few days, while I was hurrying to finish this report, I had a late but interesting idea: I realized that I could still make my corpus less imbalanced, which could change the results.
To do this, I excluded the purely non-sexist corpus (ns_1.csv) from my final test/train data. The corpus remained imbalanced, and I still had to use the balanced accuracy score as a metric, but now the ratio was roughly 2,000 to 8,000 rather than 2,000 to 20,000.
The improvement was immediate and significant. Unfortunately, the idea came to me too late, so I couldn't test it as thoroughly as I wanted, but the results can be found in the table below.
Preprocessing | tf-idf + Logistic Regression |
---|---|
no preprocessing | 73% |
minus quotes and references | 71% |
minus punctuation | 70% |
minus all above | 69% |
lemmatization + all above | 74% |
The more advanced type of embedding which I attempted is ELMo. Our architecture was quite simple: we used ELMo embeddings (first a pretrained Russian embedding, which we then trained further on our own data; the resulting options and weight files can be found here in the repository), plugged the results into an LSTM, and then into a simple feedforward neural network with one layer.
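A rough sketch of this kind of architecture in PyTorch, assuming the AllenNLP Elmo module and placeholder file names for the options and weights; the hidden size and other hyperparameters are illustrative, not the ones actually used.

```python
import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

class ElmoLstmClassifier(torch.nn.Module):
    """ELMo embeddings -> LSTM -> single feedforward layer (sexist / not sexist)."""

    def __init__(self, options_file, weight_file, hidden_size=128):
        super().__init__()
        # one output representation of the (fine-tuned) ELMo model
        self.elmo = Elmo(options_file, weight_file,
                         num_output_representations=1, dropout=0.5)
        # 1024 is the usual ELMo output dimension; adjust to the options file
        self.lstm = torch.nn.LSTM(input_size=1024, hidden_size=hidden_size,
                                  batch_first=True)
        self.out = torch.nn.Linear(hidden_size, 2)

    def forward(self, sentences):
        # sentences: a list of tokenized comments (lists of word strings)
        character_ids = batch_to_ids(sentences)
        embeddings = self.elmo(character_ids)["elmo_representations"][0]
        _, (hidden, _) = self.lstm(embeddings)
        return self.out(hidden[-1])

# model = ElmoLstmClassifier("options.json", "weights.hdf5")
# logits = model([["пример", "комментария"]])
```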
Because training the ELMo took what seemed like a million hours, I couldn't play around with the data much and only tried it on my very imbalanced corpus. Nevertheless, the results were relatively good.
Preprocessing | ELMo (fine-tuned by me) | ELMo (pretrained and fine-tuned by me) |
---|---|---|
no preprocessing | 67% | 74% |
Both numbers are balanced accuracy scores, the same metric as before.