Maja Gwozdz's blog

Final Report: The MAGA Corpus


The MAGA* Corpus of Anglophone Political Debate Online is now available. The project page can be viewed here. A full paper is attached to this post and the MAGA corpus and annotation guidelines can be consulted on the aforementioned project page or on this site under the Resources tab.

*Make Annotations Great Again


Phase 2 - Report: Maja Gwozdz

In the second phase of GSoC, I continued annotating political tweets and corrected some typos in the dataset. I created a more varied corpus by collecting tweets related to American, British, Canadian, and Australian socio-political affairs (I am also collecting New Zealand tweets, but they are rare). As regards the annotation guidelines, I improved the document stylistically and added relevant examples to each section. I also created a short appendix listing the most important politicians and their party affiliations, so as to facilitate future annotations.

As for the dataset itself, I am happy to announce that there were far more idioms and proverbs than in the previous stage. The following list presents the ten most frequent hashtags extracted from the tweets (the figures in brackets give the relative frequency of each hashtag):

1. #Brexit (3.93)

2. #TrudeauMustGo (3.23)

3. #JustinTrudeau (3.07)

4. #MAGA (2.99)

5. #Tories (2.53)

6. #Drumpf (2.23)

7. #Corbyn (2.19)

8. #Labour (2.08)

9. #Tory (1.98)

10. #ImpeachTrump (1.73)
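
As an illustration, relative hashtag frequencies of this kind could be computed roughly as follows. This is only a sketch: the post does not say how relative frequency was defined, so the code assumes it is a hashtag's share (in percent) of all hashtag occurrences in the collected tweets. The function name `hashtag_relative_frequency` and the sample tweets are invented for the example.

```python
import re
from collections import Counter

# Matches a '#' followed by word characters (letters, digits, underscore).
HASHTAG = re.compile(r"#\w+")


def hashtag_relative_frequency(tweets):
    """Map each hashtag to its percentage of all hashtag occurrences.

    Assumption: "relative frequency" means share of all hashtag tokens;
    the corpus may have used a different definition.
    Matching is case-sensitive, so '#Tory' and '#tory' are counted separately.
    """
    counts = Counter(tag for tweet in tweets for tag in HASHTAG.findall(tweet))
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the result from most to least frequent.
    return {tag: 100 * n / total for tag, n in counts.most_common()}


# Invented example tweets, not taken from the corpus.
sample = [
    "Another day, another vote #Brexit #Tories",
    "#Brexit means Brexit",
    "Time for change #Labour",
]
freqs = hashtag_relative_frequency(sample)
for tag, share in freqs.items():
    print(f"{tag} ({share:.2f})")
```

Normalising case before counting (for example with `tag.lower()`) would merge spelling variants, at the cost of losing the original capitalisation.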

Our core set of hashtags (balanced with respect to political bias) was as follows: #MAGA, #GOP, #resist, #ImpeachTrump, #Brexit, #Labour, #Tory, #TheresaMay, #Corbyn, #UKIP, #auspol, #PaulineHanson, #Turnbull, #nzpol, #canpoli, #cpc, #NDP, #JustinTrudeau, #TrudeauMustGo, #MCGA. Many more hashtags are being used but they usually yield fewer results than the above set.

Below are a few figures that summarise the current shape of the corpus:

Left-wing bias: ca 55%

Male authors: ca 49%

Polarity: ca 44% negative, ca 47% neutral, ca 9% positive

Mood: ca 50% agitated, ca 21% sarcasm, ca 13% anger, ca 9% neutral, ca 4% joy

Offensive language: present in approximately 17% of all tweets

Swearing by gender: ca 53% males

Speech acts: ca 76% assertive, ca 38% expressive, ca 10% directive, ca 3% commissive, 0.2% metalocutionary

In the third stage I will continue annotating political tweets and write a comprehensive report about the task. My mentors have also kindly suggested that they could hire another student to provide additional judgments on the subjective categories (especially polarity and mood). Having more annotators will undoubtedly make the dataset a more valuable resource.

Phase 1 - Report: Maja Gwozdz

In the first phase of GSoC 2018, I started annotating political tweets. The corpus includes, for instance, tweets related to US, Canadian, UK, and Australian politics and current social affairs. Each entry in the database records the author, their gender, the political bias, the polarity of the entry (I'm using a discrete scale: -1 for a negative utterance, 0 for a neutral one, 1 for a positive one), speech acts, the mood of the tweet (for instance, sarcasm or anger), any swear words / offensive language, and the keywords, that is, the concrete parts of the tweet that led to the polarity judgment.
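
To make the categories above concrete, one annotated entry could be represented along these lines. This is a minimal sketch rather than the actual layout of the database: the class name `TweetAnnotation`, the field names, and the example values are all hypothetical, and only the polarity scale (-1, 0, 1) is taken from the description above.

```python
from dataclasses import dataclass, field
from typing import List

# Discrete polarity scale described above: negative, neutral, positive.
VALID_POLARITY = {-1, 0, 1}


@dataclass
class TweetAnnotation:
    """One annotated tweet. All field names are hypothetical."""

    author: str
    gender: str
    political_bias: str
    polarity: int
    speech_acts: List[str] = field(default_factory=list)
    mood: str = ""
    offensive_language: List[str] = field(default_factory=list)
    # Parts of the tweet that led to the polarity judgment.
    keywords: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject values outside the three-point scale.
        if self.polarity not in VALID_POLARITY:
            raise ValueError(f"polarity must be -1, 0 or 1, got {self.polarity!r}")


# Invented example entry, not taken from the corpus.
entry = TweetAnnotation(
    author="example_user",
    gender="unknown",
    political_bias="left",
    polarity=-1,
    speech_acts=["assertive", "expressive"],
    mood="sarcasm",
    keywords=["what a great plan"],
)
```

Keeping `speech_acts` as a list allows a single tweet to carry more than one speech act label, which is a design choice of this sketch.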

In order to obtain the relevant political tweets, I used Grasp and a list of popular political hashtags (to mention but a few: #MAGA, #TrudeauMustGo, #auspoli, #Brexit, #canpoli, #TheresaMay). I also prepared the annotation guidelines, so that other people interested in the project could offer their own judgment and provide additional annotations. Having more judgments will render the corpus more valuable. In the next stage of GSoC, I hope to have enough judgments from other people to estimate the agreement score and arrive at (more) objective scores.
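
The post does not say which agreement score is meant. As one common choice for two annotators labelling the same items, Cohen's kappa could be computed as in the sketch below; the function name `cohen_kappa` and the two label lists are invented for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Counter returns 0 for labels the second annotator never used.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented polarity judgments (-1, 0, 1) from two hypothetical annotators.
annotator_1 = [-1, 0, 0, 1, -1, 0]
annotator_2 = [-1, 0, 1, 1, -1, -1]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

With more than two annotators, a measure such as Fleiss' kappa or Krippendorff's alpha would be needed instead.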

The database is currently available as a Google Sheet, which is a relatively easy way to store data and allow for parallel annotation.