In the second phase of GSoC, I continued annotating political tweets and corrected some typos in the dataset. I created a more varied corpus by collecting tweets related to American, British, Canadian, and Australian socio-political affairs (I am also collecting New Zealand tweets, but they are quite rare). As for the annotation guidelines, I improved the document stylistically and added relevant examples to each section. I also created a short appendix listing the most important politicians and their party affiliations, to facilitate future annotations.
As for the dataset itself, I am happy to announce that there were far more idioms and proverbs than in the previous stage. The following list presents the ten most frequent hashtags extracted from the tweets (the figures in brackets give the relative frequency of each hashtag):
1. #Brexit (3.93)
2. #TrudeauMustGo (3.23)
3. #JustinTrudeau (3.07)
4. #MAGA (2.99)
5. #Tories (2.53)
6. #Drumpf (2.23)
7. #Corbyn (2.19)
8. #Labour (2.08)
9. #Tory (1.98)
10. #ImpeachTrump (1.73)
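A ranking like the one above could be produced along the following lines. This is only a minimal sketch, not the script used for the project: the function name `hashtag_relative_frequencies`, the regular expression, the lower-casing of hashtags, and the definition of relative frequency as a percentage of all hashtag occurrences are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a hashtag is "#" followed by one or more word characters.
HASHTAG_RE = re.compile(r"#\w+")


def hashtag_relative_frequencies(tweets, top_n=10):
    """Return the top_n hashtags with their share (in %) of all hashtag occurrences."""
    # Assumption: hashtags are compared case-insensitively, so they are lower-cased.
    counts = Counter(
        tag.lower() for text in tweets for tag in HASHTAG_RE.findall(text)
    )
    total = sum(counts.values())
    if total == 0:
        return []
    return [(tag, round(100 * n / total, 2)) for tag, n in counts.most_common(top_n)]


# Made-up sample tweets, for illustration only.
sample = [
    "No deal is better than a bad deal #Brexit #Tories",
    "#Brexit means #Brexit",
    "Time for a change #TrudeauMustGo",
]
top = hashtag_relative_frequencies(sample)
for rank, (tag, share) in enumerate(top, start=1):
    print(f"{rank}. {tag} ({share})")
```

Note that this sketch counts every occurrence of a hashtag, including repeats within one tweet; counting each hashtag once per tweet would be an equally reasonable choice.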
Our core set of hashtags (balanced with respect to political bias) was as follows: #MAGA, #GOP, #resist, #ImpeachTrump, #Brexit, #Labour, #Tory, #TheresaMay, #Corbyn, #UKIP, #auspol, #PaulineHanson, #Turnbull, #nzpol, #canpoli, #cpc, #NDP, #JustinTrudeau, #TrudeauMustGo, #MCGA. Many more hashtags are being used, but they usually yield fewer results than the above set.
Below are a few figures that aptly summarise the current shape of the corpus:
Left-wing bias: ca 55%
Male authors: ca 49%
Polarity: ca 44% negative, ca 47% neutral, ca 9% positive
Mood: ca 50% agitated, ca 21% sarcasm, ca 13% anger, ca 9% neutral, ca 4% joy
Offensive language: present in approximately 17% of all tweets
Swearing by gender: ca 53% males
Speech acts: ca 76% assertive, ca 38% expressive, ca 10% directive, ca 3% commissive, 0.2% metalocutionary
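Percentages of this kind can be derived from the annotations with a simple tally. The sketch below is hypothetical: the function name `label_shares` and the representation of each tweet as a set of labels are assumptions, not the project's actual data format. It also assumes that a tweet may carry more than one label in a category, which is one possible reading of why the speech act figures add up to more than 100%.

```python
from collections import Counter


def label_shares(annotations):
    """Return the percentage of tweets carrying each label.

    `annotations` holds one collection of labels per tweet. Because a tweet
    may carry several labels, the shares can sum to more than 100.
    """
    n = len(annotations)
    if n == 0:
        return {}
    # set() ensures a label is counted at most once per tweet.
    counts = Counter(label for labels in annotations for label in set(labels))
    return {label: round(100 * c / n, 1) for label, c in counts.items()}


# Made-up annotations, for illustration only.
speech_acts = [
    {"assertive"},
    {"assertive", "expressive"},
    {"directive"},
    {"assertive"},
]
shares = label_shares(speech_acts)
print(shares)
```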
In the third stage, I will continue annotating political tweets and write a comprehensive report about the task. My mentors have also kindly suggested that they could hire another student to provide additional judgments on the subjective categories (especially polarity and mood). Having more annotators will undoubtedly make the dataset a more valuable resource.