Text-based fake news detection: Phase II
During Phase 2 of Google Summer of Code, I continued my data-aggregation efforts, developed the Source Checker tool, and trained a model that detects sensationalist news articles.
1. Data Aggregation
Throughout Phase 2, I crawled over 200 domains daily, and kept researching news domains and adding them to my crawler. As of today, I have aggregated over 30k news articles. Because I plan to use these articles to train classification models, here is the breakdown by potential class:
Sensationalism Classifier:
Sensationalist: 13k
Objective: 8.5k
Bias Classifier:
Right: 12k
Right-center: 1k
Least-biased: 3.5k
Left-center: 2k
Left: 4.5k
2. Source Checker
This tool was requested by the GSoC mentors, @vincent_merckx and @amra_dorjbayar. It takes as input a snippet of text (presumably a news article or part of one) and returns a graph that shows what types of domains publish the text, or parts of it.
The circles correspond to returned domains.
Circle size corresponds to amount of overlap between the input snippet and the domain.
Circle border color corresponds to bias: blue = left, red = right, green = neutral, grey = unknown.
Circle fill corresponds to unreliability: black circles mark domains that one of the lists classifies as fake, unreliable, clickbait, questionable, or conspiracy. The darker the circle, the more unreliable the domain.
Edges that connect circles correspond to overlap of statements - the thicker the edge, the bigger the overlap.
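To make the legend above concrete, here is a minimal Python 3 sketch of how a domain's attributes could be turned into drawing attributes. It is not the tool's rendering code: the function name `node_style`, the 0.0 to 1.0 scales for overlap and unreliability, and the pixel range for the radius are all assumptions made for illustration. Only the color legend comes from the description above.

```python
# Border colors as listed in the legend; anything else is treated as unknown.
BORDER_COLORS = {"left": "blue", "right": "red", "neutral": "green"}


def node_style(overlap, bias=None, unreliability=0.0):
    """Translate one domain's attributes into drawing attributes.

    overlap       -- share of the snippet found on the domain, 0.0 to 1.0 (assumed scale)
    bias          -- "left", "right", "neutral", or None when unknown
    unreliability -- 0.0 (not flagged) to 1.0 (most unreliable); assumed scale
    """
    radius = 5 + 20 * overlap  # hypothetical pixel range: 5 to 25
    grey = int(round(255 * (1.0 - unreliability)))  # 255 = white, 0 = black
    return {
        "radius": radius,
        "stroke": BORDER_COLORS.get(bias, "grey"),
        "fill": "#%02x%02x%02x" % (grey, grey, grey),
    }
```

A domain with full overlap, left bias, and maximal unreliability would come out as a large black circle with a blue border, while an unflagged domain of unknown bias would be a small white circle with a grey border.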
After GSoC ends, we will localize this tool for Dutch articles as well.
Architecture of the tool:
The text snippet is broken down into n-grams using the Pattern n-gram module. N-grams that consist primarily of stop words or named entities are discarded. A sample of the remaining n-grams is reconstructed into the original strings and submitted to the Google API as exact-phrase queries (in quotation marks). Each returned domain is then rated by the number of queries that returned it (more than 6 out of 10 = "high overlap", 3 to 6 = "some overlap", fewer than 3 = "minimal overlap") and matched against our database. The graph is rendered using the Pattern Graph module.
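The query-building and rating steps can be sketched in plain Python 3. This is an illustration under stated assumptions, not the tool's code: it uses a regular expression in place of the Pattern n-gram module, a tiny made-up stop-word list, an assumed n-gram length of 5, and an assumed cutoff of "more than half" for "primarily stop words". Named-entity filtering, sampling, and the search call itself are left out. Only the overlap thresholds are taken from the description above.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was", "that", "it", "for"}


def word_ngrams(text, n=5):
    """Return all contiguous word n-grams of the text as tuples."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def mostly_stop_words(ngram, threshold=0.5):
    """True if more than `threshold` of the tokens are stop words (assumed cutoff)."""
    stops = sum(1 for w in ngram if w.lower() in STOP_WORDS)
    return stops / len(ngram) > threshold


def exact_phrase_queries(text, n=5):
    """Build quoted exact-phrase queries from the n-grams that survive filtering."""
    kept = [g for g in word_ngrams(text, n) if not mostly_stop_words(g)]
    return ['"%s"' % " ".join(g) for g in kept]


def rate_overlap(hits):
    """Map the number of queries (out of 10) that returned a domain to a label."""
    if hits > 6:
        return "high overlap"
    if hits >= 3:
        return "some overlap"
    return "minimal overlap"
```

For example, `exact_phrase_queries("Senators voted against the bill")` yields a single quoted query, and a domain returned by 4 of 10 queries is rated "some overlap".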
3. Sensationalism Classifier
I used the aforementioned crawled data to train a model that classifies a news article as either sensationalist or not. This model currently achieves an F1-score of 92% (obtained through 5-fold cross-validation).
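For readers unfamiliar with the metric, the sketch below shows how an F1-score and a 5-fold split can be computed in plain Python 3. It is only an illustration: the helper names are made up, the folds are assigned round-robin without shuffling, and the score is computed for the positive (sensationalist) class, which may differ from how the reported figure was averaged.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

In a cross-validation run, a model would be trained on each `train` split, scored with `f1_score` on the matching `test` split, and the five scores averaged.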
It takes as input a two-column CSV file, where the first column contains the headlines and the second contains the article texts. The output file adds a third column with the label: 1 if the input is categorized as sensationalist, 0 if not.
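The input and output format can be illustrated with Python 3's standard `csv` module. This is a sketch, not the classifier's actual wrapper: `label_rows` and `shouty` are hypothetical names, `shouty` is a trivial placeholder standing in for the trained SVM, and the sketch assumes the input has no header row.

```python
import csv
import io


def label_rows(in_file, out_file, classify):
    """Copy each (headline, text) row to out_file with a 0/1 label appended."""
    reader = csv.reader(in_file)
    writer = csv.writer(out_file, lineterminator="\n")
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        headline, text = row[0], row[1]
        writer.writerow([headline, text, classify(headline, text)])


def shouty(headline, text):
    """Placeholder for the trained model: flags headlines ending in '!'."""
    return 1 if headline.strip().endswith("!") else 0


# In-memory files keep the example self-contained; real use would open CSV files.
source = io.StringIO('"You won\'t believe this!","Some text"\n'
                     '"Council approves budget","Other text"\n')
result = io.StringIO()
label_rows(source, result, shouty)
print(result.getvalue())
```

Passing the classifier in as a function keeps the file handling separate from the model, so the same loop works with any callable that returns 0 or 1.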
The classifier is an SVM, and it uses the following features:
POS tags (unigrams and bigrams)
Punctuation
Sentence length
Number of capitalized tokens (normalized by length of text)
Number of words that overlap with the Pattern Profanity word list (normalized by length of text)
Polarity and subjectivity scores (obtained through the Pattern Sentiment module)
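A few of the surface features in this list can be approximated in plain Python 3, as sketched below. The details are assumptions rather than the classifier's actual definitions: tokens are split on whitespace, a "capitalized token" is taken to mean one that starts with an uppercase letter, sentences are split on runs of `.`, `!`, or `?`, and `lexicon` is a hypothetical stand-in for the Pattern Profanity word list. POS tags and the sentiment scores need a tagger and a lexicon, so they are not covered here.

```python
import re
import string


def surface_features(text):
    """Compute simple length, capitalization, and punctuation features."""
    tokens = text.split()
    n_tokens = max(len(tokens), 1)  # avoid division by zero on empty input
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_sentences = max(len(sentences), 1)
    capitalized = sum(1 for t in tokens if t[0].isupper())
    punctuation = sum(1 for ch in text if ch in string.punctuation)
    return {
        "avg_sentence_length": len(tokens) / n_sentences,
        "capitalized_ratio": capitalized / n_tokens,
        "punctuation_ratio": punctuation / n_tokens,
    }


def lexicon_ratio(text, lexicon):
    """Share of tokens found in a word list, ignoring case and edge punctuation."""
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon) / len(tokens)
```

For the text "SHOCKING news! Read this now." this gives an average sentence length of 2.5 tokens, a capitalized ratio of 0.4, and a punctuation ratio of 0.4.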