During the first month of our Google Summer of Code project, I have been working along three distinct avenues:
1. Compiling news domains
Coming into the project, we had several lists of questionable domains:
The OpenSources list, which I worked with previously (the BS-detector Chrome extension is based on this list)
A list from PolitiFact, posted by Guy
MediaBiasFactCheck.com (MBFC), which we were also considering because it seems to have a very comprehensive list, with categories that may align with our needs (e.g. least-biased vs. right-biased vs. left-biased), as well as some information about each source
I wanted to aggregate all of this information/categorization in one place, so I put together a CSV of all domains from the three sources above (~2k domains), along with the categories assigned by each, any additional comments, etc. It's been interesting to look at the overlap as well as at the discrepancies among these. This file will probably have several applications throughout the course of the summer and will be made available to the general public.
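To illustrate the idea, here is a minimal sketch of how such an aggregation could work. It is not the actual script behind the CSV: the function names, column names, category labels, and the `example.com` domains are all placeholders I made up for illustration.

```python
import csv
import io
from urllib.parse import urlparse


def normalize_domain(raw):
    """Lower-case a URL or bare domain and strip the scheme, 'www.' and path."""
    raw = raw.strip().lower()
    if "://" not in raw:
        raw = "http://" + raw  # urlparse only fills netloc when a scheme is present
    host = urlparse(raw).netloc
    return host[4:] if host.startswith("www.") else host


def merge_sources(sources):
    """Merge {source name: [(domain, category), ...]} into {domain: {source name: category}}."""
    merged = {}
    for name, rows in sources.items():
        for domain, category in rows:
            merged.setdefault(normalize_domain(domain), {})[name] = category
    return merged


def to_csv(merged, source_names):
    """Write one row per domain and one column per source; blank if a source lacks the domain."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["domain"] + list(source_names))
    for domain in sorted(merged):
        writer.writerow([domain] + [merged[domain].get(s, "") for s in source_names])
    return out.getvalue()


# Hypothetical input: placeholder domains and labels, not entries from the real lists.
sources = {
    "opensources": [("http://www.example.com/news", "fake")],
    "mbfc": [("Example.com", "right-biased"), ("example.org", "least-biased")],
}
print(to_csv(merge_sources(sources), ["opensources", "mbfc"]))
```

Keeping one column per source makes both the overlap (rows with several filled columns) and the discrepancies (rows where the labels disagree) easy to spot.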
2. Crawling news domains
Later this summer we may end up building one or more text classifiers that would label a news article based on its content (rather than the source where it was published). For example, we may build a classifier for distinguishing sensationalist vs. objective news style, a classifier for detecting right vs. left bias, etc. The first step for any of these endeavors, of course, is to collect data.
I have started to crawl the domains from the compiled file mentioned above. My approach is to tread carefully and thoughtfully in order to ensure "clean", cohesive datasets, rather than to try to automatically crawl all domains and gather as much data as possible. I hand-pick each domain to be crawled, based on information from MBFC, OpenSources, and PolitiFact, as well as my own judgement, only picking those domains that clearly exhibit characteristics of a potential category (e.g. sensationalist, objective, pseudoscience).
I am still in the process of checking the domains and adding them to the crawler. As of today (6/24), I am crawling over 100 domains, accumulating more than 1k articles daily.
3. Source Checker
GSoC mentor Amra Dorjbayar (VRT) pitched an idea for a useful demo tool: a source checker that takes a text, chops it into pieces, googles the pieces, and returns the sources that publish this text, as well as a warning if one of the sources is not reputable. I have started putting together a prototype for this:
Using Pattern's n-gram module, I break the text into n-grams
I discard n-grams that would not be useful for googling, such as n-grams that consist primarily of named entities (e.g. 'Rand', 'Paul', 'of', 'Kentucky', 'Ted', 'Cruz', 'of', 'Texas', 'Mike', 'Lee') or of stop-words (e.g. 'to', 'being', 'able', 'to', 'boast', 'about', 'the', 'adoption', 'of', 'a')
I pick a random subset of the remaining n-grams and run them through Pattern's Google API
I use Pattern's Intertextuality module to choose only those results that match the text
These results can then be matched against our file of domains, and we can return to the user information about the sources that publish the text, potentially along with some sort of graph visualization
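The first three steps above can be sketched roughly as follows. This is a simplified stand-in, not the prototype itself: it uses plain Python instead of Pattern, the stop-word list is a tiny made-up sample, capitalization serves as a crude proxy for named entities, and the 0.5 threshold and the function names are my own assumptions. The search and Intertextuality steps are left out.

```python
import random
import re

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {
    "a", "an", "the", "of", "to", "in", "on", "for", "and", "or", "is", "are",
    "was", "were", "be", "being", "been", "about", "that", "this", "with",
    "as", "at", "by", "it",
}


def ngrams(text, n=10):
    """Return all runs of n consecutive word tokens in the text."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def is_useful(gram, max_ratio=0.5):
    """Reject n-grams dominated by stop-words or by capitalized tokens.

    Capitalization is only a rough approximation of named-entity detection.
    """
    stop = sum(1 for t in gram if t.lower() in STOP_WORDS)
    caps = sum(1 for t in gram if t[0].isupper())
    return stop / len(gram) <= max_ratio and caps / len(gram) <= max_ratio


def pick_queries(text, n=10, k=3, seed=None):
    """Pick up to k random useful n-grams and format them as quoted search queries."""
    candidates = [g for g in ngrams(text, n) if is_useful(g)]
    rng = random.Random(seed)
    chosen = rng.sample(candidates, min(k, len(candidates)))
    return ['"%s"' % " ".join(g) for g in chosen]


# Both example n-grams from the list above are rejected by the filter.
entities = ["Rand", "Paul", "of", "Kentucky", "Ted", "Cruz", "of", "Texas", "Mike", "Lee"]
stops = ["to", "being", "able", "to", "boast", "about", "the", "adoption", "of", "a"]
print(is_useful(entities), is_useful(stops))
```

Each query is wrapped in quotes so that a search engine would treat it as an exact phrase, which is what makes it possible to find pages that republish the same text.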
For evaluation, I am using a random subset of the crawled news articles (see above) - I break each article into snippets of various lengths, run each snippet through the tool, and check whether the domain from which the article was crawled matches one of the domains returned by the tool.
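A rough sketch of this evaluation loop, under stated assumptions: `check_sources` is a hypothetical callable standing in for the source checker (it takes a snippet and returns the set of domains found), the snippet lengths are arbitrary, and snippets are cut as consecutive non-overlapping word windows, which may differ from how the prototype cuts them.

```python
def split_into_snippets(text, length):
    """Cut the text into consecutive, non-overlapping snippets of `length` words."""
    words = text.split()
    return [
        " ".join(words[i:i + length])
        for i in range(0, len(words) - length + 1, length)
    ]


def hit_rate(articles, check_sources, lengths=(20, 50)):
    """For each snippet length, return the share of snippets whose origin domain is found.

    articles: iterable of (domain, text) pairs from the crawl.
    check_sources: hypothetical stand-in for the tool; maps a snippet to a set of domains.
    """
    articles = list(articles)
    results = {}
    for length in lengths:
        hits = total = 0
        for domain, text in articles:
            for snippet in split_into_snippets(text, length):
                total += 1
                if domain in check_sources(snippet):
                    hits += 1
        # None signals that no article was long enough for this snippet length.
        results[length] = hits / total if total else None
    return results
```

Reporting the hit rate per snippet length would show how much text the tool needs before it reliably recovers the original source.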
Unfortunately, this work stalled because of Google's API query limit, so the parameters have not yet been tested and tuned. We are currently looking into using an alternative search engine such as Faroo or the peer-to-peer YaCy, as well as into securing a budget to continue work on the Google functionality.
Overall, I believe our project is off to a great start, and I am excited to see what we achieve in July and August.