During the initial phase, I have been working mostly with images, both edited photos and screenshots of website appearance, to see if we can statistically model what a "fake news" site might look like. To find edited images, I have been using a method called Error Level Analysis (ELA), which can detect different levels of compression in JPEG images. The technique has been very effective so far at finding edited images using a training set from reddit's r/photoshopbattles, although collecting a training set from this source has taken quite some time.
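To make the idea behind Error Level Analysis concrete, here is a minimal sketch, assuming the Pillow library is available. It is not the exact pipeline I used: the function names `error_level_image` and `ela_summary` and the re-save quality of 90 are illustrative choices. The idea is to re-save the image as a JPEG at a known quality and look at how much each pixel changes; regions that were pasted in or edited often sit at a different compression level from the rest of the image and so change by a different amount.

```python
import io

from PIL import Image, ImageChops, ImageStat


def error_level_image(img, quality=90):
    """Re-save `img` as JPEG at `quality` and return the per-pixel absolute difference."""
    rgb = img.convert("RGB")
    buf = io.BytesIO()
    rgb.save(buf, format="JPEG", quality=quality)  # recompress in memory
    buf.seek(0)
    resaved = Image.open(buf).convert("RGB")
    return ImageChops.difference(rgb, resaved)


def ela_summary(img, quality=90):
    """Reduce the error-level image to two simple numbers (hypothetical features)."""
    diff = error_level_image(img, quality)
    # getextrema() returns one (min, max) pair per colour band
    max_diff = max(high for _, high in diff.getextrema())
    band_means = ImageStat.Stat(diff).mean
    return {"max": max_diff, "mean": sum(band_means) / len(band_means)}
```

Summary numbers like these (or the full difference image) could then serve as features for a classifier; which features work best is something to determine on the training set.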
Using ELA, I have trained a random forest classifier that quite accurately separates edited from unedited images; its output will be an input to a metaclassifier that we will develop later this summer.
On the website analysis side, I have been using Masha's excellent sources list and PhantomJS to take screenshots of credible news sites as well as historically non-credible ones. Again, the training set has been the biggest hurdle to overcome, but progress is good. While these two features may not be conclusive on their own about whether an article is fake or real news, they have been very good indicators in preliminary analyses. As we train our metaclassifier in the coming months, I see us weighting the NLP features much more heavily than the image features, but using the image-based features to verify or confirm our beliefs when we are on the fence about how to automatically classify an article.
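One possible reading of that weighting idea, as a rough sketch: the real metaclassifier will be trained rather than hand-written, so the function name `combine_scores`, the weight of 0.8, and the "on the fence" band of 0.4 to 0.6 below are all hypothetical placeholders. Each score is assumed to be a probability in [0, 1] that the article is fake.

```python
def combine_scores(nlp_score, image_scores, nlp_weight=0.8, fence=(0.4, 0.6)):
    """Combine an NLP score with image-based scores (all assumed to be P(fake))."""
    if not image_scores:
        # No image evidence available: fall back to the NLP score alone.
        return nlp_score

    low, high = fence
    if low <= nlp_score <= high:
        # NLP is undecided, so let the image features nudge the result,
        # while still weighting the NLP score much more heavily.
        image_mean = sum(image_scores) / len(image_scores)
        return nlp_weight * nlp_score + (1 - nlp_weight) * image_mean

    # NLP is confident: keep its verdict unchanged.
    return nlp_score
```

For example, `combine_scores(0.5, [1.0, 1.0])` moves an undecided NLP score toward "fake" because both image-based scores point that way, while `combine_scores(0.9, [0.1])` leaves a confident NLP score untouched.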
I would like to do more NLP work in addition to the image processing, since the image work should be finished within a few days (as soon as we finish and clean up our training data) and will then be ready for implementation alongside a metaclassifier. I am eager to reconvene with the team to see where I can help with more text-based analysis, and I am so excited to see what we can accomplish in the coming months!