Final Report: Pattern 3
Project overview
During Google Summer of Code I worked on the development of the Pattern 3 framework. Pattern consists of many modules that help users work with web data, use machine learning algorithms, apply natural language processing techniques, and more. The main task was to complete the port of Pattern to Python 3 and refactor the code. Fixing bugs was also a priority, and as a result all tests now pass in the Travis CI automated testing system. New functions and modules have been added as well. The pattern.ru module allows users to work with Russian texts. The pattern.web module was extended with a VKontakte API class.
Main completed tasks
- Compiling libsvm and liblinear binaries for macOS, Ubuntu and Windows and adding them to Pattern so that pattern.vector works out of the box.
- Refactoring the Twitter API class used for social media mining.
- Testing all modules, fixing bugs and repairing failing Travis CI tests.
- Creating the VKontakte API class, which allows users to get information from the biggest Russian social network. With it you can retrieve a user's profile description and profile picture, the posts on a user's profile wall, and newsfeed posts matching a search keyword.
- Creating the pattern.ru module and collecting the necessary data: a named entities list, frequency dictionaries, a part-of-speech word list and a spelling list. A parser for part-of-speech tagging and a spellchecker are now available for Russian.
- Releasing Pattern for Python 3.
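
To illustrate how a frequency dictionary can drive a spellchecker like the one mentioned for pattern.ru, here is a minimal sketch of edit-distance spelling correction. It is not the pattern.ru implementation: the tiny `FREQUENCIES` dictionary, the `edits1` helper and the `suggest` function are all hypothetical names and data chosen for illustration only.

```python
from collections import Counter

# Hypothetical toy frequency dictionary (word -> count).
# A real frequency dictionary would contain many thousands of entries.
FREQUENCIES = Counter({"привет": 120, "привод": 15, "мир": 300, "мор": 5})

# Lowercase Russian alphabet used to generate candidate corrections.
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"


def edits1(word):
    """Return the set of strings that are one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, frequencies=FREQUENCIES):
    """Return the most frequent known word within one edit of `word`.

    Known words are returned unchanged; if no candidate is known,
    the input is returned as-is.
    """
    if word in frequencies:
        return word
    candidates = [w for w in edits1(word) if w in frequencies]
    if not candidates:
        return word
    return max(candidates, key=lambda w: frequencies[w])


if __name__ == "__main__":
    print(suggest("превет"))  # misspelling of "привет"
```

The sketch only considers candidates one edit away and picks the most frequent one, which keeps it short; a production spellchecker would typically also look at two-edit candidates and handle capitalization.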
Future Work
There are many opportunities to keep improving the Pattern framework and to introduce new functionality. For example, the web mining module can be extended with features that help users analyze data collected from social media. It is also important to add sentiment analysis to pattern.ru.
While working on the Pattern Python 3 release, I was also collecting political tweets from Twitter and posts from VKontakte to build a large dataset that can support the analysis of political debates and discussions. When the collection process is complete, the dataset will be made available to researchers.