Porting Pattern to Python 3: Phase II
The second GSoC coding period is over and has brought substantial progress. As of today, all of the submodules with the exception of pattern.server have been ported to Python 3.6. Pattern now shows consistent behavior for both Python 2.7 and Python 3.6 across all modules. All unit tests for pattern.db, pattern.metrics, pattern.graph and the language modules pattern.en, pattern.nl, pattern.de, pattern.fr, pattern.it and pattern.es pass, but there are still one or two failing test cases in pattern.text and pattern.vector, as well as some skipped tests in pattern.web due to changes in some web services' APIs.
Specifically, I have been working on the following issues in the second coding period:
June 26 – July 26
I continued working on the removal of the bundled pywordnet version, which has been deprecated for many years. A good part of its functionality is now integrated into NLTK; however, there have been many backward-incompatible changes to the interface over the years, which required significant changes to en/wordnet/__init__.py. I tried my best to hide all the backend changes from the Pattern user wherever possible by wrapping the new interface and maintaining the current Pattern en.wordnet interface. Since we now use NLTK's WordNet interface, the nltk package becomes a dependency from now on. The bundled pywordnet version has been removed completely.
Pattern comes with bundled versions of libsvm and liblinear, which provide various fast, low-level routines for support vector machines (SVMs) and linear classification in
pattern.vector. Both bundled versions were quite old, so I replaced both libraries with their most recent releases and made the necessary changes to make them work with the Pattern code base and support Python 3. The pre-compiled libraries have been removed for now because they were incompatible with the newer libsvm/liblinear versions. However, we might put pre-compiled binaries for some platforms back in at some point.
Another major issue was some refactoring in
pattern.web, most importantly the removal of sgmllib, which was deprecated in Python 2.6 and has been removed in Python 3. Fortunately, we are able to base HTMLParser in pattern.web upon the class of the same name in html.parser with some small adjustments (da00ff).
In the first coding period, I removed the bundled version of BeautifulSoup from the code base and made it an external dependency. This period, I upgraded the code to use the most recent version, BeautifulSoup 4, which also supports Python 3. As a result, some refactoring was done in
pattern.web to account for backward-incompatible changes to the parser interface. Furthermore, we now explicitly use the fast lxml parser for HTML/XML, and consequently the lxml package is now another dependency.
I removed the custom JSON parser in
pattern.db since the json module is now part of the standard library.
pattern.web contains routines to deal with PDF documents through the pdfminer library. There have been some inconsistencies between Python 2.7 and Python 3.6 which resulted in weird exceptions being raised. Currently, the problem is solved by using the pdfminer package for Python 2 and pdfminer.six for Python 3; however, this should ideally be refactored and unified at some point.
There has been a long-standing bug with the single layer perceptron (SLP) (#182) that was haunting me and that I couldn't resolve for weeks. As a consequence of this bug, the majority of the unit tests for
pattern.en failed. Last week, I ended up manually going through the commit history, essentially using a binary search approach, until I narrowed down the cause of the problem. Finally, all the problems are fixed as of 93235fe and the unit test landscape looks much cleaner now!
I also spent a lot of time making Python 2 and Python 3 behave consistently throughout all modules. This involved taking care of many of the subtle differences under the hood that I talked about in my first report. In order to avoid surprises for future developers who might not be aware of the differences between Python 2 and Python 3, I decided to put the following imports at the top of every non-trivial file to enforce consistent behavior for the most important parts:
from __future__ import unicode_literals
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division
from builtins import str, bytes, int
from builtins import map, zip, filter
from builtins import object, range
This should cover the most important differences and enforce Python 3-like division, imports, handling of literals and classes derived from object. Hunting down bugs in either Python 2 or Python 3 is laborious and time-consuming when you are unaware of what is really happening and different interpreters yield different results. Consequently, there should be a "no surprises" philosophy when it comes to the behavior of rudimentary data types such as str, bytes, int or functions such as map(), zip(), filter(), justifying the above explicit declarations even if not all of them are strictly necessary right now.
There were many encoding issues to be covered in various modules this period to make the code base work with both Python 2 and Python 3, predominantly in
pattern.text, pattern.en, pattern.vector and pattern.web. All string literals are now unicode by default (from __future__ import unicode_literals), and functions expect unicode inputs if not stated otherwise. The str object from future makes Python 2 strings behave like a Python 3 str (which is always unicode).
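To make the effect of the import header and the unicode-by-default convention more concrete, here is a minimal sketch. It is not code taken from Pattern: the helpers decode_utf8() and encode_utf8() are hypothetical names introduced only for illustration. The sketch assumes Python 3, where builtins is part of the standard library; on Python 2 the same imports rely on the third-party future package being installed.

```python
# Compatibility header as quoted in the post. On Python 3, `builtins` is a
# standard library module; on Python 2 it is assumed to come from the
# third-party `future` package.
from __future__ import unicode_literals
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division

from builtins import str, bytes, int
from builtins import map, zip, filter
from builtins import object, range


def decode_utf8(value, encoding="utf-8"):
    """Hypothetical helper: return `value` as a unicode string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


def encode_utf8(value, encoding="utf-8"):
    """Hypothetical helper: return `value` as a byte string."""
    if isinstance(value, bytes):
        return value
    return str(value).encode(encoding)


half = 1 / 2                       # true division instead of integer division
literal = "caf\u00e9"              # a unicode string literal without a u"" prefix
lengths = map(len, ["a", "bc"])    # an iterator that is evaluated lazily
raw = encode_utf8(literal)         # explicit conversion to bytes at the boundary
roundtrip = decode_utf8(raw)       # and back to unicode for internal use

print(half, isinstance(literal, str), list(lengths), roundtrip == literal)
```

The idea behind the two helpers is to keep unicode strings inside the code and to convert to or from bytes only at the boundaries (files, sockets, web responses), so that both interpreters see the same types everywhere else.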