27 Sep 2011
Ok, let’s get this out of the way right at the start - the title is a huge overstatement. This series of posts
will show you how to create a search engine using standard Python tools like Django, Celery and Whoosh with
CouchDB as the backend.
Celery is a message passing library that makes it really easy to run
background tasks and to spread them across a number of nodes. The most recent release added the NoSQL database
CouchDB as a possible backend. I’m a huge fan of CouchDB, and the
idea of running both my database and message passing backend on the same software really appealed to me.
Unfortunately the documentation doesn’t make it clear what you need to do to get CouchDB working, and what the
downsides are. I decided to write this series partly to explain how Celery and CouchDB work, but also to
experiment with using them together.
In this series I’m going to talk about setting up Celery to work with Django, using CouchDB as a backend. I’m
also going to show you how to use Celery to create a web-crawler. We’ll then index the crawled pages using
Whoosh and use a
PageRank-like algorithm to help rank the results. Finally,
we’ll attach a simple Django frontend to the search engine for querying it.
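To give a flavour of the ranking step, here is a minimal sketch of textbook PageRank power iteration. It is not necessarily the variant this series ends up using. The function name `pagerank`, the `links` dictionary format and the sample graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by power iteration (illustrative sketch only).

    `links` is a hypothetical format: it maps each page to the list of
    pages that page links to.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with the rank spread evenly over all pages.
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank between the pages it links to.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# A made-up four page site: "c" is linked to by every other page.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
best = max(ranks, key=ranks.get)
print(best)
```

In a real crawler the `links` mapping would be built from the outgoing links found on each crawled page.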
Read More...
20 May 2011
Recently Google announced that they were making their crowdsourced mapping tools available to
users in the United States. This tool lets users edit Google Maps, adding businesses and even roads, railways
and rivers. This raises interesting questions about whether the wisdom of the crowd can be applied to data that
requires a high degree of accuracy.
Open Street Map has been doing this since 2004, and has put
together an amazing resource of free map data, but only recently has Google begun to allow people to edit its
maps for large parts of the world.
Accurate mapping data is terribly important. While the majority of Google Maps queries are likely to be “how
do I get from my house to my aunt’s?” some are much more important. A war was almost caused when
the border
between Nicaragua and Costa Rica was incorrectly placed. While a war is a little far-fetched, it’s not
hard to imagine how a mistake on a map could cost someone’s life in a medical emergency.
Read More...
08 Apr 2011
Recently I’ve been working on a couple of open source projects and as part of them I’ve been using some
libraries. In order to use a library though, you need to understand how it is designed, what function calls
are available and what those functions do. The two libraries I’ve been using are
Qt and libavformat, which is part of
FFmpeg, and they show two ends of the documentation spectrum.
Now, it’s important to note that Qt is a massive framework owned by Nokia, with hundreds of people working on
it full-time including a number of people dedicated to documentation. FFmpeg on the other hand is a purely
volunteer effort with only a small number of developers working on it. Given the complicated nature of video
encoding, having a very stable and full-featured library such as FFmpeg available as open source software is
almost a miracle. Comparing the levels of documentation between these two projects is very unfair, but it
serves as a useful example of where documentation can sometimes be lacking across all types of projects, both
open and closed source. So, let’s look at what documentation it is important to write by considering how you
might approach using a new library.
When you start using some code that you’ve not interacted with before, the first thing that you need is to get
a grasp on the concepts underlying the library. Some libraries are functional, some object orientated. Some
use callbacks, others signals and slots. You also need to know the top level groupings of the elements in the
library so you can narrow your focus to the parts of the library you actually want to use.
Read More...
14 Mar 2011
Last year Nokia started developing their own Python bindings for Qt,
PySide, when they couldn’t persuade Riverbank Computing to relicense
PyQt under a more liberal license. While
developing DjangoDE I made the choice of which library to use
configurable. When running under PyQt everything worked fine, but when using PySide the program hung on exit.
Using gdb to see where it was hanging points to
QFileSystemWatcher, which has the following comment in
the destructor.
Note: To avoid deadlocks on shutdown, all instances of QFileSystemWatcher need to be destroyed
before QCoreApplication. Note that passing QCoreApplication::instance() as the parent object
when creating QFileSystemWatcher is not sufficient.
Read More...
07 Mar 2011
There are a number of tools for checking whether your Python code meets a coding standard. These include
pep8.py, PyChecker
and PyLint. Of these, PyLint is the most comprehensive and is the
tool which I prefer to use as part of
my buildbot checks
that run on every commit.
PyLint works by parsing the Python source code itself and checking things like using variables that aren’t
defined, missing doc strings and a large array of other checks. A downside of PyLint’s comprehensiveness is
that it runs the risk of generating false positives. As it parses the source code itself it struggles with
some of Python’s more dynamic features, in particular
metaclasses, which, unfortunately,
are a key part of Django. In this post I’ll go through the changes I make to the standard PyLint settings to
make it more compatible with Django.
disable=W0403,W0232,E1101
This line disables a few checks entirely. W0403 stops relative imports from generating a warning; whether you
want to disable this or not is really a matter of personal preference. Although I appreciate why there is a
check for this, I think it is a bit too picky. W0232 stops a warning appearing when a class has no __init__
method. Django models will produce this warning, but because they are built by a metaclass there is nothing
wrong with them. Finally, E1101 is generated if you access a member variable that doesn’t exist. Accessing
members such as id or objects on a model will trigger this, so it’s simplest just to disable the check.
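To see why a static checker struggles here, consider this small sketch of a metaclass that attaches attributes when a class is created. It is a simplified stand-in, not Django's actual implementation: `ModelBase`, `Model` and `Article` are hypothetical names, and the Python 3 `metaclass=` syntax is used rather than the older `__metaclass__` attribute.

```python
class ModelBase(type):
    """Hypothetical stand-in for a model metaclass (not Django's real code)."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # These attributes are attached when the class is created. They never
        # appear in the class body, so a tool that only reads the source
        # cannot see them.
        cls.objects = "manager for " + name
        cls.id = None
        return cls


class Model(metaclass=ModelBase):
    pass


class Article(Model):
    # Neither "objects" nor "id" is written here.
    title = "untitled"


print(Article.objects)         # prints "manager for Article"
print(hasattr(Article, "id"))  # prints "True"
```

Both lookups succeed at runtime, yet nothing in the body of `Article` defines `objects` or `id`, which is the kind of access a source-only checker may report as E1101.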
Read More...