Beating Google With CouchDB, Celery and Whoosh (Part 2)

In this series I’ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In this post we’ll begin by creating the data structure for storing the pages in the database, and write the first parts of the webcrawler.

CouchDB’s Python library has a simple ORM system that makes it easy to convert between the JSON objects stored in the database and a Python object.

To create the class you just need to specify the names of the fields, and their type. So, what do a search engine need to store? The url is an obvious one, as is the content of the page. We also need to know when we last accessed the page. To make things easier we’ll also have a list of the urls that the page links to. One of the great advantages of a database like CouchDB is that we don’t need to create a separate table to hold the links, we can just include them directly in the main document. To help return the best pages we’ll use a page rank like algorithm to rank the page, so we also need to store that rank. Finally, as is good practice on CouchDB we’ll give the document a type field so we can write views that only target this document type.

Read More...

Beating Google With CouchDB, Celery and Whoosh (Part 1)

Ok, let’s get this out of the way right at the start - the title is a huge overstatement. This series of posts will show you how to create a search engine using standard Python tools like Django, Celery and Whoosh with CouchDB as the backend.

Celery is a message passing library that makes it really easy to run background tasks and to spread them across a number of nodes. The most recent release added the NoSQL database CouchDB as a possible backend. I’m a huge fan of CouchDB, and the idea of running both my database and message passing backend on the same software really appealed to me. Unfortunately the documentation doesn’t make it clear what you need to do to get CouchDB working, and what the downsides are. I decided to write this series partly to explain how Celery and CouchDB work, but also to experiment with using them together.

In this series I’m going to talk about setting up Celery to work with Django, using CouchDB as a backend. I’m also going to show you how to use Celery to create a web-crawler. We’ll then index the crawled pages using Whoosh and use a PageRank-like algorithm to help rank the results. Finally, we’ll attach a simple Django frontend to the search engine for querying it.

Read More...

Crowd Sourcing Mapping

Recently Google announced that they were making their crowd sourcing mapping tools available to users in the United States. This tool lets uses edit Google Maps, adding businesses and even roads, railways and rivers. This raises interesting questions about whether wisdom of the crowd can be applied to data that requires a high degree of accuracy.

Open Street Map has been doing this since 2004, and has put together an amazing resource of free map data, but only recently has Google begun to allow people to edit its maps for large parts of the world.

Accurate mapping data is terribly important. While the majority of Google Maps queries are likely to be “how do I get from my house to my aunt’s?” some are much more important. A war was almost caused when the border between Nicaraguan and Costa Rica was incorrectly placed. While a war is a little far-fetched, it’s not hard to imagine how a mistake on map could cost someone’s life in a medical emergency.

Read More...

The Importance Of Documentation

Recently I’ve been working a couple of open source projects and as part of them I’ve been using some libraries. In order to use a library though, you need to understand how it is designed, what function calls are available and what those functions do. The two libraries I’ve been using are Qt and libavformat, which is part of FFmpeg and they show two ends of the documentation spectrum.

Now, it’s important to note that Qt is a massive framework owned by Nokia, with hundreds of people working on it full-time including a number of people dedicated to documentation. FFmpeg on the other hand is a purely volunteer effort with only a small number of developers working on it. Given the complicated nature of video encoding to have a very stable and feature-full library such as FFmpeg available as open source software is almost a miracle. Comparing the levels of documentation between these two projects is very unfair, but it serve as a useful example of where documentation can sometimes be lacking across all types of projects, both open and closed source.n So, lets look at what documentation it is important to write by considering how you might approach using a new library.

When you start using some code that you’ve not interacted with before the first thing that you need is to get a grasp on the concepts underlying the library. Some libraries are functional, some object orientated. Some use callbacks, others signals and slots. You also need to know the top level groupings of the elements in the library so you can narrow your focus that parts of the library you actually want to use.

Read More...

Deadlock On Exit With PySide And QFileSystemWatcher

Last year Nokia started developing their own Python bindings for Qt, PySide, when they couldn’t persuade Riverbank Computing to relicense PyQt under a more liberal license. While developing DjangoDE I made the choice of which library to use configurable. When running under PyQt everything worked fine, but when using PySide the program hung on exit.

Using gdb to see where it was hanging points to QFileSystemWatcher, which has the following comment in the destructor.

Note: To avoid deadlocks on shutdown, all instances of QFileSystemWatcher need to be destroyed before QCoreApplication. Note that passing QCoreApplication::instance() as the parent object when creating QFileSystemWatcher is not sufficient.

Read More...