Beating Google With CouchDB, Celery and Whoosh (Part 8)


In the previous seven posts I’ve gone through all the stages in building a search engine. If you want to run it yourself, and perhaps tweak it to make it even better, you can: I’ve put the code up on GitHub. All I ask is that if you beat Google, you give me a credit somewhere.

When you’ve downloaded the code it should prove quite simple to get running. First you’ll need to edit settings.py. It should work out of the box, but you should change the USER_AGENT setting to something unique. You may also want to adjust some of the other settings, such as the database connection or the CouchDB URLs. To set up the CouchDB views, type python manage.py update_couchdb.
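For reference, the crawler-specific part of settings.py might look something like the sketch below. USER_AGENT is the setting named above, but the names of the CouchDB settings are my assumption and may differ in the actual code:

# settings.py (sketch; only USER_AGENT is confirmed above, the CouchDB names are assumptions)
USER_AGENT = "MyExperimentalCrawler/0.1 (+http://example.com/bot)"  # make this unique to you

COUCHDB_SERVER = "http://localhost:5984/"  # hypothetical name for the CouchDB URL setting
COUCHDB_DATABASE = "search_engine"         # hypothetical name for the database setting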

Next, to run the Celery daemons you’ll need to type the following two commands:

python manage.py celeryd -Q retrieve
python manage.py celeryd -Q process

This sets up the daemons to monitor the two queues and process the tasks. As mentioned in a previous post, two queues are needed to prevent one set of tasks from swamping the other.
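Here’s a minimal sketch of how that routing might be declared in settings.py, assuming the tasks are called retrieve_page and process_page (the real task names in the code may differ):

# Route each task to its own queue so slow retrievals can't starve processing.
# The task paths below are assumptions for illustration.
CELERY_ROUTES = {
    "crawler.tasks.retrieve_page": {"queue": "retrieve"},
    "crawler.tasks.process_page": {"queue": "process"},
}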

Next you’ll need to run the full text indexer, and then the web server:

python manage.py index_update
python manage.py runserver
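Under the hood the indexer is a Whoosh update pass. As a rough sketch of what that kind of loop looks like (the schema and the stand-in document list here are my own; the real command reads pages from CouchDB):

import os
from whoosh import index
from whoosh.fields import Schema, ID, TEXT

# Minimal schema: a unique URL plus the page text (an assumption, not the project's real schema).
schema = Schema(url=ID(stored=True, unique=True), content=TEXT)

if index.exists_in("index"):
    ix = index.open_dir("index")
else:
    os.mkdir("index")
    ix = index.create_in("index", schema)

# Stand-in for documents fetched from CouchDB since the last run.
documents = [{"url": u"http://example.com/", "content": u"Example page text"}]

writer = ix.writer()
for doc in documents:
    # update_document replaces any existing entry with the same unique url field.
    writer.update_document(url=doc["url"], content=doc["content"])
writer.commit()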

At this point you should have several processes running, but not doing anything. To kick things off we need to inject one or more URLs into the system. You can do this with another management command: python manage.py start_crawl http://url. You can run this command as many times as you like to seed your crawler with different pages. In my experience the average page has around 100 links on it, so it shouldn’t take long before your crawler is scampering off to crawl many more pages than you initially seeded it with.
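If you’re curious what that command amounts to, a Django management command that seeds the queue can be as small as the sketch below; the imported task is hypothetical and stands in for whatever the real code calls its retrieval task:

from django.core.management.base import BaseCommand

from crawler.tasks import retrieve_page  # hypothetical module and task name

class Command(BaseCommand):
    args = "<url url ...>"
    help = "Seed the crawler with one or more starting URLs"

    def handle(self, *args, **options):
        for url in args:
            # Drop the URL onto the retrieve queue; the daemons do the rest.
            retrieve_page.delay(url)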

So, how well does Celery work with CouchDB as a backend? The answer is mixed. It certainly makes it very easy to get started, as you can just point it at the server and it works. The drawback, though, and it’s a real show stopper, is that each Celery daemon polls the database looking for new tasks. As you scale up the number of daemons, this polling will quickly bring your server to its knees and prevent it from doing any useful work.

The disappointing thing is that CouchDB provides a _changes feed that Celery could watch rather than polling. Hopefully this will get fixed in a future version. For now though, for anything other than an experimental-scale installation, RabbitMQ is a much better bet.
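Switching brokers is a settings change rather than a code change. Assuming a Celery version that understands the BROKER_URL setting, the two configurations look something like this (the credentials shown are the RabbitMQ defaults):

# CouchDB as broker: zero extra setup, but workers poll the database for tasks.
# BROKER_URL = "couchdb://localhost:5984/celery"

# RabbitMQ as broker: tasks are pushed to workers, so no polling.
BROKER_URL = "amqp://guest:guest@localhost:5672//"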

Hopefully this series has been useful to you, and please do download the code and experiment with it!

Want to read more like this? Follow me with your favourite feed reader (e.g. Feedly), or subscribe to my Substack newsletter.
