13 Oct 2011
We’re nearing the end of our plot to create a Google-beating search engine (in my dreams at least), and in
this post we’ll build the interface to query the index we’ve built up. Like Google’s, the interface is very
simple: just a text box on one page and a list of results on another.
To begin with we just need a page with a query box. To make the page slightly more interesting we’ll also
include the number of pages in the index, and a list of the top documents as ordered by our ranking algorithm.
The templates on this page reference base.html, which provides the boilerplate code needed to make an
HTML page.
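As a rough sketch, the query page’s template might look something like the following. This is my illustration rather than the post’s actual code: base.html comes from the text, but the doc_count and top_pages context variables are assumptions.

```html
{% extends "base.html" %}

{% block content %}
  <form action="/search/" method="get">
    <input type="text" name="q">
    <input type="submit" value="Search">
  </form>

  <p>There are {{ doc_count }} pages in the index.</p>

  <ul>
    {% for page in top_pages %}
      <li><a href="{{ page.url }}">{{ page.url }}</a></li>
    {% endfor %}
  </ul>
{% endblock %}
```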
Read More...
11 Oct 2011
In this post we’ll continue building the backend for our search engine by implementing the algorithm we
designed in the last post for ranking pages. We’ll also build an index of our pages with
Whoosh, a pure-Python full-text indexer and query engine.
To calculate the rank of a page we need to know what other pages link to a given url, and how many links that
page has. The code below is a CouchDB map called page/links_to_url. For each page this will output a
row for each link on the page, with the url linked to as the key and the page’s rank and number of links as the
value.
function (doc) {
    // Emit one row per outbound link: the key is the target url, the
    // value is this page's current rank and its total number of links.
    if (doc.type == "page") {
        for (var i = 0; i < doc.links.length; i++) {
            emit(doc.links[i], [doc.rank, doc.links.length]);
        }
    }
}
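Once the rows for a given url are collected, they can be folded into a PageRank-style score. The sketch below shows the idea, assuming the classic formula with a damping factor; the calculate_rank name and the DAMPING constant are mine, not code from the series.

```python
DAMPING = 0.85  # a common PageRank damping factor (an assumption here)

def calculate_rank(rows):
    """Combine the [rank, number_of_links] values emitted for one url."""
    # Each linking page passes on its own rank split across its links.
    total = sum(rank / num_links for rank, num_links in rows)
    return (1 - DAMPING) + DAMPING * total

# Two pages link to us: one with rank 1.0 and 2 links,
# one with rank 0.5 and 5 links.
print(calculate_rank([[1.0, 2], [0.5, 5]]))
```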
Read More...
06 Oct 2011
In this series I’m showing you how to build a webcrawler and search engine using standard Python based tools
like Django, Celery and Whoosh with a CouchDB backend. In previous posts we created a data structure, parsed
and stored robots.txt, and stored a single webpage as a document. In this post I’ll show you how to
parse out the links from our stored HTML document so we can complete the crawler, and we’ll start calculating
the rank for the pages in our database.
There are several different ways of parsing out the links in a given HTML document. You can just use a regular
expression to pull the urls out, or you can use a more complete, but also more complicated (and slower),
method: parsing the HTML with the standard Python
HTMLParser library, or the wonderful
Beautiful Soup. The point of this series isn’t to
build a complete webcrawler, but to show you the basic building blocks. So, for simplicity’s sake, I’ll use a
regular expression.
link_single_re = re.compile(r"<a[^>]+href='([^']+)'")
link_double_re = re.compile(r'<a[^>]+href="([^"]+)"')
All we need to do is look for an href
attribute in an a
tag. We’ll use two regular expressions to
handle single and double quotes, and then build a list containing all the links in the document.
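Putting the two expressions to work might look like the sketch below; the extract_links helper is my name for it, not necessarily what the post uses.

```python
import re

link_single_re = re.compile(r"<a[^>]+href='([^']+)'")
link_double_re = re.compile(r'<a[^>]+href="([^"]+)"')

def extract_links(content):
    # Gather hrefs quoted with single quotes, then with double quotes.
    links = link_single_re.findall(content)
    links += link_double_re.findall(content)
    return links

html = "<a href='/about'>About</a> <a href=\"http://example.com/\">Example</a>"
print(extract_links(html))  # ['/about', 'http://example.com/']
```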
Read More...
04 Oct 2011
In this series I’ll show you how to build a search engine using standard Python tools like Django, Whoosh and
CouchDB. In this post we’ll start crawling the web and filling our database with the contents of pages.
One of the rules we set down was not to request a page too often. If, by accident, we try to retrieve a page
more than once a week, we don’t want that request to actually make it to the internet. To help prevent this
we’ll extend the Page
class we created in the last post with a function called get_by_url.
This static method will take a url and return the Page object that represents it, retrieving the page if we
don’t already have a copy. You could create this as an independent function, but I prefer to use static
methods to keep things tidy.
We only actually want to retrieve the page from the internet in one of the three tasks that we’re going to
create, so we’ll give get_by_url
a parameter, update,
that enables us to return None
if we don’t have a copy of the page.
@staticmethod
def get_by_url(url, update=True):
    # Look the url up in the page/by_url view.
    r = settings.db.view("page/by_url", key=url)
    if len(r.rows) == 1:
        # We already have a copy; return it if it's still fresh enough.
        doc = Page.load(settings.db, r.rows[0].value)
        if doc.is_valid():
            return doc
    elif not update:
        # No stored copy, and we're not allowed to fetch one.
        return None
    else:
        # No stored copy, so fetch the page from the internet.
        doc = Page(url=url)
        doc.update()
        return doc
The key line in the static method is doc.update().
This calls the function that retrieves the page and
makes sure we respect the robots.txt
file. Let’s look at what happens in that function.
Read More...
29 Sep 2011
In this series
I’ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In
this post we’ll begin by creating the data structure for storing the pages in the database, and write the
first parts of the webcrawler.
CouchDB’s Python library has a simple ORM system
that makes it easy to convert between the JSON objects stored in the database and a Python object.
To create the class you just need to specify the names of the fields and their types. So, what does a search
engine need to store? The url is an obvious one, as is the content of the page. We also need to know when we
last accessed the page. To make things easier we’ll also have a list of the urls that the page links to. One
of the great advantages of a database like CouchDB is that we don’t need to create a separate table to hold
the links; we can just include them directly in the main document. To help return the best pages we’ll use a
PageRank-like algorithm to rank the pages, so we also need
to store that rank. Finally, as is good practice in CouchDB, we’ll give the document a type
field so
we can write views that only target this document type.
Read More...