Searching Stemmed Fields With Whoosh


Whoosh is quite a nice pure-Python full-text search engine. While it is actively developed and suitable for production use, there are still some rough edges. One problem that stumped me for a while was searching stemmed fields.

Stemming is where you strip the endings off words, such as the ‘ings’ in the word ‘endings’. This reduces the accuracy of searches but greatly increases the chances of users finding something related to what they were looking for. To create a stemmed field you need to tell Whoosh to use the StemmingAnalyzer, as shown in the schema definition below.

from whoosh.analysis import StemmingAnalyzer
from whoosh.fields import Schema, TEXT, ID

# "text" is stemmed at index time; "id" is stored so it comes back with results.
schema = Schema(id=ID(stored=True, unique=True),
                text=TEXT(analyzer=StemmingAnalyzer()))

Using the StemmingAnalyzer will cause Whoosh to stem every word before it is added to the index. If you use the shortcut search function to search with a word that should be stemmed, it will return no results: the unstemmed word does not exist in the index, even though it was included in the data that was indexed.

To correctly search a stemmed index you must parse the query and tell the parser to use the Variations term class. This causes the words in the query to also be stemmed, so they correctly match words in the stemmed index.

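from whoosh.qparser import QueryParser
from whoosh.query import Variations

# ix is the index opened elsewhere; query is the user's search string.
searcher = ix.searcher()
qp = QueryParser("text", schema=schema, termclass=Variations)
parsed = qp.parse(query)
docs = searcher.search(parsed)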
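
To see why the raw word misses, here is a minimal sketch that does not use Whoosh at all. The toy_stem, analyze and build_index helpers are made up for illustration, and the suffix rules are far cruder than whatever stemmer Whoosh really uses. The only point is that the index stores stems, so the query has to go through the same analysis before it is looked up.

```python
import re
from collections import defaultdict

# Hypothetical toy rule set, not the algorithm Whoosh uses.
SUFFIXES = ("ings", "ing", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [toy_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def build_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


docs = {"1": "Word endings are removed", "2": "The search ended"}
index = build_index(docs)

# Looking up the raw word misses, because only the stem was stored.
raw_hits = index.get("endings", set())

# Running the query through the same analysis finds the document.
stemmed_hits = set()
for term in analyze("endings"):
    stemmed_hits |= index.get(term, set())
```

Here raw_hits comes back empty while stemmed_hits contains document "1", which is the same behaviour I was seeing with the real index.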

Comments

Hi,

This post is pretty old now, so maybe what I say here was not correct at the time, but as of now, once you set up a stemmed field and pass the right schema to the QueryParser, you do not need to set the termclass to whoosh.query.Variations.

In fact, if you use Variations of the user's query, you most likely wouldn't use stemming at indexing time. It's one or the other: either stem at indexing time, or use the morphological variations of the query at query time.

The results are the same, except that using variations costs more computing power at query time, since your Python application needs to compare every variation of each query term to the terms in the indexed field.

In addition, if you want to use the Variations termclass on the terms in the user's query instead of stemming the fields while indexing, you need to import the Variations class from whoosh.query before using it, like this: `from whoosh.query import Variations` ;)

Anyway, thanks for this post. It helped point me in the right direction now that I'm discovering Whoosh.

Cheers

Cédric Beuzit

14 May 2013