Recently, while working on a query parser feature in Weblate, I came across a search engine library called Whoosh. It provides features such as text indexing, search query parsing, and scoring algorithms. A good thing about this library is that most of these features are customizable and extensible.
Now, the feature I was trying to implement is an exact search query: the backend searches for an exact match of the query text provided to it, instead of doing the normal substring search. Whoosh provides a plugin for regular expressions, whoosh.qparser.RegexPlugin(), so we could technically write a regex to do the exact match. But a regex search will generally perform worse than a simple string comparison.
So, one way to add a new kind of query parsing is to create a custom Whoosh plugin, and that's what this post is going to be about.
Simple Whoosh Plugin
In some cases, you will not need a complicated plugin and just want to extend an existing plugin so that it matches a different kind of query. For example, let's say you want to extend SingleQuotePlugin to parse queries wrapped in either single quotes or double quotes.
    import whoosh.qparser


    class QuotePlugin(whoosh.qparser.SingleQuotePlugin):
        """Single and double quotes to specify a term."""

        # Accept either quote character around the captured text.
        expr = r"(^|(?<=\W))['\"](?P<text>.*?)['\"](?=\s|\]|[)}]|$)"
In the above example, QuotePlugin extends the already existing SingleQuotePlugin class and only overrides the expression used to parse the query. The expression, stored in the variable expr, is a regular expression whose ?P<text> named group denotes the TermQuery, that is, the final term or terms searched for in the database. So in the above regex, we tell the parser to accept any query in which the TermQuery is wrapped in single quotes or double quotes.
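To see what the expression captures, it can be tried outside Whoosh with Python's re module. This is only a rough sketch: Whoosh applies the pattern position by position while tagging a query, whereas re.search is used here for brevity, and the helper name quoted_text is made up for this illustration.

```python
import re

# Same pattern as QuotePlugin.expr above.
QUOTE_EXPR = re.compile(r"(^|(?<=\W))['\"](?P<text>.*?)['\"](?=\s|\]|[)}]|$)")


def quoted_text(query):
    """Return the text captured by the named group, or None if nothing matches.

    Hypothetical helper for illustration; it is not part of Whoosh.
    """
    match = QUOTE_EXPR.search(query)
    return match.group("text") if match else None


print(quoted_text("'hello world'"))  # term wrapped in single quotes
print(quoted_text('"hello world"'))  # term wrapped in double quotes
print(quoted_text("it's fine"))      # apostrophe inside a word
print(quoted_text("'hello\""))       # opening and closing quotes differ
```

The first two calls capture hello world, and the apostrophe inside a word is not treated as a quote because the pattern requires a non-word character (or the start of the string) before it. Since both character classes accept either quote, a term opened with one kind of quote and closed with the other is still captured.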
A query class is the class that the final parsed term will be an instance of. Unless specified otherwise, it is usually Term. So if we want our plugin to parse the query and produce an instance of a custom class, we need to define a custom query class.
    import whoosh.query


    class Exact(whoosh.query.Term):
        """Class for queries with exact operator."""
So, as you can see, a simple class that just extends whoosh.query.Term is enough: when checking the parsed terms, we get the term as an instance of Exact. That helps us differentiate the query from a normal Term query.
Custom Whoosh Plugin
After writing the query class, we will need to write the custom plugin class.
    class ExactPlugin(whoosh.qparser.TaggingPlugin):
        """Exact match plugin with quotes to specify an exact term."""

        class ExactNode(whoosh.qparser.syntax.TextNode):
            qclass = Exact

            def r(self):
                return "Exact %r" % self.text

        expr = r"\=(^|(?<=\W))(['\"]?)(?P<text>.*?)\2(?=\s|\]|[)}]|$)"
        nodetype = ExactNode
In the above example, unlike the simple case, we extend TaggingPlugin instead of another pre-defined plugin. Most of the pre-defined plugins in Whoosh also extend TaggingPlugin, so it is a good fit as a parent class.
Then, we create an ExactNode class and assign it as the node type of the custom plugin. A node type class basically defines the query class to be used in this custom plugin, along with various representations and properties of the parsed node. Its qclass attribute holds the query class created before, so that the final parsed term becomes an Exact instance.
Apart from that, we have expr, which contains the regex that parses the query term, just like in the simple example.
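The exact-match expression can also be explored on its own with the re module. As before, this is a sketch rather than a description of how Whoosh runs the pattern internally: re.search stands in for the parser's tagging step, and exact_text is a hypothetical helper name.

```python
import re

# Same pattern as ExactPlugin.expr above.
EXACT_EXPR = re.compile(r"\=(^|(?<=\W))(['\"]?)(?P<text>.*?)\2(?=\s|\]|[)}]|$)")


def exact_text(query):
    """Return the term following '=', or None if the query has no '=' prefix.

    Hypothetical helper for illustration; it is not part of Whoosh.
    """
    match = EXACT_EXPR.search(query)
    return match.group("text") if match else None


print(exact_text("=hello"))            # unquoted term
print(exact_text('="hello world"'))    # quoted term containing a space
print(exact_text("='hello world'"))    # single quotes work the same way
print(exact_text("=hello world"))      # unquoted term stops at whitespace
print(exact_text("hello"))             # no '=' prefix, so no match
```

The optional opening quote is captured in the second group, and the \2 backreference then requires the same character as the closing quote. Without quotes, the lazy text group stops at the first whitespace, so only hello is captured from =hello world.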
After creating the custom plugin, you can:
- add this plugin to the list of plugins defined in the whoosh query parser class
- use the query class to make an isinstance() check when making database queries
- check for the node type in the different nodes used by the parser