The Django developers have worked hard to integrate Postgres full text search into the web framework’s ORM. Unfortunately, there are few quirks with the system, especially when working with JSON model fields. In this post, we’ll discuss those issues and how to work around them.
Search Lookup Versus SearchVector
If you have a simple Document
model class, such as:
class Document(models.Model):
title = models.CharField(default="")
body = models.CharField(default="")
You can perform a basic full text search of body
with:
Document.objects.filter(body__search="Example")
Which produces the SQL:
SELECT "document"."id", "document"."title", "document"."body"
FROM "document"
WHERE to_tsvector(COALESCE("document"."body", '')) @@ (plainto_tsquery('Example'));
The above will match any document containing the case insensitive word stem exampl
. This includes words such as examples
and EXAMPLING
.
To perform the above query, you’ll need to be using Postgres as your database. You also need to add django.contrib.postgres
to your INSTALLED_APPS
or you will receive a django.core.exceptions.FieldError: Unsupported lookup 'search'
error message when performing the above query.
To improve search results, you can implement ranking, weighting, and different languages using SearchVector
:
title_vector = SearchVector("title", config="French", weight="A")
body_vector = SearchVector("body", config="French", weight="B")
query = SearchQuery("Exemple", config="French")
results = (
Document.objects.annotate(
rank=SearchRank(title_vector + body_vector, query)
)
.filter(rank__gte=0.1)
.order_by("-rank")
)
If you’re changing the language, make sure to set the config for both SearchVector
and SearchQuery
. More features and examples are shown in the Django documentation for full text search.
JSON Fields
The previous examples used a model with CharField
content. We could also put all those fields into a JSONfield
, allowing us to extend the Document
without additional migrations:
class Document(models.Model):
content = models.JSONField(default=dict)
For instance, by adding an author
field:
Document.objects.create(
content={
"title": "Example Title",
"author": "Myself",
"body": "The body of text that we would want to search.",
}
)
These fields can be filtered by using a double underscore __
path lookup such as:
Document.objects.filter(content__author="Myself")
The ORM translates this to a query using the ->
operator, which extracts the jsonb
object at the given key, and by casting the filter term to jsonb
before testing for a match:
SELECT "document"."id", "document"."content"
FROM "document"
WHERE ("document"."content" -> 'author') = '"Myself"'::JSONB;
Unfortunately, this lookup uses the same double underscore syntax as the first search
lookup we used before. Attempting a basic full text search with search
results in a nested JSON lookup using the #>
operator instead of the full text lookup we expected:
Document.objects.filter(content__body__search="Example")
SELECT "document"."id", "document"."content"
FROM "document"
WHERE ("document"."content" #> '{body,search}') = '"Example"'::JSONB;
Thankfully, the above can be fixed with Django 4.2’s newly introduced KT
expression.
Document.objects.annotate(body=KT("content__body")).filter(body__search="Example")
A bit unwieldy, but it generates the SQL we expect:
SELECT "document"."id", "document"."content",
("document"."content" ->> 'body') AS "body"
FROM "document"
WHERE to_tsvector(coalesce(("document"."content" ->> 'body'), '')) @@ (plainto_tsquery('Example'));
Bad Tokenization
When using SearchVector
, it may appear that the KT
expression isn’t needed:
title_vector = SearchVector("content__title", weight="A")
body_vector = SearchVector("content__body", weight="B")
query = SearchQuery("Example")
results = (
Document.objects.annotate(
rank=SearchRank(title_vector + body_vector, query)
)
.filter(rank__gte=0.1)
.order_by("-rank")
)
The above may even return the results you expect, but there is a subtle bug hiding in its SQL:
SELECT "document"."id", "document"."content",
ts_rank((setweight(
to_tsvector(coalesce((("document"."content" -> 'title'))::text, '')), 'A') || setweight(
to_tsvector(coalesce((("document"."content" -> 'body'))::text, '')), 'B')), plainto_tsquery('Example')) AS "rank"
FROM "document"
WHERE ts_rank((setweight(
to_tsvector(coalesce((("document"."content" -> 'title'))::text, '')), 'A') || setweight(
to_tsvector(coalesce((("document"."content" -> 'body'))::text, '')), 'B')), plainto_tsquery('Example')) >= 0.1
ORDER BY 3 DESC;
The SearchVector
is generating its expression using the ->
operator, as opposed to the KT
expression’s ->>
. The ->
operator will return jsonb
, which then needs to be cast to text
in order to be used in to_tsvector
, while ->>
directly returns a text
type.
These are not equivalent operations, as you can see from the following example:
SELECT to_tsvector(E'A string of \ntext with a newline');
to_tsvector
--------------------------------
'newlin':7 'string':2 'text':4
SELECT to_tsvector('"A string of \ntext with a newline"'::jsonb::text);
to_tsvector
---------------------------------
'newlin':7 'ntext':4 'string':2
If you’re using SearchVector
with a JSONField
, it’s possible that you’ve been generating the wrong stem words. Thankfully, we can also correct this with KT
expressions:
vector = SearchVector("title", weight="A") + SearchVector("body", weight="B")
query = SearchQuery("Example")
results = (
Document.objects.annotate(
title=KT("content__title"),
body=KT("content__body"),
rank=SearchRank(vector, query),
)
.filter(rank__gte=0.1)
.order_by("-rank")
)
Which produces SQL that uses the ->>
operator:
SELECT "document"."id", "document"."content",
("document"."content" ->> 'title') AS "title",
("document"."content" ->> 'body') AS "body",
ts_rank((setweight(
to_tsvector(COALESCE(("document"."content" ->> 'title'), '')), 'A') || setweight(
to_tsvector(COALESCE(("document"."content" ->> 'body'), '')), 'B')),
plainto_tsquery('Example')) AS "rank"
FROM "document"
WHERE ts_rank((setweight(
to_tsvector(COALESCE(("document"."content" ->> 'title'), '')), 'A') || setweight(
to_tsvector(COALESCE(("document"."content" ->> 'body'), '')), 'B')), plainto_tsquery('Example')) >= 0.1
ORDER BY 5 DESC;
Other Common Mistakes
Even with the correct syntax, you may be disappointed with your results for certain search terms. In addition to creating stem words, the default english
dictionary will drop commonly used words, also known as stop words. For instance:
SELECT to_tsvector('english', 'A string with some of the example stop words');
to_tsvector
-----------------------------------------
'exampl':7 'stop':8 'string':2 'word':9
For more info on stem and stop words, use ts_debug
:
SELECT * FROM ts_debug('english', 'A string with some of the example stop words');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+---------+----------------+--------------+----------
asciiword | Word, all ASCII | A | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | string | {english_stem} | english_stem | {string}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | with | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | some | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | of | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | the | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | example | {english_stem} | english_stem | {exampl}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | stop | {english_stem} | english_stem | {stop}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | words | {english_stem} | english_stem | {word}
If you need to search strings with relevant stop words, you can also use the built-in simple
dictionary:
SELECT to_tsvector('simple', 'A string with some of the example stop words');
to_tsvector
----------------------------------------------------------------------------------
'a':1 'example':7 'of':5 'some':4 'stop':8 'string':2 'the':6 'with':3 'words':9
Note that this also disables stemming, so stem variants, such as plurals and verbs, will no longer match. For more information on dictionaries, including how to create your own, see the Postgres documentation.
You can see a full list of available dictionaries on your Postgres installation with \dFd
and the parameter you’d need to pass to Django’s config option with SELECT cfgname FROM pg_ts_config;
Happy hacking!