Postgres Full Text Search on Django JSON Fields with KT Expressions

The Django developers have worked hard to integrate Postgres full text search into the web framework’s ORM. Unfortunately, there are few quirks with the system, especially when working with JSON model fields. In this post, we’ll discuss those issues and how to work around them.

Search Lookup Versus SearchVector

If you have a simple Document model class, such as:

class Document(models.Model):
    title = models.CharField(default="")
    body = models.CharField(default="")

You can perform a basic full text search of body with:

Document.objects.filter(body__search="Example")

Which produces the SQL:

SELECT "document"."id", "document"."title", "document"."body"
FROM "document"
WHERE to_tsvector(COALESCE("document"."body", '')) @@ (plainto_tsquery('Example'));

The above will match any document containing the case insensitive word stem exampl. This includes words such as examples and EXAMPLING.

To perform the above query, you’ll need to be using Postgres as your database. You also need to add django.contrib.postgres to your INSTALLED_APPS or you will receive a django.core.exceptions.FieldError: Unsupported lookup 'search' error message when performing the above query.

To improve search results, you can implement ranking, weighting, and different languages using SearchVector:

title_vector = SearchVector("title", config="French", weight="A")
body_vector = SearchVector("body", config="French", weight="B")
query = SearchQuery("Exemple", config="French")
results = (
    Document.objects.annotate(
        rank=SearchRank(title_vector + body_vector, query)
    )
    .filter(rank__gte=0.1)
    .order_by("-rank")
)

If you’re changing the language, make sure to set the config for both SearchVector and SearchQuery. More features and examples are shown in the Django documentation for full text search.

JSON Fields

The previous examples used a model with CharField content. We could also put all those fields into a JSONfield, allowing us to extend the Document without additional migrations:

class Document(models.Model):
    content = models.JSONField(default=dict)

For instance, by adding an author field:

Document.objects.create(
    content={
        "title": "Example Title",
        "author": "Myself",
        "body": "The body of text that we would want to search.",
    }
)

These fields can be filtered by using a double underscore __ path lookup such as:

Document.objects.filter(content__author="Myself")

The ORM translates this to a query using the -> operator, which extracts the jsonb object at the given key, and by casting the filter term to jsonb before testing for a match:

SELECT "document"."id", "document"."content"
FROM "document"
WHERE ("document"."content" -> 'author') = '"Myself"'::JSONB;

Unfortunately, this lookup uses the same double underscore syntax as the first search lookup we used before. Attempting a basic full text search with search results in a nested JSON lookup using the #> operator instead of the full text lookup we expected:

Document.objects.filter(content__body__search="Example")

SELECT "document"."id", "document"."content"
FROM "document"
WHERE ("document"."content" #> '{body,search}') = '"Example"'::JSONB;

Thankfully, the above can be fixed with Django 4.2’s newly introduced KT expression.

Document.objects.annotate(body=KT("content__body")).filter(body__search="Example")

A bit unwieldy, but it generates the SQL we expect:

SELECT "document"."id", "document"."content",
       ("document"."content" ->> 'body') AS "body"
FROM "document"
WHERE to_tsvector(coalesce(("document"."content" ->> 'body'), '')) @@ (plainto_tsquery('Example'));

Bad Tokenization

When using SearchVector, it may appear that the KT expression isn’t needed:

title_vector = SearchVector("content__title", weight="A")
body_vector = SearchVector("content__body", weight="B")
query = SearchQuery("Example")
results = (
    Document.objects.annotate(
        rank=SearchRank(title_vector + body_vector, query)
    )
    .filter(rank__gte=0.1)
    .order_by("-rank")
)

The above may even return the results you expect, but there is a subtle bug hiding in its SQL:

SELECT "document"."id", "document"."content",
ts_rank((setweight(
        to_tsvector(coalesce((("document"."content" -> 'title'))::text, '')), 'A') || setweight(
        to_tsvector(coalesce((("document"."content" -> 'body'))::text, '')), 'B')), plainto_tsquery('Example')) AS "rank"
FROM "document"
WHERE ts_rank((setweight(
    to_tsvector(coalesce((("document"."content" -> 'title'))::text, '')), 'A') || setweight(
    to_tsvector(coalesce((("document"."content" -> 'body'))::text, '')), 'B')), plainto_tsquery('Example')) >= 0.1
ORDER BY 3 DESC;

The SearchVector is generating its expression using the -> operator, as opposed to the KT expression’s ->>. The -> operator will return jsonb, which then needs to be cast to text in order to be used in to_tsvector, while ->> directly returns a text type.

These are not equivalent operations, as you can see from the following example:

SELECT to_tsvector(E'A string of \ntext with a newline');
          to_tsvector
--------------------------------
 'newlin':7 'string':2 'text':4

SELECT to_tsvector('"A string of \ntext with a newline"'::jsonb::text);
           to_tsvector
---------------------------------
 'newlin':7 'ntext':4 'string':2

If you’re using SearchVector with a JSONField, it’s possible that you’ve been generating the wrong stem words. Thankfully, we can also correct this with KT expressions:

vector = SearchVector("title", weight="A") + SearchVector("body", weight="B")
query = SearchQuery("Example")

results = (
    Document.objects.annotate(
        title=KT("content__title"),
        body=KT("content__body"),
        rank=SearchRank(vector, query),
    )
    .filter(rank__gte=0.1)
    .order_by("-rank")
)

Which produces SQL that uses the ->> operator:

SELECT "document"."id", "document"."content",
("document"."content" ->> 'title') AS "title",
("document"."content" ->> 'body') AS "body",
ts_rank((setweight(
    to_tsvector(COALESCE(("document"."content" ->> 'title'), '')), 'A') || setweight(
    to_tsvector(COALESCE(("document"."content" ->> 'body'), '')), 'B')),
plainto_tsquery('Example')) AS "rank"
FROM "document"
WHERE ts_rank((setweight(
    to_tsvector(COALESCE(("document"."content" ->> 'title'), '')), 'A') || setweight(
    to_tsvector(COALESCE(("document"."content" ->> 'body'), '')), 'B')), plainto_tsquery('Example')) >= 0.1
ORDER BY 5 DESC;

Other Common Mistakes

Even with the correct syntax, you may be disappointed with your results for certain search terms. In addition to creating stem words, the default english dictionary will drop commonly used words, also known as stop words. For instance:

SELECT to_tsvector('english', 'A string with some of the example stop words');
               to_tsvector
-----------------------------------------
 'exampl':7 'stop':8 'string':2 'word':9

For more info on stem and stop words, use ts_debug:

SELECT * FROM ts_debug('english', 'A string with some of the example stop words');

   alias   |   description   |  token  |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+---------+----------------+--------------+----------
 asciiword | Word, all ASCII | A       | {english_stem} | english_stem | {}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | string  | {english_stem} | english_stem | {string}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | with    | {english_stem} | english_stem | {}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | some    | {english_stem} | english_stem | {}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | of      | {english_stem} | english_stem | {}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | the     | {english_stem} | english_stem | {}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | example | {english_stem} | english_stem | {exampl}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | stop    | {english_stem} | english_stem | {stop}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | words   | {english_stem} | english_stem | {word}

If you need to search strings with relevant stop words, you can also use the built-in simple dictionary:

SELECT to_tsvector('simple', 'A string with some of the example stop words');
                                   to_tsvector
----------------------------------------------------------------------------------
 'a':1 'example':7 'of':5 'some':4 'stop':8 'string':2 'the':6 'with':3 'words':9

Note that this also disables stemming, so stem variants, such as plurals and verbs, will no longer match. For more information on dictionaries, including how to create your own, see the Postgres documentation.

You can see a full list of available dictionaries on your Postgres installation with \dFd and the parameter you’d need to pass to Django’s config option with SELECT cfgname FROM pg_ts_config;

Happy hacking!