Filtering on annotations in Django

Django's annotate method can be useful for counting particular attributes on every member of a queryset (i.e. count the number of books an author has). Unfortunately you can't use the ORM to do any filtering on those attribute. This would be useful for only considering an attribute if it were above a certain value, or belonged to a certain object (i.e. count the number of books with the tag 'Horror'). This post shows how to make use of the extra method Django provides to create filtered annotations

Update

This is an old post. There are probably better ways to go about filtering on annotations these days. Buyer beware!

More than once in the last couple of weeks I have found myself in a situation where I need to annotate a queryset in Django, but the annotation requires some sort of filtering beforehand. Ideally, this would be achievable with the Django ORM by doing something like:

# Get all authors along with the total number of
# Books they have written that feature a particular Tag
authors = Author \
  .objects\
  .all()\
  .annotate(reviews=Count(books__tags__in=['Fiction', 'Literature']))

Unfortunately this is not possible. Django’s annotate method only allows you to count all instances of a particular field. We need to come up with a custom solution. Luckily we can make use of the extra method of the QuerySet API to insert our own custom SQL.

As an example to illustrate, let’s pretend that we have a music database of Artists and their Albums, where every album can be assigned one or more Genres. We want to offer the ability to get a list of all Artists ordered by the number of Albums they have in one or more particular Genres. For example “show me which artists have the most albums in the genre Folk or Indie”.

To get started, let’s look at the layout of our models:

class Artist(models.Model):
  name = models.CharField(max_length=512)
 
class Album(models.Model):
  name = models.CharField(max_length=512)
  artist = models.ForeignKey("Artist")
  genres = models.ManyToManyField("Genre")
 
class Genre(models.Model):
  name = models.CharField(max_length=512)

Here are some dummy fixtures that I’ve created:

Screenshot of terminal with text

Possible Approach

The first thought might be to simply iterate over the queryset of artists and manually add the count of all the albums with a particular genre or genres:

genres = Genre.objects.filter(name__in=["Folk", "Indie"])
ids = [g.id for g in genres]
artists = Artist.objects.all()
for artist in artists:
  artist.album_count = artist.album_set.filter(genres__in=ids).count()
artists.order_by("-album_count")

The limitation here is that it will generate an extra query for every artist in your DB. This is fine if you only have a couple of artists, but it will eventually become a problem as you add more data.

Screenshot of terminal with text

Solution

A much better approach is to make use of Django’s extra queryset method. This method allows us to inject extra SQL to perform operations that might otherwise not be possible through the ORM.

Specifically, we are interested in adding a query that will COUNT() the number of albums for each artist that have a relationship via our M2M table with the genres “Folk” and “Indie”:

genres = Genre.objects.filter(name__in=["Folk", "Indie"])
artists = Artist.objects.all().extra(select = {
  "album_count" : """
  SELECT COUNT(*)
  FROM sandbox_album
    JOIN sandbox_album_genres on sandbox_album_genres.album_id = sandbox_album.id
  WHERE sandbox_album.artist_id = sandbox_artist.id
  AND sandbox_album_genres.genre_id IN %s
  """ % "(%s)" % ",".join([str(g.id) for g in genres.all()])
}).order_by("-album_count",)

generating the following SQL:

SELECT (
  SELECT COUNT(*)
  FROM sandbox_album
    JOIN sandbox_album_genres
      on sandbox_album_genres.album_id = sandbox_album.id
  WHERE sandbox_album.artist_id = sandbox_artist.id
  AND sandbox_album_genres.genre_id IN (3,4)
) AS "album_count",
"sandbox_artist"."id", "sandbox_artist"."name"
FROM "sandbox_artist" ORDER BY "album_count" DESC

We can see that this only produces a single query:

Screenshot of terminal with text

yet gives us the correct results:

Screenshot of terminal with text

Let’s extend our example and imagine that we want to also track TourDates for artists. Every TourDate has a Country. We want to be able to get a list of all artists as well as the number of TourDates for one or more Countries. For example “which artists have the most upcoming concerts in Ireland?”.

Here are the new models:

class Country(models.Model):
  name = models.CharField(max_length=100)
 
class TourDate(models.Model):
  artist = models.ForeignKey("Artist")
  date = models.DateField()
  country = models.ForeignKey("Country")

Here is our custom SQL:

ireland = Country.objects.get(name="Ireland")
artists = Artist.objects.all().extra(select = {
  "tourdate_count" : """
  SELECT COUNT(*)
  FROM sandbox_tourdate
    JOIN sandbox_country on sandbox_tourdate.country_id = sandbox_country.id
  WHERE sandbox_tourdate.artist_id = sandbox_artist.id
  AND sandbox_tourdate.country_id = %d """ % ireland.pk,
}).order_by("-tourdate_count",)

and here is an example of the results:

Screenshot of terminal with text