Relationships and Joins

Relational databases derive most of their power from the ability to link rows across tables. This document explains how Peewee models those links, what happens under the hood when you traverse them, and how to write queries that cross table boundaries efficiently.

By the end of this document you will understand:

  • How ForeignKeyField behaves at runtime, not just at schema definition time.

  • What a back-reference is and when to use one.

  • What the N+1 problem is and how to recognise it.

  • How to write joins, including multi-table and self-referential joins.

  • How many-to-many relationships are modelled.

  • When to use prefetch() instead of a join.

Model Definitions

All examples in this document use the following three models. They will be defined once here and reused throughout.

import datetime
from peewee import *

db = SqliteDatabase(':memory:')

class BaseModel(Model):
    class Meta:
        database = db

class User(BaseModel):
    username = TextField()

class Tweet(BaseModel):
    user = ForeignKeyField(User, backref='tweets')
    content = TextField()
    timestamp = DateTimeField(default=datetime.datetime.now)

class Favorite(BaseModel):
    user = ForeignKeyField(User, backref='favorites')
    tweet = ForeignKeyField(Tweet, backref='favorites')

A ForeignKeyField links one model to another. Tweet.user links each tweet to the user who wrote it. Favorite.user and Favorite.tweet together record which users have favorited which tweets.

The following helper populates test data that the examples below will query:

def create_test_data():
    db.create_tables([User, Tweet, Favorite])

    users = {
        name: User.create(username=name)
        for name in ('huey', 'mickey', 'zaizee')
    }

    tweet_data = {
        'huey':   ('meow', 'hiss', 'purr'),
        'mickey': ('woof', 'whine'),
        'zaizee': (),
    }
    tweets = {}
    for username, contents in tweet_data.items():
        for content in contents:
            tweets[content] = Tweet.create(
                user=users[username],
                content=content)

    # huey favorites mickey's "whine",
    # mickey favorites huey's "purr",
    # zaizee favorites huey's "meow" and "purr".
    favorite_data = (
        ('huey',   ['whine']),
        ('mickey', ['purr']),
        ('zaizee', ['meow', 'purr']),
    )
    for username, contents in favorite_data:
        for content in contents:
            Favorite.create(user=users[username], tweet=tweets[content])

This gives the following data:

User

Tweet

Favorited by

huey

meow

zaizee

huey

hiss

huey

purr

mickey, zaizee

mickey

woof

mickey

whine

huey

zaizee

(no tweets)

Note

To log every query Peewee executes to the console - useful for verifying query counts while working through this document - add the following before running any queries:

import logging
logging.getLogger('peewee').addHandler(logging.StreamHandler())
logging.getLogger('peewee').setLevel(logging.DEBUG)

Foreign Keys

When you declare a ForeignKeyField, Peewee creates two things on the model: a field that stores the raw integer ID value, and a descriptor that resolves that ID into a full model instance on access.

tweet = Tweet.get(Tweet.content == 'meow')

# Accessing .user resolves the foreign key - Peewee issues a SELECT
# query to fetch the related User row.
print(tweet.user.username)  # 'huey'

# Accessing .user_id returns the raw integer stored in the column,
# without issuing any query.
print(tweet.user_id)  # 1

The _id suffix accessor is available for every foreign key field. Use it whenever only the ID value is needed, since it avoids the extra query entirely.

Lazy loading

By default, a ForeignKeyField is lazy-loaded: the related object is not fetched until the attribute is first accessed, at which point a SELECT query is issued automatically. This is convenient but can lead to performance problems - see The N+1 Problem below.

To disable lazy loading on a specific field, pass lazy_load=False. With lazy loading disabled, accessing the attribute returns the raw ID value rather than issuing a query, matching the behaviour of the _id accessor:

class Tweet(BaseModel):
    user = ForeignKeyField(User, backref='tweets', lazy_load=False)

for tweet in Tweet.select():
    # Returns the integer ID, not a User instance. No extra query.
    print(tweet.user)

# If the User data was eagerly loaded via a join, the full User
# instance is accessible as normal, even with lazy_load=False.
for tweet in Tweet.select(Tweet, User).join(User):
    print(tweet.user.username)

See also

The N+1 Problem explains when and why disabling lazy loading is useful.

Back-references

Every ForeignKeyField automatically creates a back-reference on the related model. The back-reference is a pre-filtered Select query that returns all rows pointing at a given instance.

In the example schema, Tweet.user is a foreign key to User. The backref='tweets' parameter means that every User instance gains a tweets attribute, which is a pre-filtered Select query:

>>> huey = User.get(User.username == 'huey')

>>> huey.tweets  # back-reference is a Select query.
<peewee.ModelSelect object at 0x...>

>>> for tweet in huey.tweets:
...     print(tweet.content)
meow
hiss
purr

Taking a closer look at huey.tweets, we can see that it is just a simple pre-filtered SELECT query:

>>> huey.tweets
<peewee.ModelSelect at 0x7f0483931fd0>

>>> huey.tweets.sql()
('SELECT "t1"."id", "t1"."content", "t1"."timestamp", "t1"."user_id"
  FROM "tweet" AS "t1" WHERE ("t1"."user_id" = ?)', [1])

A back-reference behaves like any other Select query and can be filtered, ordered, and chained:

recent = (huey.tweets
          .order_by(Tweet.timestamp.desc())
          .limit(2))

If no backref name is specified, Peewee generates one automatically using the pattern <lowercase_classname>_set. Specifying an explicit backref is recommended for clarity.

The N+1 Problem

The N+1 problem occurs when code issues one query to fetch a list of N rows, then issues one or more additional queries per row to fetch related data - N+1 queries in total instead of one or two. At small scale this is invisible, but at production scale it can make pages that should take milliseconds take seconds.

Consider printing every tweet alongside its author’s username:

# Bad: issues 1 query for tweets + 1 query per tweet for the user.
for tweet in Tweet.select():
    print(tweet.user.username, '->', tweet.content)

# Good: only one query is needed.
query = (Tweet
         .select(Tweet, User)
         .join(User))

for tweet in query:
    # tweet.user is a User instance populated from the joined data.
    # No additional query is issued.
    print(tweet.user.username, '->', tweet.content)

Without joining and selecting the related User, each access to tweet.user triggers a SELECT on the user table. With five tweets, this produces six queries. With five thousand tweets, it produces five thousand and one.

The same problem can occur when iterating over back-references:

# Bad: issues 1 query for users + 1 query per user for their tweets.
for user in User.select():
    print(user.username)
    for tweet in user.tweets:  # A new query for each user.
        print('  ', tweet.content)

# Better:
for user in User.select().prefetch(Tweet):
    print(user.username)
    for tweet in user.tweets:  # Pre-fetched, no additional query.
        print('  ', tweet.content)

Peewee provides two complementary tools for avoiding N+1 queries:

  • Joins - combine rows from multiple tables in a single SELECT. Best when traversing a foreign key toward its target (many-to-one direction), for example fetching tweets with their authors.

  • Prefetch - issue one query per table and stitch the results together in Python. Best when traversing a back-reference (one-to-many direction), for example fetching users with all their tweets.

Both are covered in the sections below. The choice between them depends on the shape of the query.

Joins

A SQL join combines columns from two or more tables into a single result set. Peewee’s join() method generates the appropriate JOIN clause and, when the full result is returned as model instances, reconstructs the model graph automatically.

Join context

Peewee tracks a join context: the model from which the next join() call will depart. At the start of a query the join context is the model being selected from. Each call to join() moves the join context to the model just joined.

# Context starts at Tweet.
# After .join(User), context moves to User.
query = Tweet.select().join(User)

When joining through multiple tables in a chain, this is usually what you want. When joining from one model to two different models, the join context needs to be reset explicitly using switch() or join_from().

Peewee infers the join predicate (the ON clause) from the foreign keys defined on the models. If only one foreign key exists between two models, no additional specification is required. If multiple foreign keys exist, the relevant one must be specified explicitly.

The following code is equivalent to the prevoius example:

query = (Tweet
         .select()
         .join(User, on=(Tweet.user == User.id))
         .where(User.username == 'huey'))

Simple joins

To fetch all of huey’s tweets, join from Tweet to User and filter on the username:

query = (Tweet
         .select()
         .join(User)
         .where(User.username == 'huey'))

for tweet in query:
    print(tweet.content)

Peewee inferred the join predicate since Tweet.user is the only key between the two models. To explicitly specify the join predicate use on=:

query = (Tweet
         .select()
         .join(User, on=(Tweet.user == User.id))
         .where(User.username == 'huey'))

If a User instance is already available, the back-reference is simpler and equivalent for straightforward cases:

huey = User.get(User.username == 'huey')
for tweet in huey.tweets:
    print(tweet.content)

The join is the better choice when filtering or joining further. The back-reference is more readable for simple access to related rows.

Joining across multiple tables

To count how many favorites each user has received across all their tweets, a join must traverse User -> Tweet -> Favorite. Because each join moves the context forward, this chain can be written directly:

# Context: User -> join -> Tweet -> join -> Favorite
query = (User
         .select(User.username, fn.COUNT(Favorite.id).alias('fav_count'))
         .join(Tweet, JOIN.LEFT_OUTER)
         .join(Favorite, JOIN.LEFT_OUTER)
         .group_by(User.username))

for user in query:
    print(f'{user.username}: {user.fav_count} favorites received')

Both joins use LEFT OUTER because a user may have no tweets, and a tweet may have no favorites - yet both should appear in the result with a count of zero.

Switching join context

When a query needs to branch - joining from one model to two different models - the join context must be reset manually using switch().

To find all tweets by huey and how many times each has been favorited:

# Context: Tweet -> join -> User (context is now User)
# switch(Tweet) resets context to Tweet
# -> join -> Favorite (context is now Favorite)
query = (Tweet
         .select(Tweet.content, fn.COUNT(Favorite.id).alias('fav_count'))
         .join(User)
         .switch(Tweet)
         .join(Favorite, JOIN.LEFT_OUTER)
         .where(User.username == 'huey')
         .group_by(Tweet.content))

for tweet in query:
    print(f'{tweet.content}: favorited {tweet.fav_count} times')

Without the call to .switch(Tweet), Peewee would attempt to join from User to Favorite using Favorite.user, which would produce incorrect results.

Using join_from

join_from() is an alternative to switch().join() that makes the join source explicit in a single call. The above query can be written equivalently as:

query = (Tweet
         .select(Tweet.content, fn.COUNT(Favorite.id).alias('fav_count'))
         .join_from(Tweet, User)
         .join_from(Tweet, Favorite, JOIN.LEFT_OUTER)
         .where(User.username == 'huey')
         .group_by(Tweet.content))

join_from(A, B) is equivalent to switch(A).join(B) and is often more readable when a query branches across several paths.

Selecting columns from joined models

When columns from multiple models are included in select(), Peewee reconstructs the model graph and assigns related model instances to their corresponding attributes.

query = (Tweet
         .select(Tweet.content, User.username)
         .join(User))

for tweet in query:
    # tweet.user is a User instance populated from the joined data.
    # No additional query is issued.
    print(tweet.user.username, '->', tweet.content)

# huey -> meow
# huey -> hiss
# huey -> purr
# mickey -> woof
# mickey -> whine

To make it a bit more obvious that it’s doing the correct thing, we can ask Peewee to return the rows as dictionaries.

query = (Tweet
         .select(Tweet.content, User.username)
         .join(User)
         .dicts())

for row in query:
    print(row)

# {'content': 'meow', 'username': 'huey'}
# {'content': 'hiss', 'username': 'huey'}
# {'content': 'purr', 'username': 'huey'}
# {'content': 'woof', 'username': 'mickey'}
# {'content': 'whine', 'username': 'mickey'}

Compare these queries to the N+1 version: here, only one query is executed regardless of how many tweets are returned.

The attribute name that Peewee uses to store the joined instance follows the foreign key field name (tweet.user in this case). To override it, pass attr to join():

query = (Tweet
         .select(Tweet.content, User.username)
         .join(User, attr='author'))

for tweet in query:
    print(tweet.author.username, '->', tweet.content)

To flatten all selected columns onto the primary model instance rather than nesting them in a sub-object, append .objects():

query = (Tweet
         .select(Tweet.content, User.username)
         .join(User)
         .objects())

for tweet in query:
    # username is now an attribute on tweet directly.
    print(tweet.username, '->', tweet.content)

# huey -> meow

See Row Types for the different ways Peewee can return rows.

More complex example

As a more complex example, in this query, we will write a single query that selects all the favorites, along with the user who created the favorite, the tweet that was favorited, and that tweet’s author.

In SQL we would write:

SELECT owner.username, tweet.content, author.username AS author
FROM favorite
INNER JOIN user AS owner ON (favorite.user_id = owner.id)
INNER JOIN tweet ON (favorite.tweet_id = tweet.id)
INNER JOIN user AS author ON (tweet.user_id = author.id);

Note that we are selecting from the user table twice - once in the context of the user who created the favorite, and again as the author of the tweet.

With Peewee, we use Model.alias() to alias a model class so it can be referenced twice in a single query:

Owner = User.alias()
query = (Favorite
         .select(Favorite, Tweet.content, User.username, Owner.username)
         .join_from(Favorite, Owner)  # Determine owner of favorite.
         .join_from(Favorite, Tweet)  # Join favorite -> tweet.
         .join_from(Tweet, User))     # Join tweet -> user.

We can iterate over the results and access the joined values in the following way. Note how Peewee has resolved the fields from the various models we selected and reconstructed the model graph:

for fav in query:
    print(fav.user.username, 'liked', fav.tweet.content, 'by', fav.tweet.user.username)

# huey liked whine by mickey
# mickey liked purr by huey
# zaizee liked meow by huey
# zaizee liked purr by huey

Subqueries

Peewee allows you to join on any table-like object, including subqueries or common table expressions (see Common Table Expressions). To demonstrate joining on a subquery, let’s query for all users and their latest tweet.

Here is the SQL:

SELECT tweet.*, user.*
FROM tweet
INNER JOIN (
    SELECT latest.user_id, MAX(latest.timestamp) AS max_ts
    FROM tweet AS latest
    GROUP BY latest.user_id) AS latest_query
ON ((tweet.user_id = latest_query.user_id) AND (tweet.timestamp = latest_query.max_ts))
INNER JOIN user ON (tweet.user_id = user.id)

We’ll do this by creating a subquery which selects each user and the timestamp of their latest tweet. Then we can query the tweets table in the outer query and join on the user and timestamp combination from the subquery.

# Define our subquery first. We'll use an alias of the Tweet model, since
# we will be querying from the Tweet model directly in the outer query.
Latest = Tweet.alias()
latest_query = (Latest
                .select(Latest.user, fn.MAX(Latest.timestamp).alias('max_ts'))
                .group_by(Latest.user)
                .alias('latest_query'))

# Our join predicate will ensure that we match tweets based on their
# timestamp *and* user_id.
predicate = ((Tweet.user == latest_query.c.user_id) &
             (Tweet.timestamp == latest_query.c.max_ts))

# We put it all together, querying from tweet and joining on the subquery
# using the above predicate.
query = (Tweet
         .select(Tweet, User)  # Select all columns from tweet and user.
         .join_from(Tweet, latest_query, on=predicate)  # Join tweet -> subquery.
         .join_from(Tweet, User))  # Join from tweet -> user.

Iterating over the query, we can see each user and their latest tweet.

for tweet in query:
    print(tweet.user.username, '->', tweet.content)

# huey -> purr
# mickey -> whine

There are a couple things you may not have seen before in the code we used to create the query in this section:

  • We used join_from() to explicitly specify the join context. We wrote .join_from(Tweet, User), which is equivalent to .switch(Tweet).join(User).

  • We referenced columns in the subquery using the magic .c attribute, for example latest_query.c.max_ts. The .c attribute is used to dynamically create column references.

  • Instead of passing individual fields to Tweet.select(), we passed the Tweet and User models. This is shorthand for selecting all fields on the given model.

Common-table Expressions

In the previous section we joined on a subquery, but we could just as easily have used a common-table expression (CTE). We will repeat the same query as before, listing users and their latest tweets, but this time we will do it using a CTE.

Here is the SQL:

WITH latest AS (
    SELECT user_id, MAX(timestamp) AS max_ts
    FROM tweet
    GROUP BY user_id)
SELECT tweet.*, user.*
FROM tweet
INNER JOIN latest
    ON ((latest.user_id = tweet.user_id) AND (latest.max_ts = tweet.timestamp))
INNER JOIN user
    ON (tweet.user_id = user.id)

This example looks very similar to the previous example with the subquery:

# Define our CTE first. We'll use an alias of the Tweet model, since
# we will be querying from the Tweet model directly in the main query.
Latest = Tweet.alias()
cte = (Latest
       .select(Latest.user, fn.MAX(Latest.timestamp).alias('max_ts'))
       .group_by(Latest.user)
       .cte('latest'))

# Our join predicate will ensure that we match tweets based on their
# timestamp *and* user_id.
predicate = ((Tweet.user == cte.c.user_id) &
             (Tweet.timestamp == cte.c.max_ts))

# We put it all together, querying from tweet and joining on the CTE
# using the above predicate.
query = (Tweet
         .select(Tweet, User)  # Select all columns from tweet and user.
         .join(cte, on=predicate)  # Join tweet -> CTE.
         .join_from(Tweet, User)  # Join from tweet -> user.
         .with_cte(cte))

We can iterate over the result-set, which consists of the latest tweets for each user:

for tweet in query:
    print(tweet.user.username, '->', tweet.content)

# huey -> purr
# mickey -> whine

Note

For more information about using CTEs, including information on writing recursive CTEs, see the Common Table Expressions section of the “Querying” document.

Multiple foreign keys to the same model

When two foreign keys on the same model both point at the same target model, Peewee cannot infer which one to use for a join. The field must be specified explicitly.

Consider a Relationship model recording which users follow which other users:

class Relationship(BaseModel):
    from_user = ForeignKeyField(User, backref='following')
    to_user = ForeignKeyField(User, backref='followers')

    class Meta:
        indexes = ((('from_user', 'to_user'), True),)

To find everyone that huey follows:

huey = User.get(User.username == 'huey')

following = (User
             .select()
             .join(Relationship, on=Relationship.to_user)
             .where(Relationship.from_user == huey))

To find everyone who follows huey:

followers = (User
             .select()
             .join(Relationship, on=Relationship.from_user)
             .where(Relationship.to_user == huey))

Passing the field instance to on= tells Peewee which foreign key column to use for the join.

Joining without a foreign key

A join can be performed on any two tables, even when no ForeignKeyField exists between them, by supplying an explicit join predicate as an expression:

query = (User
         .select(User, ActivityLog)
         .join(ActivityLog,
               on=(User.id == ActivityLog.object_id),
               attr='log')
         .where(
             (ActivityLog.activity_type == 'login') &
             (User.username == 'huey')))

for user in query:
    print(user.username, '->', user.log.description)

Self-joins

A self-join queries a model against an alias of itself. Use Model.alias() to create the alias:

# Find all categories and their immediate parent name.
class Category(BaseModel):
    name = TextField()
    parent = ForeignKeyField('self', null=True, backref='children')

Parent = Category.alias()
query = (Category
         .select(Category.name, Parent.name)
         .join(Parent, JOIN.LEFT_OUTER, on=(Category.parent == Parent.id))
         .order_by(Category.name))

for row in query:
    print(row.name, 'parent:', row.parent.name if row.parent else 'None')

See also

Recursive queries over self-referential structures are covered in Common Table Expressions using recursive CTEs.

Many-to-Many Relationships

A many-to-many relationship - where one row in table A can relate to many rows in table B and vice versa - requires an intermediate through table that holds pairs of foreign keys.

ManyToManyField

ManyToManyField provides a shortcut API that manages the through table automatically. It is suitable for simple cases where the through table requires no extra columns and complex querying is not needed.

class Student(BaseModel):
    name = TextField()

class Course(BaseModel):
    title = TextField()
    students = ManyToManyField(Student, backref='courses')

# Retrieve the auto-generated through model if direct access is needed.
Enrollment = Course.students.get_through_model()

db.create_tables([Student, Course, Enrollment])

huey = Student.create(name='Huey')
cs101 = Course.create(title='CS 101')

# Adding and removing relationships:
huey.courses.add(cs101)
huey.courses.add(Course.select().where(Course.title.contains('Math')))

cs101.students.remove(huey)
cs101.students.clear()   # Removes all students from this course.

# Querying through the field:
for course in huey.courses.order_by(Course.title):
    print(course.title)

Warning

ManyToManyField does not work correctly with model inheritance. The through table contains foreign keys back to the original models, and those pointers are not automatically updated for subclasses. For any model that will be subclassed, use an explicit through table instead.

Avoiding N+1 with Prefetch

Joins solve the N+1 problem when traversing from the many side toward the one side - for example, fetching tweets with their authors. Each tweet has exactly one author, so a join produces exactly one result row per tweet.

The situation is different when traversing from the one side toward the many side - for example, fetching users with all their tweets. A join in this direction produces one result row per tweet, which means users with multiple tweets appear multiple times in the result set. Deduplicating those rows in application code is awkward and error-prone.

prefetch() solves this by issuing one query per table, then stitching the results together in Python. Instead of O(n) queries for n rows, we will do O(k) queries for k tables:

# Two queries total, regardless of how many users or tweets there are:
# SELECT * FROM user
# SELECT * FROM tweet WHERE user_id IN (...)
users = User.select().prefetch(Tweet)

# Equivalent to above.
users = prefetch(User.select(), Tweet.select())

for user in users:
    print(user.username)
    for tweet in user.tweets:  # No additional query, user.tweets is a list.
        print(f'  {tweet.content}')

The models passed to prefetch() must be linked by foreign keys. Peewee infers the relationships and assigns the prefetched rows to the appropriate back-reference attribute on each instance.

Prefetch can span more than two tables. To fetch users, their tweets, and the favorites on each tweet in three queries:

users = prefetch(User.select(), Tweet.select(), Favorite.select())

for user in users:
    for tweet in user.tweets:
        print(f'{user.username}: {tweet.content} '
              f'({len(tweet.favorites)} favorites)')

Filtering prefetched rows

Both the outer query and the prefetch subqueries can carry WHERE clauses and other modifiers independently:

one_week_ago = datetime.date.today() - datetime.timedelta(days=7)

users = prefetch(
    User.select().order_by(User.username),
    Tweet.select().where(Tweet.timestamp >= one_week_ago),
)

The filter on Tweet applies only to the prefetched tweets; it does not affect which users are returned.

Choosing between joins and prefetch

Use a join when:

  • Traversing from the many side to the one side (tweet -> author).

  • Filtering on columns in the related table (tweets by users whose username starts with “h”).

  • Only a subset of related fields is needed.

Use prefetch when:

  • Traversing from the one side to the many side (user -> all their tweets).

  • The full set of related rows is needed for each parent row.

  • Nesting more than one level of related data (users -> tweets -> favorites).

Note

LIMIT on the outer query of a prefetch() call works as expected. Limiting the inner queries (the prefetched tables) is not directly supported and requires a manual approach - see Top N Per Group in the recipes document for techniques.

See also

prefetch() API reference.