Nested Queries with SQLAlchemy ORM

Posted by
on under

One of the most rewarding aspects of having a popular course online is that from time to time I get a question that forces me to learn something new. The other day a reader asked me how they can write a database query with an unusual ordering, and I had to stop for a minute (okay, it was more like half an hour) to think about how to do it within the context of a Flask and Flask-SQLAlchemy application. Are you ready to see some advanced SQLAlchemy action?

The Problem

This reader had a database with an orders table that looked more or less like this:

id customer_id order_date
1 1 2018-01-01
2 1 2018-01-05
3 3 2018-01-07
4 1 2018-02-06
5 3 2018-01-31
6 2 2018-02-01
7 3 2018-02-01
8 3 2018-01-20
9 2 2018-02-07

The question was about getting the items of this table sorted by the customer_id field, but instead of using a simple ascending or descending order, my reader needed the list to be sorted based on how long ago a customer placed their last order, starting from the most recent customer.

So basically, the desired order was this:

id customer_id order_date
9 2 2018-02-07
6 2 2018-02-01
4 1 2018-02-06
2 1 2018-01-05
1 1 2018-01-01
7 3 2018-02-01
5 3 2018-01-31
8 3 2018-01-20
3 3 2018-01-07

In this table, the entries for customer 2 appear first, because when you look at the original table, you can see that this customer placed the most recent order on February 7th, which is the one with id 9. This is the last order stored in the table, so this customer is the most recent and for that reason it goes first with its two orders. The entries are sorted in descending order by the order date.

The next group of entries are for customer 1, because this customer is the second most recent customer, with an order on February 6th. Once again, the three orders for this customer are next to each other, and sorted in descending order by their dates.

Finally, customer 3 is the oldest of the three to place an order, with the most recent order on February 1st. The four orders from this customer come at the bottom, again sorted in descending order by the order date.

I couldn't immediately come up with a SQLAlchemy query that can do this. Before you continue reading, do you want to see if you can figure this out on your own? To make it easier for you, I have created a gist on GitHub with a fully working example application and database that you can use. All you need to do is grab the file and write your query in the place indicated by comments!

Open the example gist (on a separate tab)

To use this application you need to create a virtual environment and install flask-sqlalchemy on it. Then just run the script. In this application I'm using an in-memory SQLite database, so you don't need to worry about creating databases, each time you run the script a brand new and clean database is created.

If you want to see how I solved this problem, read on.

The Subquery

The solution to the above problem cannot be done with a simple query (at least I don't think it can, but would love to be proved wrong!). To be able to sort the rows as required we'll need to create two queries and combine them.

The first part of the solution is to figure out the order in which the customers need to appear in the query. For this, we basically need to look at the last order for each customer. An easy way to do that is to condense or group the orders table.

In relational databases, a grouping operation looks at the values of a certain column and all the rows that have the same value are collapsed into a single row in a temporary grouped table. For our example, when we group by the customer_id field we will end up with a grouped table that has three rows, one per customer.

The tricky part is, however, how to consolidate all the rows that have the same customer_id into a single row. For this, SQL provides aggregate functions, which are functions that take a list of values and produce a single result. Common aggregate functions available in all SQL implementations are sum, avg, min, max and a few others. In our case, we want to know the date of the last order for each customer, so we can use max(order_date) to create our grouped table.

I'm not sure if this is true for other SQLAlchemy developers, but for non-trivial queries I often find it easier to come up with a solution using raw SQL, and then once I have it I adapt it to SQLAlchemy. In SQL, the grouping is done with the following statement:

SELECT customer_id, max(order_date) AS last_order_date FROM orders GROUP BY customer_id

This query will consolidate all the rows that have the same customer_id into one, and for the order_date column it will calculate the maximum of all the rows that are being grouped into one. The original table also has an id column with the order primary key, but you can see that I'm not making a reference to this column in the SELECT portion of this query, so that column will not be included in the grouped results, and this is what I want since I really have no sensible way to aggregate the id field.

If you were to run the above query on the example data I provided, the result would be:

customer_id last_order_date
1 2018-02-06
2 2018-02-07
3 2018-02-01

Now that we have a functioning SQL query, we can convert it to Flask-SQLAlchemy:

last_orders = db.session.query(
    Order.customer_id, db.func.max(Order.order_date).label('last_order_date')
).group_by(Order.customer_id).subquery()

If you use plain SQLAlchemy, you would replace the above db.session with your own session object. Flask-SQLAlchemy makes working with sessions much easier, so I always prefer to use this extension.

If you are used to run queries that begin with Order.query.something, or maybe session.query(Order).something when using plain SQLAlchemy, then the above must look strange. Note in the raw SQL example that I have two entities in the SELECT part. The standard SQLAlchemy queries that you are used to see are a simplified form that is appropriate when you are querying entire rows from a single table. If the query is supposed to return results that come from two or more tables, or like in this case, a combination of real and aggregate columns, then you have to use this more verbose form, which specifies all the things you need the query to return as arguments to the session.query() method.

The two arguments to session.query() are directly translated from the SQL's SELECT, which are the customer_id column and the max(order_date) aggregate. Note the AS last_order_date part of the SQL statement, which gives an alias to the aggregate column. In SQLAlchemy, the label() method applied to a column achieves the same result.

The grouping is done with the group_by() query method, which takes the column to use for the grouping as an argument, same as the GROUP BY counterpart in SQL.

The statement ends by calling subquery(), which tells SQLAlchemy that our intention for this query is to use it inside a bigger query instead of on its own. In practical terms, this means that we will be able to treat the subquery object as if it was a real database table, while in fact it is a temporary table that is created on the fly. You will see how that works in the next section.

The Join

Now that we know the order in which we want the customers to be returned, we have to incorporate that order into the original table. And the most straightforward way to do this is by first merging the orders table with the last_orders subquery obtained in the previous section.

To merge two tables in a relational database we use a join operation. The join will take each row in the orders table, match it against a corresponding row in the last_orders subquery and finally generate a new combined row that has the columns of both tables. The result of the join operation will be another on-the-fly table. With raw SQL, the join against the subquery would be done as follows:

SELECT * FROM orders JOIN (
  SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id
) AS last_orders
ON orders.customer_id = last_orders.customer_id

Here we have the subquery from the previous section inside a JOIN( ... ) AS name construct. The last_orders name is then mapped to the subquery results. Other parts of the query can then use this name to reference individual columns of those results. The ON part specifies what is the criteria to join the two tables, and in this case it is a simple one, just match rows that have the same customer_id value.

In our example the join would return the following data:

id customer_id order_date last_order_date
1 1 2018-01-01 2018-02-06
2 1 2018-01-05 2018-02-06
3 3 2018-01-07 2018-02-01
4 1 2018-02-06 2018-02-06
5 3 2018-01-31 2018-02-01
6 2 2018-02-01 2018-02-07
7 3 2018-02-01 2018-02-01
8 3 2018-01-20 2018-02-01
9 2 2018-02-07 2018-02-07

Now that we have the last order date for each customer joined with the orders, we can sort the table by this virtual last_order_date column in descending order, and that satisfies the first sorting criteria in our problem statement:

SELECT * FROM orders JOIN (
  SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id
) AS last_orders
ON orders.customer_id = last_orders.customer_id
ORDER BY last_order_date DESC

But we are not done yet, as there is a secondary ordering that we need to implement. Within each customer, we need to provide the results ordered by order date in descending order. This can be done with an additional ordering by the original order_date field.

Here is the complete SQL statement that returns the data in the requested order:

SELECT * FROM orders JOIN (
  SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id
) AS last_orders
ON orders.customer_id = last_orders.customer_id
ORDER BY last_order_date DESC, orders.order_date DESC

The translation to SQLAlchemy is fairly direct, but we are going to separate the subquery to avoid having too much complexity in a single statement. Here is the SQLAlchemy version of the above query:

last_orders = db.session.query(
    Order.customer_id, db.func.max(Order.order_date).label('last_order_date')
).group_by(Order.customer_id).subquery()
query = Order.query.join(
    last_orders, Order.customer_id == last_orders.c.customer_id
).order_by(last_orders.c.last_order_date.desc(), Order.order_date.desc())

The last_orders subquery is a copy of the subquery I presented in the previous section. Note that at this point nothing is sent to the database yet, storing the subquery in a local variable ahead of time does not trigger an additional database query.

In the second statement we take the Order model and join it against the last_orders subquery. The subquery object works in a similar way to a SQLAlchemy table, so we can reference individual columns with the table.c.column_name syntax. The c confuses a lot of people, SQLAlchemy uses this unfortunately odd name as a container for columns in table objects.

The join() method takes two arguments, first the right-side table in the join (our last_orders subquery) and then the condition for the join, which is that the customer_id columns in both tables match. With the join in place the order can be specified and this is a direct translation of the two order statements from the SQL example. Note how the virtual last_order_date column is referenced with that weird c as last_orders.c.last_order_date, but then order_date in the Order model is referenced as a direct attribute. The difference here is that Order is a model, while last_orders is a generic table with results. Models have a higher level interface than tables, so they are nicer to use.

As final exercise, I wanted to see how my handcrafted SQL compares against the SQL generated by SQLAlchemy using the query above. In case you don't know this trick, you can obtain the SQL that SQLAlchemy generates for any query object just by casting the query to a string:

print(str(query))

The SQLAlchemy query above generates the following raw SQL:

SELECT 
    orders.id AS orders_id,
    orders.customer_id AS orders_customer_id,
    orders.order_date AS orders_order_date
FROM orders JOIN (
    SELECT
        orders.customer_id AS customer_id,
        max(orders.order_date) AS last_order_date
    FROM orders GROUP BY orders.customer_id
) AS anon_1
ON orders.customer_id = anon_1.customer_id
ORDER BY anon_1.last_order_date DESC, orders.order_date DESC

If you ignore the slightly increased verbosity of this generated statement, things are pretty much the same. SQLAlchemy likes to create an alias for every column in the query, so you see that the AS name construct is used a lot. The subquery is identical to the raw SQL one but SQLAlchemy lacking context gives it a generic anon_1 name, versus my more explicit last_orders.

Other than that the queries are the same, so I'm pretty satisfied with this solution. Do you know a better or more efficient way to get the same result? Let me know below in the comments.

Become a Patron!

Hello, and thank you for visiting my blog! If you enjoyed this article, please consider supporting my work on this blog on Patreon!

29 comments
  • #26 Keerthi said

    Is there a way to transform rows and column resulting from executing a query into a object tree?

    https://stackoverflow.com/questions/72734865/mapping-multiple-rows-representing-a-one-to-many-relationship

  • #27 Miguel Grinberg said

    @Keerthi: if the representation provided by SQLAlchemy is insufficient, you can always read the objects with SQLAlchemy and then create your own representation.

  • #28 Rustam said

    Hi Miguel!
    what do you think on approach of using lateral join in this case?

    https://sqlfordevs.com/for-each-loop-lateral-join

    will it provide better performance? any drawbacks?

  • #29 Miguel Grinberg said

    @Rustam: I don't think lateral joins will provide better performance than a single joined query as shown here.

Leave a Comment