Nested Queries with SQLAlchemy ORM

Posted by
on under

One of the most rewarding aspects of having a popular course online is that from time to time I get a question that forces me to learn something new. The other day a reader asked me how they can write a database query with an unusual ordering, and I had to stop for a minute (okay, it was more like half an hour) to think about how to do it within the context of a Flask and Flask-SQLAlchemy application. Are you ready to see some advanced SQLAlchemy action?

The Problem

This reader had a database with an orders table that looked more or less like this:

id customer_id order_date
1 1 2018-01-01
2 1 2018-01-05
3 3 2018-01-07
4 1 2018-02-06
5 3 2018-01-31
6 2 2018-02-01
7 3 2018-02-01
8 3 2018-01-20
9 2 2018-02-07

The question was about getting the items of this table sorted by the customer_id field, but instead of using a simple ascending or descending order, my reader needed the list to be sorted based on how long ago a customer placed their last order, starting from the most recent customer.

So basically, the desired order was this:

id customer_id order_date
9 2 2018-02-07
6 2 2018-02-01
4 1 2018-02-06
2 1 2018-01-05
1 1 2018-01-01
7 3 2018-02-01
5 3 2018-01-31
8 3 2018-01-20
3 3 2018-01-07

In this table, the entries for customer 2 appear first, because when you look at the original table, you can see that this customer placed the most recent order on February 7th, which is the one with id 9. This is the last order stored in the table, so this customer is the most recent and for that reason it goes first with its two orders. The entries are sorted in descending order by the order date.

The next group of entries are for customer 1, because this customer is the second most recent customer, with an order on February 6th. Once again, the three orders for this customer are next to each other, and sorted in descending order by their dates.

Finally, customer 3 is the oldest of the three to place an order, with the most recent order on February 1st. The four orders from this customer come at the bottom, again sorted in descending order by the order date.

I couldn't immediately come up with a SQLAlchemy query that can do this. Before you continue reading, do you want to see if you can figure this out on your own? To make it easier for you, I have created a gist on GitHub with a fully working example application and database that you can use. All you need to do is grab the file and write your query in the place indicated by comments!

Open the example gist (on a separate tab)

To use this application you need to create a virtual environment and install flask-sqlalchemy on it. Then just run the script. In this application I'm using an in-memory SQLite database, so you don't need to worry about creating databases, each time you run the script a brand new and clean database is created.

If you want to see how I solved this problem, read on.

The Subquery

The solution to the above problem cannot be done with a simple query (at least I don't think it can, but would love to be proved wrong!). To be able to sort the rows as required we'll need to create two queries and combine them.

The first part of the solution is to figure out the order in which the customers need to appear in the query. For this, we basically need to look at the last order for each customer. An easy way to do that is to condense or group the orders table.

In relational databases, a grouping operation looks at the values of a certain column and all the rows that have the same value are collapsed into a single row in a temporary grouped table. For our example, when we group by the customer_id field we will end up with a grouped table that has three rows, one per customer.

The tricky part is, however, how to consolidate all the rows that have the same customer_id into a single row. For this, SQL provides aggregate functions, which are functions that take a list of values and produce a single result. Common aggregate functions available in all SQL implementations are sum, avg, min, max and a few others. In our case, we want to know the date of the last order for each customer, so we can use max(order_date) to create our grouped table.

I'm not sure if this is true for other SQLAlchemy developers, but for non-trivial queries I often find it easier to come up with a solution using raw SQL, and then once I have it I adapt it to SQLAlchemy. In SQL, the grouping is done with the following statement:

SELECT customer_id, max(order_date) AS last_order_date FROM orders GROUP BY customer_id

This query will consolidate all the rows that have the same customer_id into one, and for the order_date column it will calculate the maximum of all the rows that are being grouped into one. The original table also has an id column with the order primary key, but you can see that I'm not making a reference to this column in the SELECT portion of this query, so that column will not be included in the grouped results, and this is what I want since I really have no sensible way to aggregate the id field.

If you were to run the above query on the example data I provided, the result would be:

customer_id last_order_date
1 2018-02-06
2 2018-02-07
3 2018-02-01

Now that we have a functioning SQL query, we can convert it to Flask-SQLAlchemy:

last_orders = db.session.query(
    Order.customer_id, db.func.max(Order.order_date).label('last_order_date')
).group_by(Order.customer_id).subquery()

If you use plain SQLAlchemy, you would replace the above db.session with your own session object. Flask-SQLAlchemy makes working with sessions much easier, so I always prefer to use this extension.

If you are used to run queries that begin with Order.query.something, or maybe session.query(Order).something when using plain SQLAlchemy, then the above must look strange. Note in the raw SQL example that I have two entities in the SELECT part. The standard SQLAlchemy queries that you are used to see are a simplified form that is appropriate when you are querying entire rows from a single table. If the query is supposed to return results that come from two or more tables, or like in this case, a combination of real and aggregate columns, then you have to use this more verbose form, which specifies all the things you need the query to return as arguments to the session.query() method.

The two arguments to session.query() are directly translated from the SQL's SELECT, which are the customer_id column and the max(order_date) aggregate. Note the AS last_order_date part of the SQL statement, which gives an alias to the aggregate column. In SQLAlchemy, the label() method applied to a column achieves the same result.

The grouping is done with the group_by() query method, which takes the column to use for the grouping as an argument, same as the GROUP BY counterpart in SQL.

The statement ends by calling subquery(), which tells SQLAlchemy that our intention for this query is to use it inside a bigger query instead of on its own. In practical terms, this means that we will be able to treat the subquery object as if it was a real database table, while in fact it is a temporary table that is created on the fly. You will see how that works in the next section.

The Join

Now that we know the order in which we want the customers to be returned, we have to incorporate that order into the original table. And the most straightforward way to do this is by first merging the orders table with the last_orders subquery obtained in the previous section.

To merge two tables in a relational database we use a join operation. The join will take each row in the orders table, match it against a corresponding row in the last_orders subquery and finally generate a new combined row that has the columns of both tables. The result of the join operation will be another on-the-fly table. With raw SQL, the join against the subquery would be done as follows:

SELECT * FROM orders JOIN (
  SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id
) AS last_orders
ON orders.customer_id = last_orders.customer_id

Here we have the subquery from the previous section inside a JOIN( ... ) AS name construct. The last_orders name is then mapped to the subquery results. Other parts of the query can then use this name to reference individual columns of those results. The ON part specifies what is the criteria to join the two tables, and in this case it is a simple one, just match rows that have the same customer_id value.

In our example the join would return the following data:

id customer_id order_date last_order_date
1 1 2018-01-01 2018-02-06
2 1 2018-01-05 2018-02-06
3 3 2018-01-07 2018-02-01
4 1 2018-02-06 2018-02-06
5 3 2018-01-31 2018-02-01
6 2 2018-02-01 2018-02-07
7 3 2018-02-01 2018-02-01
8 3 2018-01-20 2018-02-01
9 2 2018-02-07 2018-02-07

Now that we have the last order date for each customer joined with the orders, we can sort the table by this virtual last_order_date column in descending order, and that satisfies the first sorting criteria in our problem statement:

SELECT * FROM orders JOIN (
  SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id
) AS last_orders
ON orders.customer_id = last_orders.customer_id
ORDER BY last_order_date DESC

But we are not done yet, as there is a secondary ordering that we need to implement. Within each customer, we need to provide the results ordered by order date in descending order. This can be done with an additional ordering by the original order_date field.

Here is the complete SQL statement that returns the data in the requested order:

SELECT * FROM orders JOIN (
  SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id
) AS last_orders
ON orders.customer_id = last_orders.customer_id
ORDER BY last_order_date DESC, orders.order_date DESC

The translation to SQLAlchemy is fairly direct, but we are going to separate the subquery to avoid having too much complexity in a single statement. Here is the SQLAlchemy version of the above query:

last_orders = db.session.query(
    Order.customer_id, db.func.max(Order.order_date).label('last_order_date')
).group_by(Order.customer_id).subquery()
query = Order.query.join(
    last_orders, Order.customer_id == last_orders.c.customer_id
).order_by(last_orders.c.last_order_date.desc(), Order.order_date.desc())

The last_orders subquery is a copy of the subquery I presented in the previous section. Note that at this point nothing is sent to the database yet, storing the subquery in a local variable ahead of time does not trigger an additional database query.

In the second statement we take the Order model and join it against the last_orders subquery. The subquery object works in a similar way to a SQLAlchemy table, so we can reference individual columns with the table.c.column_name syntax. The c confuses a lot of people, SQLAlchemy uses this unfortunately odd name as a container for columns in table objects.

The join() method takes two arguments, first the right-side table in the join (our last_orders subquery) and then the condition for the join, which is that the customer_id columns in both tables match. With the join in place the order can be specified and this is a direct translation of the two order statements from the SQL example. Note how the virtual last_order_date column is referenced with that weird c as last_orders.c.last_order_date, but then order_date in the Order model is referenced as a direct attribute. The difference here is that Order is a model, while last_orders is a generic table with results. Models have a higher level interface than tables, so they are nicer to use.

As final exercise, I wanted to see how my handcrafted SQL compares against the SQL generated by SQLAlchemy using the query above. In case you don't know this trick, you can obtain the SQL that SQLAlchemy generates for any query object just by casting the query to a string:

print(str(query))

The SQLAlchemy query above generates the following raw SQL:

SELECT 
    orders.id AS orders_id,
    orders.customer_id AS orders_customer_id,
    orders.order_date AS orders_order_date
FROM orders JOIN (
    SELECT
        orders.customer_id AS customer_id,
        max(orders.order_date) AS last_order_date
    FROM orders GROUP BY orders.customer_id
) AS anon_1
ON orders.customer_id = anon_1.customer_id
ORDER BY anon_1.last_order_date DESC, orders.order_date DESC

If you ignore the slightly increased verbosity of this generated statement, things are pretty much the same. SQLAlchemy likes to create an alias for every column in the query, so you see that the AS name construct is used a lot. The subquery is identical to the raw SQL one but SQLAlchemy lacking context gives it a generic anon_1 name, versus my more explicit last_orders.

Other than that the queries are the same, so I'm pretty satisfied with this solution. Do you know a better or more efficient way to get the same result? Let me know below in the comments.

Become a Patron!

Hello, and thank you for visiting my blog! If you enjoyed this article, please consider supporting my work on this blog on Patreon!

29 comments
  • #1 smitty said

    I like SQLAlchemy for DDL and connections, but I usually do zany things that are more simply written in SQL as bare text.

  • #2 Miguel Grinberg said

    @smitty: I prefer to convert my queries back into SQLAlchemy, so that I can use a consistent way to run queries and access results, but I agree, it is easier to write your complex queries directly in SQL first.

  • #3 Joe Gasewicz said

    Thanks, Miguel, great article as usual!

  • #4 ML said

    @Miguel #2:

    Other than for the sake of consistency, are there other benefits to converting complex queries back into SQLAlchemy instead of leaving it in the raw SQL?

    Thanks and appreciate all you do.

  • #5 Miguel Grinberg said

    @ML: It's mainly just so that you can use your model classes, but having the query generated by SQLAlchemy also gives you the additional benefit that there is no possibility of a SQL injection. If you do your query in raw SQL you have to remember to do proper escaping to avoid this risk.

  • #6 Tiago said

    Great!

  • #7 Guy Alexis said

    I have a mysql query and I'm trying to use nested queries to rewrite as sqlalchemy can compile. This is my query

    Select m.libelle_motif, case when valeur> 0 then valeur else 0 end as valeur
    from (SELECT motif.libelle_motif AS motif_libelle_motif, count(demandes.id_demande) AS valeur
    FROM motif, demandes, customers
    WHERE demandes.stand_id = 11
    AND customers.identifiant_customers = ""
    AND customers.reference_customers = ""
    AND demandes.type_motif_id = 3
    AND demandes.campagne_id = 2 AND demandes.motif_id = motif.id_motif AND demandes.customer_id = customers.id_customers
    GROUP BY motif.libelle_motif) T
    right join motif m on T.motif_libelle_motif = m.libelle_motif;

  • #8 Miguel Grinberg said

    @Guy: Sorry, you'll have to figure it out by yourself. It would take some time to get it right, and you need to have a database to test against, so you are better positions than me to do this.

  • #9 kai said

    great, it's really helpful for new beginners

  • #10 Batyr said

    Hi Miguel, Thanks for a great tutorial, i have a question regarding usage of subqueries. I used it in my project but it doesn't seem to give me the right output. How can I access the elements from the subquery? When I use in join it gives me the data of the queried table but not the data from subquery.
    For example:
    cp = db.session.query(CustomerPurchase.category_id, func.sum(CustomerPurchase.amount)).filter(CustomerPurchase.created_on >= start_date).group_by(CustomerPurchase.category_id).subquery()

    And then:
    bp = CustomerPurchase.query.join(cp, cp.c.category_id == CustomerPurchase.category_id).order_by(cp.c.category_id)

    So it gives me the rows but it doesn't SUM the column amount and doesn't merge the duplicates...

  • #11 Miguel Grinberg said

    @Batyr: You are running your query starting from CustomerPurchase.query, so the results are going to be objects from that table. You can use the generic db.session.query(A, B, C).join(...) to get results from several tables.

  • #12 luckystripe said

    Miguel, i'm thrilled to have learned something huge about flask and how to query using sqlalchemy. in the past i've given up on understanding sqlalchemy, since everything i reads eemed not to make sense to me. but looking at your github gist, it worked and made total sense.
    i did not know you could just loop through the database just using (for example) for item in User.query: the rest of the logic is in python. thats fantastic!
    but heres the question. i was under the impression that sqlalchemy writes translates python into sql queries. if everything is written in python without using the built in flask filters such as these:
    User.query.order_by(User.name).all()
    User.query.filter(User.email.endswith('aaa')).all()
    User.query.order_by(User.name).all()
    User.query.limit(1).all()
    User.query.get(1)
    User.query.all()
    as per the documentation on the flask site.
    i guess my question is, when you use python without interacting with the flask-sqlalchemy module, is it being translated to sql statements on the back end? or is python just pulling the queried output from loaded memory and not doin a more optimized and efficent query that sql was designed to do? im still new at this, i hope i'm making sense. woudl writing all the loging to query in python and bypassing the aboved mentioned module functions make it less optimal and less scalable?

    thank you so much for your work and good post!!

  • #13 Miguel Grinberg said

    @luckystripe: SQLAlchemy uses a lot of advanced and sometimes obscure techniques to make it possible to write fairly normal looking Python code that is translated to SQL before it executes. For any query that you have, you can always see what is the generated SQL. Example:

    query = User.query.filter(User.email.endswith('aaa'))
    print(str(query))

    This will show you what SQL is going to be sent to the database. Note that I took your example and remove the final .all() from it, because this needs to be done on a query object that hasn't executed yet. Hope this helps!

  • #14 luckystripe said

    for the sake of the example, lets say Users had a few million records and more fields.

    what if we did something like this:

    query = User.query
    for item in query:
    if query.name.startswith('k') and age >= 21:
    p1 = query.Field6
    if query.Field9 is > 200 and query.Field9 < 350:
    p2 = query.Field9
    print('{}{}{}{}'.format(p1, p2, query.name, query.email, )) ##or output to the jinja templates

    or lets say we had alot more processing to do than the above (but all done in python). would this be something you would not want to do?
    does this mean that the whole database of our imaginary few million records will be interated through
    and the rest of the logic is done in python?

    does sql work more optimally? like instead of iterating through a few million records it does whatever magic it does?
    how would you print(str(query)) on this? or is there no query made since it will iterate through the entire database
    and the filter and logic done in python. or the sql would look like FROM all SELECT all (or something like that lol)

    is this more computationally expensive to do?
    would it be better to fit it through the flask-sqlalchemy module to handle?

  • #15 Miguel Grinberg said

    @luckystripe: If the name and field9 columns in your database are indexed, then the database would not need to review every single of those million records to find those that satisfy the conditions (relational indexes are structures that are optimized to do fast searches). Also consider that Python is in general a slow language, while most databases are written in highly optimized C or assembly code. So yes, for a well designed database, SQL queries would be way faster than queries implemented in Python. No doubt about it.

    If you want to print the records that satisfy your condition, then run the SQL query then iterate over the results and print them. All you need to do in the Python side is just print every record that is returned by the query.

  • #16 Bernardo Gomes de Abreu said

    Good article ! Keep sharing !!

    A simple sugestion to you write to your readers is a example when you need use aliases in the querie (Use the same table twice to do the querie with filters on both tables.).

  • #17 Rustam said

    Flask-SQLAlchemy makes working with sessions much easier

    Please share some tutorials supporting your statement above.
    Thanks!

  • #18 Miguel Grinberg said

    @Rustam: sorry but I don't understand what you are asking. All you need to do to see what I mean is try to use Flask with SQLAlchemy without using Flask-SQLAlchemy.

  • #19 D said

    Thank you - this has been quite helpful! I have also found your video on setting up pycharm community edition with a flask project EXTREMELY helpful.

  • #20 FBN said

    You can use "modern" SQL to get an answer with a single query with the same performance or even better.
    I pretty sure you can translate this to SQLAlchemy syntax, but for complex report queries I'm still not convinced if it's worth the effort.

    Here is the same query using a window function on the 'order by' clause:

    SELECT
    id,
    customer_id,
    order_date
    FROM
    orders
    ORDER BY
    max(order_date) OVER (PARTITION BY customer_id) DESC,
    order_date DESC

  • #21 FBN said

    Here is the single query in SQLAlchemy syntax.
    Thank you for the post and for the interesting problem.

    query = Order.query.order_by(
        db.func.max(Order.order_date).over(partition_by=Order.customer_id).desc(),
        Order.order_date.desc(),
    )
    
  • #22 Krisztian said

    Hi @Miguel Grinberg,

    I believe you haven't accounted for the case where two separate customers would have the same last_order_date.
    A fix would be to order 2nd on customer_id and then 3rd on order_date.

    Krisztian

  • #23 Skygen said

    Thank you, Miguel!

  • #24 Kurt said

    Hi Miguel, I'm trying to display two columns from different tables. In plain old sql it would translate to
    SELECT t1.CompanyName, t2.UserEmail FROM Company AS t1, User AS t2 WHERE t1.CompanyId = t2.CompanyId AND t1.CompanyName LIKE %store%
    (my query and number of columns is in reality a bit bigger than that, but the principle is the same)
    So i have tried to translate this into sqalchemy orm with something like this query = db.session.query(Company, User).columns(Company.Name, User.email).join(User, User.Companyid == Company.id).filter(Company.CompanyName.like(f'%{sSearch}%'))
    where sSearch is the variable which would contain e.g. 'store' or whathever one would type into the datatable search box. However, sqalchemy only returns the company id instead of the users email. I've tried a couple other combinations that I've found on the sqalchemy site and other places, but it always only return the id instead of the email. Thought this would be a quite common thing to do, but really struggling to find any good examples in the sqalchemy site as well as elsewhere. Most examples show joins where all the returned columns are from the same table. Do you manage to see what I do wrong?
    Kind regards
    Kurt

  • #25 Miguel Grinberg said

    @Kurt: I don't really see what's wrong, sorry. But you can debug this query. You can do print(str(query)) to see what is the SQL that was generated by the ORM, and keep fine tuning things until you get exactly what you need.

Leave a Comment