One of the most rewarding aspects of having a popular course online is that from time to time I get a question that forces me to learn something new. The other day a reader asked me how they can write a database query with an unusual ordering, and I had to stop for a minute (okay, it was more like half an hour) to think about how to do it within the context of a Flask and Flask-SQLAlchemy application. Are you ready to see some advanced SQLAlchemy action?
This reader had a database with an orders table that looked more or less like this:
The question was about getting the items of this table sorted by the
customer_id field, but instead of using a simple ascending or descending order, my reader needed the list to be sorted based on how long ago a customer placed their last order, starting from the most recent customer.
So basically, the desired order was this:
In this table, the entries for customer
2 appear first, because when you look at the original table, you can see that this customer placed the most recent order on February 7th, which is the one with id
9. This is the last order stored in the table, so this customer is the most recent and for that reason it goes first with its two orders. The entries are sorted in descending order by the order date.
The next group of entries are for customer
1, because this customer is the second most recent customer, with an order on February 6th. Once again, the three orders for this customer are next to each other, and sorted in descending order by their dates.
3 is the oldest of the three to place an order, with the most recent order on February 1st. The four orders from this customer come at the bottom, again sorted in descending order by the order date.
I couldn't immediately come up with a SQLAlchemy query that can do this. Before you continue reading, do you want to see if you can figure this out on your own? To make it easier for you, I have created a gist on GitHub with a fully working example application and database that you can use. All you need to do is grab the file and write your query in the place indicated by comments!
Open the example gist (on a separate tab)
To use this application you need to create a virtual environment and install flask-sqlalchemy on it. Then just run the script. In this application I'm using an in-memory SQLite database, so you don't need to worry about creating databases, each time you run the script a brand new and clean database is created.
If you want to see how I solved this problem, read on.
The solution to the above problem cannot be done with a simple query (at least I don't think it can, but would love to be proved wrong!). To be able to sort the rows as required we'll need to create two queries and combine them.
The first part of the solution is to figure out the order in which the customers need to appear in the query. For this, we basically need to look at the last order for each customer. An easy way to do that is to condense or group the orders table.
In relational databases, a grouping operation looks at the values of a certain column and all the rows that have the same value are collapsed into a single row in a temporary grouped table. For our example, when we group by the
customer_id field we will end up with a grouped table that has three rows, one per customer.
The tricky part is, however, how to consolidate all the rows that have the same
customer_id into a single row. For this, SQL provides aggregate functions, which are functions that take a list of values and produce a single result. Common aggregate functions available in all SQL implementations are
max and a few others. In our case, we want to know the date of the last order for each customer, so we can use
max(order_date) to create our grouped table.
I'm not sure if this is true for other SQLAlchemy developers, but for non-trivial queries I often find it easier to come up with a solution using raw SQL, and then once I have it I adapt it to SQLAlchemy. In SQL, the grouping is done with the following statement:
SELECT customer_id, max(order_date) AS last_order_date FROM orders GROUP BY customer_id
This query will consolidate all the rows that have the same
customer_id into one, and for the
order_date column it will calculate the maximum of all the rows that are being grouped into one. The original table also has an
id column with the order primary key, but you can see that I'm not making a reference to this column in the
SELECT portion of this query, so that column will not be included in the grouped results, and this is what I want since I really have no sensible way to aggregate the
If you were to run the above query on the example data I provided, the result would be:
Now that we have a functioning SQL query, we can convert it to Flask-SQLAlchemy:
last_orders = db.session.query( Order.customer_id, db.func.max(Order.order_date).label('last_order_date') ).group_by(Order.customer_id).subquery()
If you use plain SQLAlchemy, you would replace the above
db.session with your own session object. Flask-SQLAlchemy makes working with sessions much easier, so I always prefer to use this extension.
If you are used to run queries that begin with
Order.query.something, or maybe
session.query(Order).something when using plain SQLAlchemy, then the above must look strange. Note in the raw SQL example that I have two entities in the
SELECT part. The standard SQLAlchemy queries that you are used to see are a simplified form that is appropriate when you are querying entire rows from a single table. If the query is supposed to return results that come from two or more tables, or like in this case, a combination of real and aggregate columns, then you have to use this more verbose form, which specifies all the things you need the query to return as arguments to the
The two arguments to
session.query() are directly translated from the SQL's
SELECT, which are the
customer_id column and the
max(order_date) aggregate. Note the
AS last_order_date part of the SQL statement, which gives an alias to the aggregate column. In SQLAlchemy, the
label() method applied to a column achieves the same result.
The grouping is done with the
group_by() query method, which takes the column to use for the grouping as an argument, same as the
GROUP BY counterpart in SQL.
The statement ends by calling
subquery(), which tells SQLAlchemy that our intention for this query is to use it inside a bigger query instead of on its own. In practical terms, this means that we will be able to treat the subquery object as if it was a real database table, while in fact it is a temporary table that is created on the fly. You will see how that works in the next section.
Now that we know the order in which we want the customers to be returned, we have to incorporate that order into the original table. And the most straightforward way to do this is by first merging the
orders table with the
last_orders subquery obtained in the previous section.
To merge two tables in a relational database we use a join operation. The join will take each row in the
orders table, match it against a corresponding row in the
last_orders subquery and finally generate a new combined row that has the columns of both tables. The result of the join operation will be another on-the-fly table. With raw SQL, the join against the subquery would be done as follows:
SELECT * FROM orders JOIN ( SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id ) AS last_orders ON orders.customer_id = last_orders.customer_id
Here we have the subquery from the previous section inside a
JOIN( ... ) AS name construct. The
last_orders name is then mapped to the subquery results. Other parts of the query can then use this name to reference individual columns of those results. The
ON part specifies what is the criteria to join the two tables, and in this case it is a simple one, just match rows that have the same
In our example the join would return the following data:
Now that we have the last order date for each customer joined with the orders, we can sort the table by this virtual
last_order_date column in descending order, and that satisfies the first sorting criteria in our problem statement:
SELECT * FROM orders JOIN ( SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id ) AS last_orders ON orders.customer_id = last_orders.customer_id ORDER BY last_order_date DESC
But we are not done yet, as there is a secondary ordering that we need to implement. Within each customer, we need to provide the results ordered by order date in descending order. This can be done with an additional ordering by the original
Here is the complete SQL statement that returns the data in the requested order:
SELECT * FROM orders JOIN ( SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id ) AS last_orders ON orders.customer_id = last_orders.customer_id ORDER BY last_order_date DESC, orders.order_date DESC
The translation to SQLAlchemy is fairly direct, but we are going to separate the subquery to avoid having too much complexity in a single statement. Here is the SQLAlchemy version of the above query:
last_orders = db.session.query( Order.customer_id, db.func.max(Order.order_date).label('last_order_date') ).group_by(Order.customer_id).subquery() query = Order.query.join( last_orders, Order.customer_id == last_orders.c.customer_id ).order_by(last_orders.c.last_order_date.desc(), Order.order_date.desc())
last_orders subquery is a copy of the subquery I presented in the previous section. Note that at this point nothing is sent to the database yet, storing the subquery in a local variable ahead of time does not trigger an additional database query.
In the second statement we take the Order model and join it against the
last_orders subquery. The subquery object works in a similar way to a SQLAlchemy table, so we can reference individual columns with the
table.c.column_name syntax. The
c confuses a lot of people, SQLAlchemy uses this unfortunately odd name as a container for columns in table objects.
join() method takes two arguments, first the right-side table in the join (our
last_orders subquery) and then the condition for the join, which is that the
customer_id columns in both tables match. With the join in place the order can be specified and this is a direct translation of the two order statements from the SQL example. Note how the virtual
last_order_date column is referenced with that weird
last_orders.c.last_order_date, but then
order_date in the
Order model is referenced as a direct attribute. The difference here is that
Order is a model, while
last_orders is a generic table with results. Models have a higher level interface than tables, so they are nicer to use.
As final exercise, I wanted to see how my handcrafted SQL compares against the SQL generated by SQLAlchemy using the query above. In case you don't know this trick, you can obtain the SQL that SQLAlchemy generates for any query object just by casting the query to a string:
The SQLAlchemy query above generates the following raw SQL:
SELECT orders.id AS orders_id, orders.customer_id AS orders_customer_id, orders.order_date AS orders_order_date FROM orders JOIN ( SELECT orders.customer_id AS customer_id, max(orders.order_date) AS last_order_date FROM orders GROUP BY orders.customer_id ) AS anon_1 ON orders.customer_id = anon_1.customer_id ORDER BY anon_1.last_order_date DESC, orders.order_date DESC
If you ignore the slightly increased verbosity of this generated statement, things are pretty much the same. SQLAlchemy likes to create an alias for every column in the query, so you see that the
AS name construct is used a lot. The subquery is identical to the raw SQL one but SQLAlchemy lacking context gives it a generic
anon_1 name, versus my more explicit
Other than that the queries are the same, so I'm pretty satisfied with this solution. Do you know a better or more efficient way to get the same result? Let me know below in the comments.