2021-09-01T13:23:29Z

Optimizing Slow SQL Queries

Most database problems go unnoticed during development, because we tend to code our applications using small datasets. It is only after the application has been in production for some time that database performance issues start to appear, causing parts of the application to become slower and slower as the database continues to grow.

How do you debug and identify this type of problem? In this article I'm going to show you how to fix the most common database performance problems, which are those caused by improper indexing. Examples for Postgres, MySQL and SQLite are included!

This article was voted for by my supporters on Patreon. Would you like to support my work, and as a thank you be able to vote on my future articles and have access to a chat room where I hang out and answer questions? Become a Patron!

How the Database Resolves a Query

To be able to optimize database queries, it is useful to first review how databases implement the resolution of a query.

Once the SQL code for the query is parsed, a query optimizer will look at it and come up with a sequence of operations that need to be performed to produce the results that are requested. For a simple query, this may be a single operation. Complex queries that involve joins, grouping or similar constructs require several operations, each building on the results of the previous one.

There are many different primitive operations, but if we only look at those that retrieve data, they can be divided into two main groups: operations that look for the required data by reading sequentially (sometimes called "sequential scans" or "table scans"), and operations that look for data by searching an index (often called "index searches", or "index scans").

Are sequential scans slower than index scans? In general yes, but there is no absolute answer. For tables that have a small number of rows, the performance of scans versus index searches is not that different, and depending on the particular case one may perform slightly better than the other. It isn't uncommon for the database to prefer a sequential scan on a small table even when an index is available.

But for tables that are large, sequential scans are often at the core of query performance problems, because the larger the table gets, the slower the scan operation will be. To enable the database to give you its best performance you have to make sure that you configure your tables with the proper indexes, and in this way the query optimizer will have more options to work with.
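As a rough analogy (this is plain Python, not how a database actually stores data), a sequential scan is a linear pass over the rows, while a B-tree index search behaves like a binary search over a sorted list of keys:

```python
import bisect

# Hypothetical "table": rows keyed by slug, stored in insertion order.
rows = [{"slug": f"post-{i:05d}", "title": f"Post {i}"} for i in range(10_000)]

def sequential_scan(rows, slug):
    """O(n): examine every row until a match is found."""
    for row in rows:
        if row["slug"] == slug:
            return row
    return None

# An "index" is (roughly) a sorted list of keys with pointers to the rows.
index = sorted((row["slug"], pos) for pos, row in enumerate(rows))
keys = [key for key, _ in index]

def index_search(rows, slug):
    """O(log n): binary-search the index, then fetch the one matching row."""
    i = bisect.bisect_left(keys, slug)
    if i < len(keys) and keys[i] == slug:
        return rows[index[i][1]]
    return None

print(index_search(rows, "post-01234")["title"])  # → "Post 1234"
```

On 10,000 rows the linear pass examines up to 10,000 entries, while the binary search needs about 14 comparisons, and this gap keeps widening as the table grows. That is, in essence, why large sequential scans dominate query times.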

So as you see, the key to optimizing your slow queries is to find the long sequential scans the database is performing, and to add the proper indexes to the affected tables so that the database can replace those scans with equivalent operations that use indexes.

Identifying Slow Queries

Unfortunately there is no global way to assess your database to identify what indexes you can add to increase performance. The indexes that you need are directly related to the queries that you make, so the process starts by identifying the queries that perform poorly in your application.

How can you find slow queries? There are a number of strategies. The simplest one is to listen to your users. A coworker the other day told me that some requests he was making to a Flask API that I maintain were taking about 20 seconds to return. Once I obtained the exact details for these requests I was able to identify the query that needed to be optimized.

Another option: many databases and database frameworks can be configured to log queries that take more than a specified amount of time. This is great to enable on a production deployment. Every once in a while you can review your slow query log to see if there are any queries that need to be investigated.
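For example, Postgres has the log_min_duration_statement setting and MySQL has its slow query log. The same idea also works at the application level. Here is a minimal sketch of such a wrapper (the function name and the 100 ms threshold are my own choices, shown with Python's sqlite3 module):

```python
import logging
import sqlite3
import time

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("slow_queries")

SLOW_QUERY_THRESHOLD = 0.1  # seconds; tune for your application

def timed_execute(cursor, sql, params=()):
    """Run a statement, logging it if it takes longer than the threshold."""
    start = time.monotonic()
    result = cursor.execute(sql, params)
    elapsed = time.monotonic() - start
    if elapsed > SLOW_QUERY_THRESHOLD:
        logger.warning("slow query (%.3fs): %s", elapsed, sql)
    return result

db = sqlite3.connect(":memory:")
cur = db.cursor()
timed_execute(cur, "CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT)")
timed_execute(cur, "SELECT * FROM article")
```

In a real application you would route these warnings to your normal logging infrastructure, so the slow query log accumulates on its own while the application runs.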

How to Find and Fix Sequential Scans

Once you have a target query that needs to be optimized, you can move on to the next step, which is to find if there are any slow sequential scans in the query that can be eliminated by adding an index.

The good news is that given a query, you can ask your relational database to tell you what operations it needs to perform to resolve it. This is done with the EXPLAIN SQL keyword which, while not part of the SQL standard, is supported in some form by most relational databases.

The three databases that I normally work with are MySQL, Postgres and SQLite and all three are able to "explain" queries to help you debug performance issues.

An Example Database Schema

What I'm going to do next is show you a small schema that I will then use to create Postgres, MySQL and SQLite databases. I'll finally show how to analyze some database queries in each of them.

Since I normally work with Python, I'm going to show you the schema that I'm going to use as required by the SQLAlchemy database framework and ORM. Even if you don't work with Python, these classes should be very easy to understand.

The example database keeps track of blog articles and their authors. Here is its definition:

class Article(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    slug = db.Column(db.String(256), index=True, unique=True, nullable=False)
    title = db.Column(db.String(256))
    author_id = db.Column(db.Integer, db.ForeignKey('author.id'))


class Author(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(256), index=True, unique=True, nullable=False)

Note that in this schema, the Article.slug column is indexed, while the Article.title column is not. This is important, because it will allow us to see how indexed and non-indexed column searches are handled in the different databases.

Postgres

Before I begin with the performance analysis under Postgres, here is how SQLAlchemy generated the Postgres tables that correspond to the classes I showed you in the previous section:

test=# \dt
             List of relations
 Schema |      Name      | Type  |  Owner
--------+----------------+-------+----------
 public | article        | table | postgres
 public | author         | table | postgres

test=# \d article
                                         Table "public.article"
      Column       |          Type          | Collation | Nullable |               Default
-------------------+------------------------+-----------+----------+-------------------------------------
 id                | integer                |           | not null | nextval('article_id_seq'::regclass)
 slug              | character varying(256) |           | not null |
 title             | character varying(256) |           |          |
 author_id         | integer                |           |          |
Indexes:
    "article_pkey" PRIMARY KEY, btree (id)
    "ix_article_slug" UNIQUE, btree (slug)
Foreign-key constraints:
    "article_author_id_fkey" FOREIGN KEY (author_id) REFERENCES author(id)

test=# \d author
                                    Table "public.author"
 Column |          Type          | Collation | Nullable |              Default
--------+------------------------+-----------+----------+------------------------------------
 id     | integer                |           | not null | nextval('author_id_seq'::regclass)
 name   | character varying(256) |           | not null |
Indexes:
    "author_pkey" PRIMARY KEY, btree (id)
    "ix_author_name" UNIQUE, btree (name)
Referenced by:
    TABLE "article" CONSTRAINT "article_author_id_fkey" FOREIGN KEY (author_id) REFERENCES author(id)

Optimizing Simple Queries

Using the article table, I can get the first 5 articles in alphabetical slug order with the following query:

SELECT * FROM article ORDER BY article.slug LIMIT 5;

Let's use EXPLAIN on this query, simply by adding it as a prefix:

test=# EXPLAIN SELECT * FROM article ORDER BY article.slug LIMIT 5;

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=0.27..1.26 rows=5 width=139)
   ->  Index Scan using ix_article_slug on article  (cost=0.27..43.38 rows=217 width=139)
(2 rows)

There are two operations required by this query. The Limit operation is not really that interesting, but you can see that this operation depends on an index scan based on the index I have defined on the slug column, which is used to retrieve those first five slugs in alphabetical order. This is actually good; an index scan is fairly efficient compared to a sequential scan.

Postgres has an operation that is even more performant than the index scan, which is the "index only scan". This one is used when the data that the query needs can be obtained directly from the index itself, so there is no need to read any data from the table after the index was searched. The following example queries just the slugs instead of all the columns of the table, and since the slug is the indexed column Postgres decides to use the most efficient operation:

test=# EXPLAIN SELECT slug FROM article ORDER BY article.slug LIMIT 5;

                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Limit  (cost=0.27..1.26 rows=5 width=49)
   ->  Index Only Scan using ix_article_slug on article  (cost=0.27..43.38 rows=217 width=49)
(2 rows)

What happens if you evaluate a similar query but using the title column, which is not indexed? Here is the result:

test=# EXPLAIN SELECT * FROM article ORDER BY article.title LIMIT 5;

                              QUERY PLAN
-----------------------------------------------------------------------
 Limit  (cost=10.77..10.79 rows=5 width=139)
   ->  Sort  (cost=10.77..11.32 rows=217 width=139)
         Sort Key: title
         ->  Seq Scan on article  (cost=0.00..7.17 rows=217 width=139)
(4 rows)

You can see that now the query has a third operation, a sort. In the previous case the alphabetical ordering came directly from the index, but title is not indexed, so the sort in this query must happen in memory. Reading the plan from bottom to top: first a sequential scan brings all the rows into memory, then the rows are sorted, and finally the first five are returned.

This query will become slower and slower as the number of rows in the article table grows. In the example above you can see that I have 217 rows in my table. Imagine how much more work it would take to run this query with 10,000 articles instead, all having to be read into memory and sorted, only to return the first five and discard the rest.

Optimizing Joins

Let's review what happens with a query that uses a join. The following query returns the first five articles and the authors, sorted alphabetically by the author's name:

test=# EXPLAIN SELECT article.*, author.name FROM article, author WHERE article.author_id = author.id ORDER BY author.name LIMIT 5;

                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=0.14..8.36 rows=5 width=153)
   ->  Nested Loop  (cost=0.14..356.66 rows=217 width=153)
         Join Filter: (article.author_id = author.id)
         ->  Index Scan using ix_author_name on author  (cost=0.14..13.69 rows=103 width=18)
         ->  Materialize  (cost=0.00..8.25 rows=217 width=139)
               ->  Seq Scan on article  (cost=0.00..7.17 rows=217 width=139)
(6 rows)

If you prefer the alternative way of defining this query using JOIN, rest assured that it gives an identical result:

test=# EXPLAIN SELECT article.*, author.name FROM article JOIN author ON article.author_id = author.id ORDER BY author.name LIMIT 5;

                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=0.14..8.36 rows=5 width=153)
   ->  Nested Loop  (cost=0.14..356.66 rows=217 width=153)
         Join Filter: (article.author_id = author.id)
         ->  Index Scan using ix_author_name on author  (cost=0.14..13.69 rows=103 width=18)
         ->  Materialize  (cost=0.00..8.25 rows=217 width=139)
               ->  Seq Scan on article  (cost=0.00..7.17 rows=217 width=139)
(6 rows)

If we start reading from the bottom, the bad news here is that we start with a sequential scan of the article table. The results of the scan are "materialized" as an in-memory table. The join operation then combines this data with an index scan of the name column of the author table. The join in this case is evaluated with the "nested loop" algorithm, which is the most basic (and least performant) join operation in Postgres.

You may be wondering why Postgres resorts to a sequential scan of all the articles, when it could have used the author_id foreign key. The answer may surprise you: many databases (Postgres among them) do not automatically create an index for foreign keys! Primary keys are automatically indexed, but foreign keys must be explicitly indexed if you need them to be efficient.

To optimize this query, we can add an index on the author_id foreign key. In the SQLAlchemy schema in Python, this would be:

class Article(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    slug = db.Column(db.String(256), index=True, unique=True, nullable=False)
    title = db.Column(db.String(256))
    author_id = db.Column(db.Integer, db.ForeignKey('author.id'), index=True)  # <-- add an index here

Or directly in SQL:

CREATE INDEX ix_article_author_id ON article (author_id);

And now we can repeat the query and get a much better resolution:

test=# EXPLAIN SELECT article.*, author.name FROM article, author WHERE article.author_id = author.id ORDER BY author.name LIMIT 5;

                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..1.72 rows=5 width=153)
   ->  Nested Loop  (cost=0.29..62.29 rows=217 width=153)
         ->  Index Scan using ix_author_name on author  (cost=0.14..13.69 rows=103 width=18)
         ->  Index Scan using ix_article_author_id on article  (cost=0.14..0.45 rows=2 width=139)
               Index Cond: (author_id = author.id)
(5 rows)

This is indeed better, but I was hoping Postgres would use a more efficient join algorithm than the nested loop. I think in this case it decided that a nested loop was the best option because in the end we are looking for just five entries. You can see that the scan of the new index on the author_id foreign key only needed to look at 2 rows out of the 217 in the table, so the nested loop does not end up being very expensive. If my tables had many more rows, or if I had asked for more than five results, the resolution of the query might have been different. For this reason it is very important that you run EXPLAIN on the specific queries that are slow, and that you do the analysis on the actual database (or a mirror of it). Postgres also supports EXPLAIN ANALYZE, which actually executes the query and reports real row counts and timings along with the estimates.

MySQL

To continue, let's now use the same schema on a MySQL database. Here is how SQLAlchemy generated the MySQL database:

mysql> DESCRIBE article;
+-------------------+--------------+------+-----+---------+----------------+
| Field             | Type         | Null | Key | Default | Extra          |
+-------------------+--------------+------+-----+---------+----------------+
| id                | int          | NO   | PRI | NULL    | auto_increment |
| slug              | varchar(256) | NO   | UNI | NULL    |                |
| title             | varchar(256) | YES  |     | NULL    |                |
| author_id         | int          | YES  | MUL | NULL    |                |
+-------------------+--------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)

mysql> SHOW INDEX FROM article;
+---------+------------+------------------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table   | Non_unique | Key_name                     | Seq_in_index | Column_name       | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+---------+------------+------------------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| article |          0 | PRIMARY                      |            1 | id                | A         |         217 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| article |          0 | ix_article_slug              |            1 | slug              | A         |         217 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| article |          1 | author_id                    |            1 | author_id         | A         |         104 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
+---------+------------+------------------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
3 rows in set (0.01 sec)

mysql> DESCRIBE author;
+-------+--------------+------+-----+---------+----------------+
| Field | Type         | Null | Key | Default | Extra          |
+-------+--------------+------+-----+---------+----------------+
| id    | int          | NO   | PRI | NULL    | auto_increment |
| name  | varchar(256) | NO   | UNI | NULL    |                |
+-------+--------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)

mysql> SHOW INDEX FROM author;
+--------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table  | Non_unique | Key_name       | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| author |          0 | PRIMARY        |            1 | id          | A         |         103 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| author |          0 | ix_author_name |            1 | name        | A         |         103 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
2 rows in set (0.01 sec)

Optimizing Simple Queries

Starting from the same simple query we used with Postgres, here is how to select the first five articles sorted by the article slug:

SELECT * FROM article ORDER BY article.slug LIMIT 5;

Let's see what MySQL thinks of this query with the EXPLAIN keyword prepended:

mysql> EXPLAIN SELECT * FROM article ORDER BY article.slug LIMIT 5;
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+----------------+
| id | select_type | table   | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra          |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+----------------+
|  1 | SIMPLE      | article | NULL       | ALL  | NULL          | NULL | NULL    | NULL |  217 |   100.00 | Using filesort |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+----------------+
1 row in set, 1 warning (0.00 sec)

The "filesort" operation name in MySQL is confusing: it does not necessarily involve files, it simply means that the entries need to be sorted, as opposed to indexed data, which does not need sorting because the index keeps the keys in order.

So this is disappointing: MySQL is reading all entries in the article table and sorting them by their slug, even though the article.slug column is indexed. I'm not really sure why this is the case, but I suspect that at this scale (217 rows) MySQL decided it is more efficient to sort the entire table than to read the index for the first five entries and then fetch those rows from the table.

Let's see if we can convince MySQL to use the article.slug index. In the next query, instead of requesting all the columns of the table, I'm asking just for the slug. This is a query that can be resolved entirely from data stored in the index, so I expect MySQL will prefer the index this time:

mysql> EXPLAIN SELECT slug FROM article ORDER BY article.slug LIMIT 5;
+----+-------------+---------+------------+-------+---------------+-----------------+---------+------+------+----------+-------------+
| id | select_type | table   | partitions | type  | possible_keys | key             | key_len | ref  | rows | filtered | Extra       |
+----+-------------+---------+------------+-------+---------------+-----------------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | article | NULL       | index | NULL          | ix_article_slug | 1026    | NULL |    5 |   100.00 | Using index |
+----+-------------+---------+------------+-------+---------------+-----------------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

And yes, this time it did prefer the index.

By now you can probably guess what would happen if we sort by the title column, which is not indexed:

mysql> EXPLAIN SELECT * FROM article ORDER BY article.title LIMIT 5;
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+----------------+
| id | select_type | table   | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra          |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+----------------+
|  1 | SIMPLE      | article | NULL       | ALL  | NULL          | NULL | NULL    | NULL |  217 |   100.00 | Using filesort |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+----------------+
1 row in set, 1 warning (0.00 sec)

Here MySQL falls back to the filesort algorithm, once again.

Optimizing Joins

Let's now look at a join. Like we did for Postgres, the following query retrieves the first five articles and the name of the authors, in alphabetical order by the author's name:

mysql> EXPLAIN SELECT article.*, author.name FROM article, author WHERE article.author_id = author.id ORDER BY author.name LIMIT 5;
+----+-------------+---------+------------+-------+----------------------+----------------------+---------+----------------+------+----------+-------------+
| id | select_type | table   | partitions | type  | possible_keys        | key                  | key_len | ref            | rows | filtered | Extra       |
+----+-------------+---------+------------+-------+----------------------+----------------------+---------+----------------+------+----------+-------------+
|  1 | SIMPLE      | author  | NULL       | index | PRIMARY              | ix_author_name       | 1026    | NULL           |    5 |   100.00 | Using index |
|  1 | SIMPLE      | article | NULL       | ref   | author_id            | author_id            | 5       | test.author.id |    2 |   100.00 | NULL        |
+----+-------------+---------+------------+-------+----------------------+----------------------+---------+----------------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

If you prefer the JOIN version of this query:

mysql> EXPLAIN SELECT article.*, author.name FROM article JOIN author ON article.author_id = author.id ORDER BY author.name LIMIT 5;
+----+-------------+---------+------------+-------+-----------------------------------+----------------+---------+----------------+------+----------+-------------+
| id | select_type | table   | partitions | type  | possible_keys                     | key            | key_len | ref            | rows | filtered | Extra       |
+----+-------------+---------+------------+-------+-----------------------------------+----------------+---------+----------------+------+----------+-------------+
|  1 | SIMPLE      | author  | NULL       | index | PRIMARY                           | ix_author_name | 1026    | NULL           |    5 |   100.00 | Using index |
|  1 | SIMPLE      | article | NULL       | ref   | author_id,author_id_2,author_id_3 | author_id      | 5       | test.author.id |    2 |   100.00 | NULL        |
+----+-------------+---------+------------+-------+-----------------------------------+----------------+---------+----------------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

As you can see, for MySQL this query is efficient already. Unlike Postgres, MySQL automatically indexes foreign keys (when using the InnoDB engine), so there is nothing to optimize here.

SQLite

As a final exercise, let's look at how the same queries evaluate in a SQLite database. Here is the schema that was generated by SQLAlchemy:

sqlite> .schema
CREATE TABLE author (
        id INTEGER NOT NULL,
        name VARCHAR(256) NOT NULL,
        PRIMARY KEY (id)
);
CREATE UNIQUE INDEX ix_author_name ON author (name);
CREATE TABLE article (
        id INTEGER NOT NULL,
        slug VARCHAR(256) NOT NULL,
        title VARCHAR(256),
        author_id INTEGER,
        PRIMARY KEY (id),
        FOREIGN KEY(author_id) REFERENCES author (id)
);
CREATE UNIQUE INDEX ix_article_slug ON article (slug);

Optimizing Simple Queries

As we did with Postgres and MySQL, we are going to start with a simple query that returns the first five articles sorted alphabetically by their slugs. This is the SQL query:

SELECT * FROM article ORDER BY article.slug LIMIT 5;

The SQLite database has two different ways to "explain" a query. The EXPLAIN directive outputs a very detailed list of low-level virtual machine operations that will be performed to carry out the query, while the EXPLAIN QUERY PLAN directive produces a high-level listing of the operations, more along the lines of the reports you've seen for Postgres and MySQL. When working with SQLite I always use the latter.

sqlite> EXPLAIN QUERY PLAN SELECT * FROM article ORDER BY article.slug LIMIT 5;
QUERY PLAN
`--SCAN TABLE article USING INDEX ix_article_slug

We can see in this query that SQLite scans the article table using the index on the slug column. This means that the index will be consulted to find those first five slugs, and then only those five entries will be read from the actual table. This is actually very good.

We've seen that Postgres optimized the query even more when the information can be obtained directly from the index, without having to read the table. Does SQLite have the same optimization? Only one way to find out:

sqlite> EXPLAIN QUERY PLAN SELECT slug FROM article ORDER BY article.slug LIMIT 5;
QUERY PLAN
`--SCAN TABLE article USING COVERING INDEX ix_article_slug

Here we are returning only the slugs, and since these are stored in the index, SQLite saves time and does not read the database table. SQLite calls this a covering index.
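If you drive SQLite from Python, you don't need the sqlite3 shell to see these plans; the same information can be queried programmatically. Here is a minimal sketch using the standard library (note that the exact wording of the plan text varies between SQLite versions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE article (
        id INTEGER NOT NULL,
        slug VARCHAR(256) NOT NULL,
        title VARCHAR(256),
        PRIMARY KEY (id)
    );
    CREATE UNIQUE INDEX ix_article_slug ON article (slug);
""")

# Each EXPLAIN QUERY PLAN row is a tuple whose last item is the detail text.
plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT slug FROM article ORDER BY slug LIMIT 5"
).fetchall()
for row in plan:
    print(row[-1])  # e.g. "SCAN article USING COVERING INDEX ix_article_slug"
```

This makes it easy to check plans from a test suite, for example to assert that an important query keeps using its covering index.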

What happens if we try to sort by the title column instead, which is not indexed?

sqlite> EXPLAIN QUERY PLAN SELECT * FROM article ORDER BY article.title LIMIT 5;
QUERY PLAN
|--SCAN TABLE article
`--USE TEMP B-TREE FOR ORDER BY

You can see here that the table scan does not use an index, and that SQLite must perform a sort in memory or disk (depending on size), which it implements by creating a temporary b-tree. To optimize this query you would simply add an index for the title column.
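We can verify this fix directly from Python with the sqlite3 module (ix_article_title is a name I chose here; any index on title works):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, slug TEXT, title TEXT)")

QUERY = "EXPLAIN QUERY PLAN SELECT * FROM article ORDER BY title LIMIT 5"

def plan_text(db, query):
    """Concatenate the detail column of each EXPLAIN QUERY PLAN row."""
    return " | ".join(row[-1] for row in db.execute(query))

print(plan_text(db, QUERY))  # includes "USE TEMP B-TREE FOR ORDER BY"

db.execute("CREATE INDEX ix_article_title ON article (title)")
print(plan_text(db, QUERY))  # the sort step is gone: the index provides the order
```

With the index in place, SQLite reads the rows in title order straight from the b-tree, so the temporary sort structure is no longer needed.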

Optimizing Joins

Next we are going to look at how a join is evaluated. As we did with Postgres and MySQL, the next query retrieves the first five articles and the names of the authors, in alphabetical order by those names:

sqlite> EXPLAIN QUERY PLAN SELECT article.*, author.name FROM article, author WHERE article.author_id = author.id ORDER BY author.name LIMIT 5;
QUERY PLAN
|--SCAN TABLE article
|--SEARCH TABLE author USING INTEGER PRIMARY KEY (rowid=?)
`--USE TEMP B-TREE FOR ORDER BY

Below is the alternative syntax using JOIN, which should be equivalent:

sqlite> EXPLAIN QUERY PLAN SELECT article.*, author.name FROM article JOIN author ON article.author_id = author.id ORDER BY author.name LIMIT 5;
QUERY PLAN
|--SCAN TABLE article
|--SEARCH TABLE author USING INTEGER PRIMARY KEY (rowid=?)
`--USE TEMP B-TREE FOR ORDER BY

Here SQLite starts by scanning the article table without an index, then performs the join with the author table by searching its primary key (which is always indexed), and finally sorts the results with a temporary b-tree to obtain the first five elements.

This isn't really that great, so we should see if adding an index on the author_id column, which is used in the join condition, improves things. In SQLAlchemy this means adding index=True to the author_id column:

class Article(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    slug = db.Column(db.String(256), index=True, unique=True, nullable=False)
    title = db.Column(db.String(256))
    author_id = db.Column(db.Integer, db.ForeignKey('author.id'), index=True)  # <-- add an index here

Using SQL, you can achieve the same thing with this statement:

CREATE INDEX ix_article_author_id ON article (author_id);

Unfortunately adding an index did not really change the way the query is evaluated:

sqlite> EXPLAIN QUERY PLAN SELECT article.*, author.name FROM article, author WHERE article.author_id = author.id ORDER BY author.name LIMIT 5;
QUERY PLAN
|--SCAN TABLE article
|--SEARCH TABLE author USING INTEGER PRIMARY KEY (rowid=?)
`--USE TEMP B-TREE FOR ORDER BY

To understand this I have found the SQLite documentation on how joins are resolved useful: in SQLite, joins are always implemented as nested loops, and only the inner loop can be optimized with an index, so for SQLite this is as good as it gets.

Slow Queries without Sequential Scans

While this isn't the topic of this article, you may find that you have a complex query that performs badly even after you made sure that all operations are backed by indexes. This can happen for queries that require joining a large number of tables, for example, simply because the database needs to read data from a lot of different locations to create the desired results. What do you do to optimize such a query?

When a query that doesn't have any large scans is slow, it is an indication that the structure of your database might be making it hard to resolve the query efficiently. I don't mean to say that you have structured your database incorrectly. In fact, it is the contrary! The relational model promotes the creation of many tables, each holding very specific information plus links to other tables. This comes at the cost of making queries that need to combine all these bits of data into a single set of results more expensive. Sometimes following the relational model to the letter is actually the problem.

A process called denormalization can be used to duplicate some data in a relational database with the purpose of simplifying queries. Of course this makes writing the data more expensive and complicated, but it can help reduce the number of joins that need to be performed in your queries.
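As a concrete sketch of this idea (using SQLite, with a hypothetical author_name column that duplicates Author.name): once the author's name is stored directly in the article table and indexed, an alphabetical listing like the one from the earlier examples needs no join at all:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Denormalized schema: author_name duplicates author.name, so it
    -- must be updated in every article whenever an author is renamed.
    CREATE TABLE article (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER,
        author_name TEXT
    );
    CREATE INDEX ix_article_author_name ON article (author_name);
""")

plan = " | ".join(row[-1] for row in db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM article ORDER BY author_name LIMIT 5"
))
print(plan)  # a single scan of article using the index; no join step
```

The cost has moved to the write side: renaming an author now means updating every one of their articles, typically inside a transaction, which is exactly the trade-off denormalization makes.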

Conclusion

I hope this article helps you get started on debugging your slow queries. To summarize, here is how I go about optimizing slow database queries:

  • Identify a slow query
  • Use EXPLAIN or EXPLAIN QUERY PLAN on the query to see how the database interprets it. This must be done on the actual database in which the query was found to be slow. Doing it on a smaller development database, for example, may not give you the same results.
  • Look for table scans that can be converted to index scans by adding indexes. Use the database documentation if you need help understanding parts of the EXPLAIN output.
  • If you can't find any missing indexes, and the query is still slow, see if it is possible to reduce the complexity of the query by duplicating (denormalizing) some data.

Do you have any other tricks to optimize a slow database? Let me know below in the comments!

3 comments

  • #1 Shahrukh said 2021-09-03T01:59:00Z

    Didn't know that foreign keys are not indexed by default. Thanks for the insight.

  • #2 Vamsi Ramineedi said 2021-09-14T18:52:56Z

    Excellent article Miguel. Do you have any thoughts on AI tools like 'eversql' for optimizing sql queries? Would they automatically check the missing indexes and denormalize?

  • #3 Miguel Grinberg said 2021-09-15T07:42:40Z

    @Vamsi: I'm not familiar with EverSQL, but if you look at their FAQ they say that they accept the query, the database schema and the "explain" output as inputs, and they provide recommendations based on all that information, including indexing changes. I doubt they'll go as far as suggesting how to denormalize.
