2017-10-10T15:22:43Z

Implementing User Comments with SQLAlchemy

One of the most basic ways to keep the users of your web application engaged is to give them a space to write comments. These days, there are third-party services pretty much for everything, and comments is not an exception. Disqus and Facebook are popular services that allow you to embed comments into your site.

But what if you don't want to use an external service? In this article, I'm going to show you how I implement comments in Python, using the SQLAlchemy ORM and any of the database engines it supports. I'm going to start with an extremely simple approach, and then will go on to discuss a few advanced implementations with support for multiple levels of replies.

The Issues with Comment Services

While offloading your comments to an external service is tempting, there are many reasons why you may not want to do it. The user interface that these services embed into your pages is usually not very flexible, so it may not look well with your site layout. Also, some of your users may find it odd that even though they have an account for your web application, they need to create a second account with another service to write comments.

A valid concern I also heard many other developers mention is the fact that you do not own the comments that appear on your site, and there potential difficulties in exporting this data, should you decide to leave your current provider, or worse the provider deciding to leave you by going out of business.

There is also a security aspect. You may not feel safe trusting information about your users to these large companies that are constantly being attacked by hackers. Just a few days ago, Disqus announced that it has suffered a data breach.

Base Comment Platform

If you are not very picky, you can create a basic comment solution with very little effort. Here is a basic SQLAlchemy model that gets the job done:

from datetime import datetime

class Comment(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    text = db.Column(db.String(140))
    author = db.Column(db.String(32))
    timestamp = db.Column(db.DateTime(), default=datetime.utcnow, index=True)

With this simple model you can keep track of a list of comments. Quick disclaimer: if you are used to working with SQLAlchemy on its own, you will not recognize the db instance that I've used above. For convenience, all the examples in this article use the Flask-SQLAlchemy extension which builds on top of SQLAlchemy, and exposes all the SQLAlchemy attributes from a db database instance. If you are using SQLAlchemy without the Flask extension, you will need to make minor changes to import all the attributes attached to db from their native SQLAlchemy modules.

To add a new comment, you simply create a new Comment instance and write it to the database:

comment = Comment(text='Hello, world!', author='alice')
db.session.add(comment)
db.session.commit()

Note how I do not worry about the timestamp field, because in the model definition I have it getting the current UTC time by default. Thanks to the auto-timestamping, I can efficiently retrieve all comments sorted by date in ascending or descending order:

# oldest comments first
for comment in Comment.query.order_by(Comment.timestamp.asc()):
    print('{}: {}'.format(comment.author, comment.text))

# newest comments first
for comment in Comment.query.order_by(Comment.timestamp.desc()):
    print('{}: {}'.format(comment.author, comment.text))

To integrate this solution with your application, you may need to change the author field to be a foreign key into your User model instead of being just a string. If you accept comments on many different pages, you may also need to add an additional field that links each comment to the page of your application it belongs, and then that will allow you to retrieve comments for each page by filtering on that field. This is actually the implementation that I have chosen for the comments on this blog.

A simple, yet complete implementation of this technique is available in this gist.

Implementing Comment Replies

If all you need is a flat list of comments, then the simple implementation in the previous section should do the job nicely. But what if that isn't enough?

For many applications, you may want users to be able to reply to comments made by other users, and then display all these linked comments as hierarchical threads. Believe it or not, this is extremely difficult to do in a relational database.

There are two fairly well-known implementations that address the problem of representing a tree structure in relational form, but unfortunately both have severe limitations. I'll first describe them to you, along with their problems, and then I'll tell you about my own solution, which also has some limitations but they are not as bad as those of the others.

Adjacency Lists

The first approach is called Adjacency List and is actually very simple to implement. The idea is to add a column to the Comment model that tracks the parent of each comment. If each comment has a relationship to its parent, then you can figure out the entire tree structure.

For the model, you would have something like this:

class Comment(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    text = db.Column(db.String(140))
    author = db.Column(db.String(32))
    timestamp = db.Column(db.DateTime(), default=datetime.utcnow, index=True)
    parent_id = db.Column(db.Integer, db.ForeignKey('comment.id'))
    replies = db.relationship(
        'Comment', backref=db.backref('parent', remote_side=[id]),
        lazy='dynamic')

What I did here is add a self-referential one-to-many relationship to the model I used above. Since each comment now has a parent_id foreign key, I can easily find the direct replies to a given comment just by looking for all comments that have their parent set to that comment.

For example, let's say I want to represent the following comment thread:

alice: hello1
  bob: reply11
    susan: reply111
  susan: reply12
bob: hello2
  alice: reply21

The code to add the comments with the above structure is shown below:

c1 = Comment(text='hello1', author='alice')
c2 = Comment(text='hello2', author='bob')
c11 = Comment(text='reply11', author='bob', parent=c1)
c12 = Comment(text='reply12', author='susan', parent=c1)
c111 = Comment(text='reply111', author='susan', parent=c11)
c21 = Comment(text='reply21', author='alice', parent=c2)
db.session.add_all([c1, c2, c11, c12, c111, c21])
db.session.commit()

So far, this is all fairly easy. The problem comes when you need to retrieve the comments in a way that is proper for presentation. There is actually no possible query to retrieve these comments in the correct threaded order. The only way to do it is by issuing recursive queries. The following code uses recursive queries to print the comment thread to the terminal with the proper indentation:

def display_comment(comment, level=0):
    print('{}{}: {}'.format('  ' * level, comment.author, comment.text))
    for reply in comment.replies:
        display_comment(reply, level + 1)

for comment in Comment.query.filter_by(parent=None).order_by(Comment.timestamp.asc()):
    display_comment(comment)

The for-loop at the bottom retrieves all the top-level comments (those that have no parent), and then for each of them it recursively retrieves their replies in the display_comment() function.

This solution is extremely inefficient. If you have a comment thread with a hundred comments, you will need to issue a hundred additional database queries after the one that gets the top-level comments to reconstruct the entire tree. And if you wanted to paginate your comments, the only thing you can do is paginate the top-level-comments, you can't really paginate the comment thread as a whole.

So while this solution is extremely elegant, in practice you can't really use it unless your dataset is small. You can see a complete implementation of this technique in this gist.

Nested Sets

The second technique is called Nested Set. This is a fairly complex solution that adds two columns to the table called left and right, and a third optional one called level. All columns store numbers, and are used to describe the traversal order for the tree structure. When you go down the three you assign sequential numbers to the left field, and when you go up you assigned them to the right field. As a result of this numbering, comments that have no replies have consecutive numbers for their left and right. The level column keeps track of how many levels of parents each comment has.

For example, the comment thread above would be given the following left, right and level values:

alice: hello1        left:  1  right:  8  level: 0
  bob: reply11       left:  2  right:  5  level: 1
    susan: reply111  left:  3  right:  4  level: 2
  susan: reply12     left:  6  right:  7  level: 1
bob: hello2          left:  9  right: 12  level: 0
  alice: reply21     left: 10  right: 11  level: 1

With this structure, if you want to obtain the replies to a given comment, all you need to do is look for all comments where their left is greater than the parent's left and their right is less than the parent's right. For example, the children of the top post by alice are those with left > 1 and right < 8. The children of the post by bob in the second line are those with left > 2 and right < 5. If you sort the results by left, you get them in the correct threaded order, and then you can use level to determine the indentation to use when you render them on a web page. The big advantage of this method versus adjacency lists is that you can get the comments in the correct threaded order with a single database query, and even use pagination to get a subset of the thread.

You may be thinking that this is actually a pretty clever solution that nicely addresses this problem, but have you considered what the algorithm looks like to assign these three numbers to each comment? That is where the problem with this solution lies. Each time a new comment is added, a potentially large portion of the comments table will have to be updated with new left and right values. When you work with adjacency lists, the insertions are cheap and the queries are expensive. With nested sets it is the reverse, insertions are expensive and queries are cheap.

I have never implemented this algorithm myself, so I do not have example code readily available to show you how it looks, but if you want to see a real-world implementation, the django-mptt project is a great example that works with the Django ORM. You can probably guess that queries are fairly simple from the examples above, but the logic required to insert a new comment is complex and highly inefficient, as a large number of comments might need to be updated depending on where in the tree the new comment is inserted. This solution only makes sense in cases where insertions are uncommon and queries are frequent.

Thinking Outside the Box

Unfortunately none of the solutions above worked well for my needs. I came up with a completely different approach that has both efficient inserts and queries, but in exchange has other, less severe limitations.

This solution adds a single column of type text, which I'm going to name path:

class Comment(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    text = db.Column(db.String(140))
    author = db.Column(db.String(32))
    timestamp = db.Column(db.DateTime(), default=datetime.utcnow, index=True)
    path = db.Column(db.Text, index=True)

Each comment gets assigned a unique, increasing value when it is inserted, pretty much in the same way each comment gets the auto-incrementing id by the database. So the first comment gets a 1, the second gets a 2, and so on. The contents of path for a top-level comment is the value of this counter. But for a reply, path is set to the path of the parent with the counter appended at the end. Using the same comment hierarchy from above examples, here are those comments in the random order they may have been entered, with their path values assigned:

alice: hello1        path: '1'
bob: reply11         path: '1.2'
bob: hello2          path: '3'
susan: reply12       path: '1.4'
susan: reply111      path: '1.2.5'
alice: reply21       path: '3.6'

For clarity, I have inserted a period in between each path component, but in a real implementation that is not necessary. Now if I run a query on this table that sorts rows by path, I get the correct threaded order. And to know what level of indentation each comment needs to have, I can look at how many components the path have.

alice: hello1        path: '1'    <-- top-level
bob: reply11         path: '1.2'    <-- second-level
susan: reply111      path: '1.2.5'    <-- third-level
susan: reply12       path: '1.4'    <-- second-level
bob: hello2          path: '3'    <-- top-level
alice: reply21       path: '3.6'    <-- second-level

Inserting a new comment using this method is fairly cheap. I just need to have a way to generate a unique and increasing value to assign to the new comment, which I can, for example, steal from the database assigned id. I also need to know the parent of the comment, so that I can take its path and use it when creating the path of the child comment.

Queries are also cheap. By adding an index on the path column, I can get the comments in the correct threaded order very efficiently, just by sorting by path. And I can also paginate the list.

So if this is all so great, what are the bad news? Take a look at the path assignments in the example above and see if you can spot the limitation.

Did you find it? How many comments do you think this system supports? In the way I constructed this example, you can't really have more than 10 comments (or actually 9, unless you start counting from 0). Sorting by path only works when the numbers that are used in the path field have all the same number of digits, in this example just one. Once a 10 appears, the sorting breaks, because I'm using strings, so 10 sorts between 1 and 2 and not after 9.

So what's the solution? Let's allocate two digits for each component in the path:

alice: hello1        path: '01'
bob: reply11         path: '01.02'
susan: reply111      path: '01.02.05'
susan: reply12       path: '01.04'
bob: hello2          path: '03'
alice: reply21       path: '03.06'

Now I can go up to 99 comments, if I'm careful to right align and zero-pad each component. But of course this is still too limiting, so instead of two digits you will probably want to use more. If you use six digits, for example, you can have up to a million comments before you run into problems. And if you find that you are approaching the limit with the number of digits that you used, you could take the comments offline for maintenance, regenerate the paths with more digits and then you would be okay again.

The implementation is actually not that bad. I decided to combine this solution with the adjacency list option I presented above, as that gives me an easy and efficient way to obtain the parent given a comment (I could do away with the adjacency list and extract the parent id from the path field, but that seems overly complicated). I encapsulated the comment insertion logic in a save() method in the Comment model, so that it can be invoked easily from any part of the application. Here is the updated model, including the reintroduced adjacency list, the save() method, and as a bonus, a level() method that returns the indentation level of any comment:

class Comment(db.Model):
    _N = 6

    id = db.Column(db.Integer, primary_key=True)
    text = db.Column(db.String(140))
    author = db.Column(db.String(32))
    timestamp = db.Column(db.DateTime(), default=datetime.utcnow, index=True)
    path = db.Column(db.Text, index=True)
    parent_id = db.Column(db.Integer, db.ForeignKey('comment.id'))
    replies = db.relationship(
        'Comment', backref=db.backref('parent', remote_side=[id]),
        lazy='dynamic')

    def save(self):
        db.session.add(self)
        db.session.commit()
        prefix = self.parent.path + '.' if self.parent else ''
        self.path = prefix + '{:0{}d}'.format(self.id, self._N)
        db.session.commit()

    def level(self):
        return len(self.path) // self._N - 1

The _N class variable stores the number of digits I'm using for each component. For this example I set it to 6, which supports up to a million comments. To obtain a unique and auto-incrementing number to use in the path, I just steal the id assigned by the database, so for that reason I have to save the comment twice. First I save it without the path to have the database assign the id, then a second time with the path set. Saving the comment twice is not ideal, but I think it is a good compromise, considering all the benefits I get. It would also be possible to avoid the double save if you come up with a different way to generate auto-incrementing numbers, but that requires a very careful design to avoid race conditions, so I decided to stick with the double save solution.

In this implementation I used the dot separators between components, but that isn't really needed. I left them there because it makes the path more readable, but if you prefer to save space, it is totally fine to not include the periods and make path a packed sequence of digits.

The level() method is very easy to implement, by taking the length of the path attribute and dividing it by the number of digits in each component. This method can be extremely useful to generate the proper indentation when you render these comments as a thread.

Below you can see how I can insert the comments with the structure I've been using in the examples above. Basically I had to stop referencing the database session directly and instead call to save() for each comment:

c1 = Comment(text='hello1', author='alice')
c2 = Comment(text='hello2', author='bob')
c11 = Comment(text='reply11', author='bob', parent=c1)
c12 = Comment(text='reply12', author='susan', parent=c1)
c111 = Comment(text='reply111', author='susan', parent=c11)
c21 = Comment(text='reply21', author='alice', parent=c2)
for comment in [c1, c2, c11, c12, c111, c21]:
    comment.save()

And here is how I can print the comments to the terminal with the correct indentation:

for comment in Comment.query.order_by(Comment.path):
    print('{}{}: {}'.format('  ' * comment.level(), comment.author, comment.text))

The complete and runnable example for this implementation is in this gist.

Possible Improvements

I think this solution is very decent, but depending on the application you may find that you need to tweak things a little bit to achieve what you want.

As I presented it above, this solution can manage a single set of comments. Unfortunately, this isn't that useful, as most applications have many pages on which users can write comments. To be able to retrieve just the comments that apply to a single page, all you need to do is add another column to the the Comment model that links to the page on which the comment should appear. For example, in a blog application, this could be a foreign key to the post id. This id needs to be copied in all the comments, including replies, so that you can run a query similar to the following:

for comment in Comment.query.filter_by(post_id=post.id).order_by(Comment.path.asc()):
    print('{}{}: {}'.format('  ' * comment.level(), comment.author, comment.text))

The save() method can copy the post_id field from the parent into the child comment so that you don't have to manually copy these IDs all the time.

Another limitation with this solution is that it can only retrieve the comments in the order the top-level comments were made, from oldest to newest. For many applications you may want to sort the top-level comments from newest to oldest, while still have all the replies in threaded order under each parent comment. In other cases, your users could be voting top-level comments up or down, and you want to show the most voted comments first.

To implement these alternative sorting strategies you have to use an additional column. If you want to be able to sort by the timestamps of the top-level comments, you can just add a thread_timestamp column, which has the timestamp of the top-level comment duplicated in each reply. The save() method can pass this timestamp down from parent to children, so that it does not become a burden to manage this additional column. Then you can sort by timestamp with a secondary sort by path to preserve the order of the replies:

for comment in Comment.query.order_by(Comment.thread_timestamp.desc(), Comment.path.asc()):
    print('{}{}: {}'.format('  ' * comment.level(), comment.author, comment.text))

If you want to sort by the votes users make on top-level comments, the solution is similar. Instead of a thread_timestamp, you have to use a thread_votes column. For this solution to work, you still need to have the value of this column duplicated in all the replies associated with the parent comment. If you wanted to show most voted top-level comments first, you would do something like this:

for comment in Comment.query.order_by(Comment.votes.desc(), Comment.path.asc()):
    print('{}{}: {}'.format('  ' * comment.level(), comment.author, comment.text))

The voting solution has a twist, however. Users will be voting top-level comments up or down, so each time a top-level comment receives a vote, the new vote score needs to be written not only on the top-level comment, but also on all the replies, to ensure that the sorting keeps the threads sorted correctly. You could do this update in two steps, first by getting the list of children, and then updating the vote score on all of them:

class Comment(db.Model):
    def change_vote(vote):
        for comment in Comment.query.filter(Comment.path.like(self.path + '%')):
            self.thread_vote = vote
            db.session.add(self)
        db.session.commit()

If you prefer something a little bit more efficient, you can do it with a update() call bypassing the ORM.

Conclusion

I hope this was a useful overview that helps you find the best solution for your application's comment platform. As I indicated above, I have a gist with example code for flat comments, adjacency lists, and the last solution based on comment paths. Do you use a different solution? I'd like to know, so please let me know below in the comments.

10 comments

  • #1 Fernando said 2017-10-10T18:43:30Z

    Nice post! Comments on a real application as shown above definitely are non trivial.

  • #2 Matt said 2017-10-11T13:22:42Z

    Fun and inventive exploration of maintaining trees in flat RDBs— and insightful, as always! I’d been puzzling over similar trade-offs for a different application before deciding I should just use MongoDB. This article reminded me how much fun it would’ve been to hack something together, instead. Good times.

  • #3 Paul Everitt said 2017-10-11T13:55:44Z

    We use Adjacency List quite a bit with recursive CTEs which can walk the list in one query.

  • #4 Miguel Grinberg said 2017-10-11T21:50:48Z

    @Paul: I have not really considered CTEs due to not having uniform support across popular DB engines (i.e. MySQL). But it looks like MySQL 8 is going to have CTEs so I may need to revisit this topic.

  • #5 Kevin said 2017-10-31T10:17:30Z

    Awesome! Thanks But I think save() method should be "return len(self.path.split('.')) - 1"

  • #6 Alex said 2017-11-09T02:44:17Z

    So I see you've implemented these models in SQL. But how do I create a web form that handles all this? I have a feeling it has to do with WTForms and wtf.quickform but I'm not sure...

  • #7 Miguel Grinberg said 2017-11-09T06:58:26Z

    @Alex: Have you see my beginner-level articles, like the Flask Mega-Tutorial, or my book, or the PyCon tutorials on youtube? This article is about the technique to do threaded comments, if you want to learn the more basic topics such as how to create a web form, look for those.

  • #8 Kyle Lawlor said 2017-11-25T22:52:33Z

    This may be a situation where using a graph database could offer some value. If I can get around to implementing something I will come back to share :)

  • #9 oneohthree said 2017-12-08T01:00:23Z

    I'm sorry to ask here. How can I specify a PostgreSQL schema in Flask SQLAlchemy connection? I dropped Public schema and I'm using a custom one. Thank you for this great blog.

  • #10 Miguel Grinberg said 2017-12-08T23:28:37Z

    @dd: You can add the schema as a __table_args__ attribute in your model declaration. Example: class MyModel(db.Model): __table_args__ = {'schema':'my_schema'}

Leave a Comment