Essential elements of high performance applications
Offloading work to the SQL database
In the previous post in this series we discussed the importance of proper use of indexes for performance when defining your SQL database. Creating indexes is a simple but essential part of application performance. However, once your database schema is created and indexes are employed, another key aspect of building a fast application is leaning on the database to execute work and perform computations. Doing so effectively will often reduce the total amount of work that both your application servers and your database have to perform. In practice, this means writing SQL queries that use the full breadth of features available.
Joins and Subqueries
One of the simplest examples of this is knowing how and when to employ `JOIN` queries, as well as understanding the differences between the types of `JOIN`s and how they can impact performance.
For example, a common practice that hurts application performance is employing “N+1” queries. We’ve written about this as well as other approaches for retrieving data in the beginning of our post on grouping query results with Haskell. In short, code structured in this manner will run one query for each item in a list of other items retrieved in a preceding query.
Here’s an example of code that does this, where we need to query both a `posts` table and a `users` table to determine the total number of posts made by each author that has recently made a post:
```python
# Runs a SQL query to retrieve recent posts
posts = selectAllRecentPosts()

postCounts = {}
for post in posts:
    # Runs a SQL query to count the posts for a given user
    postCounts[post.authorId] = countPostsForAuthor(post.authorId)
```
Where `selectAllRecentPosts()` may run a query like:
```sql
SELECT *
FROM posts
WHERE created >= NOW() - INTERVAL '1' DAY;
```
And `countPostsForAuthor(123)` runs a query like:
```sql
SELECT COUNT(DISTINCT posts.id)
FROM posts
WHERE posts.author_id = 123;
```
These queries are both simple, and it’s easy to make them both take advantage of indexes in order to ensure they run quickly. However, the issue with the N+1 approach is that the second query has to run for each post retrieved by the first query. This means that a list of 20 posts will run 21 (or 20+1) queries.
Not only does this require waiting for the database to execute 21 queries (a number that will scale as the number of recently created posts goes up), but this also requires 21 separate network calls between the application and the database. It is common practice to have database server hardware that is separate from the application server, so network latency of 7 milliseconds per query will add 147 milliseconds to our request in network latency alone. If we had 150 recently created posts, this network latency would balloon to a whopping 1,057 milliseconds, or over 1 second of just waiting for the application to send and retrieve data from the database.
With this in mind, we want to offload this work to the database while also reducing the number of database calls required. So instead of the code we saw earlier, we’re going to move all of the work to the database and change our code to invoke the following query:
```sql
SELECT recent_authors.id, COUNT(posts.id)
FROM (
  SELECT DISTINCT users.*
  FROM users
  INNER JOIN posts
    ON users.id = posts.author_id
  WHERE posts.created >= NOW() - INTERVAL '1' DAY
) AS recent_authors
INNER JOIN posts
  ON recent_authors.id = posts.author_id
GROUP BY recent_authors.id;
```
This query is more complicated than either of the preceding queries on their own. It employs an `INNER JOIN` to retrieve `users` rows along with the associated `posts` table data, and it uses a subquery to determine which authors have been recently active. Using this approach, we are certain that we will always run exactly one query. Moreover, our code is simplified as well, as we were able to eliminate the `for` loop entirely and just invoke this query directly.
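As a runnable sketch of the single-query approach, the following uses an in-memory SQLite database with made-up sample data (the table and column names follow the post, but the rows and dates are hypothetical). SQLite lacks `INTERVAL` arithmetic, so the cutoff date is passed as a parameter:

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, created TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO posts VALUES
        (1, 1, '2022-07-15'),
        (2, 1, '2022-07-01'),
        (3, 2, '2022-07-15'),
        (4, 3, '2022-06-01');
""")

# One round trip: the subquery finds recently active authors, then the
# INNER JOIN counts every post (recent or not) for each of them.
rows = conn.execute("""
    SELECT recent_authors.id, COUNT(posts.id)
    FROM (
        SELECT DISTINCT users.*
        FROM users
        INNER JOIN posts ON users.id = posts.author_id
        WHERE posts.created >= ?
    ) AS recent_authors
    INNER JOIN posts ON recent_authors.id = posts.author_id
    GROUP BY recent_authors.id
""", ("2022-07-14",)).fetchall()

print(sorted(rows))  # [(1, 2), (2, 1)] -- alice has 2 posts total, bob has 1
```

Carol authored no recent posts, so she is excluded entirely, while alice's total count includes her older post.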
This example of replacing two separate queries with a single, more complex query that employs a `JOIN` and a subquery illustrates our general point: let the database do the work.
One other important point to make here is that the type of `JOIN` we are using is deliberate. An `INNER JOIN` will join two tables and only keep rows that have matches on both the left and right side of the join. In our case, that means every result must include both a `users` table entry and a `posts` table entry. This is exactly what we want, since we’re only interested in displaying post counts for users that have recently made a post.
If, by contrast, we wanted to generate a report that included the name of each user as well as how many posts they’ve authored, including users that did not author any posts, we might write a query like this:
```sql
SELECT users.id, COUNT(DISTINCT posts.id)
FROM users
LEFT JOIN posts
  ON users.id = posts.author_id
GROUP BY users.id;
```
Here, the `LEFT JOIN` (which can also be written as `LEFT OUTER JOIN`) changes the way the query runs to keep results from the `users` table even if there are no matching `posts` rows. This is a requirement, given that we are interested in displaying a count of `0` for users that have never authored anything.
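The difference is easy to see in a minimal sketch against an in-memory SQLite database with hypothetical data, where one user (bob) has authored no posts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1), (2, 1);
""")

# LEFT JOIN keeps every users row; COUNT(DISTINCT posts.id) ignores the
# NULLs produced for users with no matching posts, yielding 0.
rows = conn.execute("""
    SELECT users.id, COUNT(DISTINCT posts.id)
    FROM users
    LEFT JOIN posts ON users.id = posts.author_id
    GROUP BY users.id
""").fetchall()

print(sorted(rows))  # [(1, 2), (2, 0)] -- bob is kept, with a count of 0
```

With an `INNER JOIN`, bob's row would disappear from the result set entirely.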
Different types of joins have different behaviors and performance implications. `INNER` and `LEFT` joins are the two most common types you are likely to encounter in most applications, but there are also `FULL`, `RIGHT`, and `CROSS` joins that you may encounter in certain situations.
SQL databases aren’t slow, and in a properly indexed database with sensibly written queries, leaning into the database will result in better system performance than writing simpler queries and performing the same work in the application instead.
Grouping and Aggregations
Another way to lean on our database is to use it for calculating aggregates. In simple terms, aggregation is the computation of a single value from many values (or many rows, in the case of SQL aggregations). Examples of aggregate functions supported by most databases include `SUM`, `AVG`, `MIN`, and `MAX`. Perhaps the most commonly used aggregate function is `COUNT`. Aggregations are almost always used alongside a `GROUP BY` clause, which defines how to group the rows for a given set of results.
We employed this manner of grouping above in order to count the number of posts each user had created in our specified search interval. In the outer query, we indicated that we wanted to `GROUP BY recent_authors.id`, which causes all rows with a given user ID to “collapse” into a single row, and any aggregate functions (the `COUNT` function in our case) to treat all rows in each group as input to the respective calculation. In our query, every `posts.id` for each author was counted, giving us the number of posts each user created.
Grouping can be somewhat confusing to people first learning SQL, since when the aggregation function is removed, we can’t typically “see” the results of the `GROUP BY` operation. Different database engines handle this situation slightly differently, but generally speaking the effect of grouping isn’t visible until some aggregation function is applied.
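To make the common aggregate functions concrete, here is a small sketch run against an in-memory SQLite database with made-up view counts (the `posts` schema here is simplified for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (author_id INTEGER, view_count INTEGER);
    INSERT INTO posts VALUES (1, 10), (1, 30), (2, 5);
""")

# Each aggregate collapses every row in an author's group to one value.
rows = conn.execute("""
    SELECT author_id, COUNT(*), SUM(view_count), AVG(view_count),
           MIN(view_count), MAX(view_count)
    FROM posts
    GROUP BY author_id
    ORDER BY author_id
""").fetchall()

print(rows)  # [(1, 2, 40, 20.0, 10, 30), (2, 1, 5, 5.0, 5, 5)]
```

Author 1's two rows collapse to a single result row carrying all five aggregates at once.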
Window Functions
Above, we mentioned how grouping causes rows to “collapse” into a single row, specified by whatever column(s) should be used as the group identifier. Window functions provide another way to apply aggregations to subsets of rows in a given set of results, but unlike with grouping, using window functions allows rows in a given grouping (or “partition”) to maintain their own identity and continue to appear in the result set.
Basic window function example
This behavior of window functions is best illustrated through an example. Suppose we have a `post_metrics` table that contains metrics about the performance of each post on our blog. We query the table, using a window function to see view count metrics for each individual post as well as the average post performance for each author.
```sql
SELECT
  author,
  post_title,
  view_count,
  AVG(view_count) OVER (PARTITION BY author) AS average_view_count
FROM
  post_metrics;
```
Which would give us a set of results like:
| author | post_title | view_count | average_view_count |
|---|---|---|---|
| John Smith | SQL basics, pt 1 | 1012 | 772 |
| John Smith | SQL basics, pt 2 | 748 | 772 |
| John Smith | SQL basics, pt 3 | 556 | 772 |
| Michael Walters | Use Python for your next project | 972 | 1660.5 |
| Michael Walters | I was wrong, use Rust for your next project | 2349 | 1660.5 |
As we can see above, the benefit of using a window function over a `GROUP BY` statement is that we preserve each row in the result set. If we were to `GROUP BY author` (as opposed to applying the window over `PARTITION BY author`), we would still be able to see the `average_view_count`. However, depending on the database engine, inferring information about the `post_title` and `view_count` columns would be either impossible due to an error (such as in PostgreSQL, which would not allow us to include these non-aggregated and non-grouped columns in the query) or nonsensical (such as in MySQL, which would display an arbitrary single value from the group).
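A runnable sketch of the query above, using an in-memory SQLite database (window functions require SQLite 3.25+) and the John Smith rows from the sample result set, shows every row surviving with the shared average attached:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post_metrics (author TEXT, post_title TEXT, view_count INTEGER);
    INSERT INTO post_metrics VALUES
        ('John Smith', 'SQL basics, pt 1', 1012),
        ('John Smith', 'SQL basics, pt 2', 748),
        ('John Smith', 'SQL basics, pt 3', 556);
""")

# Unlike GROUP BY, the window function leaves all three rows in place;
# each keeps its own view_count alongside the 772.0 partition average.
rows = conn.execute("""
    SELECT author, post_title, view_count,
           AVG(view_count) OVER (PARTITION BY author) AS average_view_count
    FROM post_metrics
""").fetchall()

print(rows)
```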
More advanced window function example
Let’s look at another application of window functions that really shows their strength. Suppose that we’re writing banking software and want to create an account statement that includes account activity, such as the list of all debits and credits that most banks make visible in their web applications. Let’s assume our database already has a `charges` table that records each debit or credit: the id of the account, the value of the change, and the date it was created (`created_at`). It contains the following data:
| account_id | value | created_at |
|---|---|---|
| 1 | 20 | 2022-07-12 |
| 2 | 95 | 2022-07-12 |
| 2 | 40 | 2022-07-13 |
| 1 | 35 | 2022-07-13 |
| 1 | 50 | 2022-07-14 |
| 1 | -15 | 2022-07-15 |
| 2 | -135 | 2022-07-15 |
Generating a statement that includes activity requires indicating whether each charge is a debit or a credit. In our case we’ll treat positive values as credits and negative values as debits. The activity view also requires displaying the account balance (a running total) along with each charge. Fortunately, we can use a window function to achieve this.
```sql
SELECT
  account_id,
  abs(value) AS charge_amount,
  CASE WHEN value > 0 THEN 'CREDIT' ELSE 'DEBIT' END AS charge_kind,
  SUM(value) OVER (
    PARTITION BY account_id
    ORDER BY created_at
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS balance,
  created_at AS transaction_date
FROM charges;
```
This produces:
| account_id | charge_amount | charge_kind | balance | transaction_date |
|---|---|---|---|---|
| 1 | 20 | CREDIT | 20 | 2022-07-12 |
| 1 | 35 | CREDIT | 55 | 2022-07-13 |
| 1 | 50 | CREDIT | 105 | 2022-07-14 |
| 1 | 15 | DEBIT | 90 | 2022-07-15 |
| 2 | 95 | CREDIT | 95 | 2022-07-12 |
| 2 | 40 | CREDIT | 135 | 2022-07-13 |
| 2 | 135 | DEBIT | 0 | 2022-07-15 |
Like our first example, this query partitions the data, in this case by `account_id`. However, this window function also applies an ordering and specifies which rows should be used in the calculation. The `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` clause tells the database to use only rows up to and including the current row when calculating the sum of the charges. Without the ordering and frame, the window would cover the whole partition, and we would not see a running total in each row but rather the final balance repeated for every row in each partition, similar to how we saw `average_view_count` repeated in the previous example.
In practice, we may want to limit the data retrieval to one account at a time. To do that, we’d simply add a condition like `WHERE charges.account_id = 1` to the query. The window function would still generate the `balance` column for just the one account.
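The running-balance behavior can be sketched in a few lines against an in-memory SQLite database, limited to account 1's rows from the sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE charges (account_id INTEGER, value INTEGER, created_at TEXT);
    INSERT INTO charges VALUES
        (1, 20, '2022-07-12'),
        (1, 35, '2022-07-13'),
        (1, 50, '2022-07-14'),
        (1, -15, '2022-07-15');
""")

# The frame clause limits each row's SUM to the rows at or before it,
# producing a per-row running total within the partition.
rows = conn.execute("""
    SELECT account_id, abs(value) AS charge_amount,
           SUM(value) OVER (
               PARTITION BY account_id
               ORDER BY created_at
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS balance
    FROM charges
    WHERE charges.account_id = 1
    ORDER BY created_at
""").fetchall()

print([r[2] for r in rows])  # [20, 55, 105, 90] -- a running total per row
```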
When building our application, we can utilize window functions to perform these types of operations directly in the query, precluding the need to make an extra pass over the result set in our application code. Without the use of a window function, if we wanted to preserve individual post metrics in our result set, we would either need to write a second query to calculate the average view count or make a pass through the result set in our application code in order to compute the average for each author.
Set Operations: Union, Intersect, and Except
Another way of maximizing the work we offload to the database is utilizing `UNION`, which enables two otherwise distinct queries that return the same column set to be executed as a single query. `UNION` can be used to reduce the number of distinct but similar calls to the database, reducing the round-trip time that we highlighted earlier. `UNION` can also be used as an optimization technique, enabling us to rewrite queries in a way that improves their performance while making them easier to understand. In the linked post, we walk through an in-depth example of a large query joining many tables with complex join conditions that was split into two much simpler queries whose results were combined using `UNION`.
The general structure of a `UNION` query is:

```sql
SELECT
  col_1, col_2, col_3
FROM
  some_table
INNER JOIN
  some_other_table
  ...

UNION

SELECT
  col_1, col_2, col_3
FROM
  some_table
LEFT JOIN
  another_table
  ...
```
The result of a query shaped like the one above would be a single set of rows with the `col_1`, `col_2`, and `col_3` columns. It does not matter which tables each side of the `UNION` performs a `SELECT` on, so long as each side returns the same number of columns, with compatible types, in the same order (the column names of the result are taken from the first query).
The `UNION` operation has the benefit of automatically deduplicating rows. If deduplication is unnecessary or undesirable, the `UNION ALL` operation should be used instead. For larger data sets, the performance of `UNION ALL` may be better than `UNION`, as the deduplication can add to the query execution time.
In addition to `UNION`, there are also `INTERSECT` and `EXCEPT` operations. `INTERSECT` will only include rows that appear in the results of both queries, whereas `EXCEPT` will include all rows from the left query other than those that also appear in the right query. Both `INTERSECT` and `EXCEPT` will deduplicate rows unless the `ALL` keyword is used as a suffix. These operations are useful for filtering data when a single `SELECT` query cannot express the conditions while still performing well, or when the readability of the single-query form suffers.
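The behavior of all three set operations can be seen side by side in a compact sketch using two throwaway tables in an in-memory SQLite database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (2);
    INSERT INTO b VALUES (2), (3);
""")

# UNION deduplicates; UNION ALL keeps every row from both sides.
union = conn.execute("SELECT x FROM a UNION SELECT x FROM b").fetchall()
union_all = conn.execute("SELECT x FROM a UNION ALL SELECT x FROM b").fetchall()
# INTERSECT keeps rows present in both; EXCEPT keeps left-only rows.
intersect = conn.execute("SELECT x FROM a INTERSECT SELECT x FROM b").fetchall()
except_ = conn.execute("SELECT x FROM a EXCEPT SELECT x FROM b").fetchall()

print(sorted(union))   # [(1,), (2,), (3,)] -- duplicates removed
print(len(union_all))  # 5 -- all rows kept
print(intersect)       # [(2,)]
print(except_)         # [(1,)]
```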
All three operations can be used to both reduce the number of distinct queries that need to be sent to the database as well as reduce the workload of the application servers. For example, there’s no need for merging and deduplicating the result of two separate queries in your code when the SQL database has already done the work by the time the query results are returned.
Wrap Up
In this post we discussed some of the primary features that let SQL databases execute queries that not only retrieve data but also perform complex computations. Effectively utilizing features like joins, subqueries, aggregations, window functions, and set operations allows us to reduce the amount of chatter with the database, cuts down on the amount of data sent over the network, and can altogether eliminate computations that our database and application servers would otherwise need to perform. Applying these techniques effectively is critical to a highly performant application.
Further reading about offloading work to the database
Below, we link to PostgreSQL documentation for most of these features, but the functionality is largely the same in MySQL/MariaDB, SQL Server, SQLite, and other SQL databases. If you’re just trying to generally familiarize yourself with these features, reading any database’s documentation should suffice.
- PostgreSQL joins tutorial
- PostgreSQL aggregate function tutorial
- PostgreSQL window function tutorial
- PostgreSQL union, intersect, and except tutorial
This post is part of a series titled Essential elements of high performance applications. The full list of published posts is available below.
- SQL indexes
- Offloading work to the SQL database (this post)
- Server side caching
Christian Charukiewicz and Ben Levy are Partners at Foxhound Systems, where we focus on building fast and reliable custom software. Are you looking for help with something you’re working on? Reach out to us at info@foxhound.systems.