Reducing the pain of grouping SQL query results using Haskell
Relational databases allow us to model the associations between different types of data in our system domain. Most application database schemas rely on normalization to avoid data duplication. We use SQL to retrieve this data from a database, but SQL has limitations. When we need data from several tables, we're forced to make trade-offs in how we query our data, and our query results often do not contain an ideal representation of the relationships between our data entities.
In order to mitigate this limitation of SQL, we typically transform the data we retrieve via our queries in our application layer. With a system written in Haskell, we can use the `Semigroup` typeclass and the append operation it exposes (`<>`) to transform the data into the shape we need by defining our desired custom data types and simple transformation functions. In this post we'll explore this method of solving this problem in more detail.
Limitations of SQL
SQL is an excellent tool for querying data, with queries allowing us to retrieve data from one or several tables in our database in many ways. However, the data that a query can produce has limitations. Namely, a query can only produce two-dimensional data consisting of columns and rows. We can write `JOIN` queries to retrieve data from several tables at once, but we are still limited to producing two-dimensional data.
This is not a problem when the data between joined tables has a one-to-one relationship, but when data has one-to-many or many-to-many relationships, we’ll see repetition in our query results (e.g. if a user has authored many posts, we’ll see that user repeat in any query that joins the tables together). With this in mind, we have several strategies we can apply in how we query our data.
Query strategy 1: N+1 queries
When we have one-to-many relationships in our data, we can eschew using `JOIN`s altogether and instead run many simple queries, the number of which scales with the number of related pieces of data. Typically this results in running successive queries in a loop to get each related piece of data:

```
authors = queryAuthorsByCountry('USA') // 1
foreach author <- authors: // N
  authors[author.id]['books'] = queryBooksByAuthorId(author.id)
```
The benefit of this approach is that it is simple and allows us to easily write application code that results in data structured in a format that accurately models the relationships between entities. But this approach has a major performance trade-off: the number of queries increases linearly ("N") with the number of items we're working with. The workload grows multiplicatively when we are working with data that is nested across multiple levels, and the N+1 queries can become (M×N)+N+1.

```
authors = queryAuthorsByCountry('USA') // 1
foreach author <- authors: // N
  books = queryBooksByAuthorId(author.id)
  foreach book <- books: // N * M
    book_genre_tags = queryBookGenresByBookId(book.id)
    // ...
```
In practice, this multiplicative growth in the number of queries executed can quickly result in significant slowdowns in an application. An application database schema often has half a dozen or more tables that require querying in response to a single request, and it's not uncommon to see this approach result in hundreds or thousands of queries being executed per request. What's more, it's easy for a developer to accidentally write code that produces queries like this: many ORM libraries in object-oriented languages produce queries using this pattern when accessing related data entities.
Query strategy 2: Query per table
An improvement over N+1 queries is to run one query per table. There are several variations in how this can be done in practice, but the most common approach is to use the results from each query to construct a list of keys to constrain each successive query by.
```
authors = queryAuthorsByCountry('USA')

author_id_list = []
foreach author <- authors:
  author_id_list.push(author.id)

books = queryBooksByAuthorIds(author_id_list)

book_id_list = []
foreach book <- books:
  book_id_list.push(book.id)

book_genre_tags = queryBookGenresByBookIds(book_id_list)
```
This approach results in significantly better performance than the N+1 approach in many situations. It’s easy to see that by not running the queries on the related table in a loop, the number of queries executed remains relatively small. The downside, however, is that although we have dramatically reduced the number of queries we’re executing, the data we now have in memory is in distinct lists grouped by type, which does not capture any of the parent-child relationships between the data entities.
When rendering a user interface or returning a JSON API response, it's often necessary to structure the data in a way where child entities are nested within their parents (e.g. each author contains a list of books, and each book contains a list of genre tags). In order to do this, we have to make multiple passes through the data to create the required nested structure, as shown in the sketch below. This requires writing additional code, the complexity of which scales both with how deeply nested the relationships are and with how many related types of entities are retrieved at each level.
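For example, grouping a flat list of books under their authors might look something like the following sketch (the `Author` and `Book` record types here are hypothetical stand-ins for the results of the two queries above):

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical records standing in for rows returned by the two queries.
data Author = Author { authorId :: Int, authorName :: String }
data Book = Book { bookAuthorId :: Int, bookTitle :: String }

-- One pass to index books by their author's id, and a second pass to
-- attach each author's list of books to the author itself.
nestBooks :: [Author] -> [Book] -> [(Author, [Book])]
nestBooks authors books =
  let booksByAuthor =
        Map.fromListWith (++) [ (bookAuthorId b, [b]) | b <- books ]
  in [ (a, Map.findWithDefault [] (authorId a) booksByAuthor) | a <- authors ]
```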
Query strategy 3: Single query with joins
In order to cut down on the amount of querying we're doing, we can write a single query that uses `JOIN` operations to retrieve data from each of the related tables. This approach typically results in the best performance, as long as queries are written correctly and columns are indexed where necessary. The downside of this approach is that our data is subject to fanout, which occurs when the primary table has fewer rows than a joined table, i.e. when there is a one-to-many relationship between the data.
If we had a schema like the following, it would be possible to write several queries in which a single user's data repeats over several rows. This would occur when joining `users` and `posts` for any user that had written more than one post. The same would be true if we retrieved `posts` with `post_comments`, for posts with more than one comment.
```
        users                            posts
 +---------+-------+          +-------------+-----------+
 | id      | int   |<----+    | id          | int       |<-+
 | name    | text  |     +----+ user_id     | int       |  |
 | email   | text  |          | title       | text      |  |
 +---------+-------+          | body        | text      |  |
      ^                       | created_at  | timestamp |  |
      |                       +-------------+-----------+  |
      |                                                    |
      |     post_comments                                  |
      |  +-------------+-----------+                       |
      |  | id          | int       |                       |
      |  | post_id     | int       +-----------------------+
      +--+ user_id     | int       |
         | body        | text      |
         | created_at  | timestamp |
         +-------------+-----------+
```
We can illustrate the fanout problem for the above schema by running the following query:
```sql
SELECT
  users.id AS user_id,
  users.name,
  posts.id AS post_id,
  posts.title,
  posts.created_at
FROM users
LEFT JOIN posts ON users.id = posts.user_id;
```
This query might yield results like the following:
| user_id | name       | post_id | title                    | created_at          |
|---------|------------|---------|--------------------------|---------------------|
| 1       | John Smith | 10      | How to write SQL         | 2020-11-18 13:06:02 |
| 1       | John Smith | 11      | How to write SQL pt. 2   | 2020-11-21 18:17:34 |
| 1       | John Smith | 12      | How to write SQL pt. 3   | 2020-12-01 15:26:22 |
| 2       | Karen Doe  | 17      | How to really write SQL  | 2021-01-18 11:19:44 |
We can see that the `users` table data containing John Smith's information is repeated multiple times in our query results, with several rows containing his primary key (a `user_id` value of 1). This is because a query like this will always produce a denormalized form, with each row's data repeated however many times is necessary to match the number of entries in the joined table.
This data fanout increases as the number of joined tables increases. If we added a `LEFT JOIN` on `post_comments` to our query, and there were multiple comments for each post, we would see John Smith's information duplicated in our query results even more times. We'd also see information from the `posts` table duplicated.
In order to structure the data in a way that models parent-child relationships, we can write application code that will transform the data into the necessary format. We’ll apply several different strategies for doing this in Haskell.
Transforming denormalized data to model parent-child relationships
We're big fans of the Esqueleto library, since it allows us to write what looks and feels like SQL in a Haskell EDSL, which gives us the benefits of type safety and compile-time checking of our SQL queries. Esqueleto uses a data type that looks roughly like the following to model entities that have been saved in the database, with `entityKey` being the primary key of a record:
```haskell
data Entity a = Entity
  { entityKey :: Key a
  , entityVal :: a
  }
```
So a post that is saved in the database would be represented by `Entity Post`. The result of our query that joins users, posts, and post comments together would look like:

```haskell
[(Entity User, Entity Post, Entity PostComment)]
```
But we want to transform this into:

```haskell
[(Entity User, [(Entity Post, [Entity PostComment])])]
```
Approach 1: Grouping child data with a custom merge function
One common approach to transforming the former into the latter is to write a pair of functions that look like the following:
```haskell
{-# LANGUAGE FlexibleContexts #-}

import qualified Data.Map as Map
import Data.Traversable
import Database.Persist (Entity (..), Key)

groupData :: (Ord (Key a)) => [(Entity a, b)] -> [(Entity a, [b])]
groupData res =
  Map.elems $ foldr
    (\(a, b) accumulator ->
      Map.insertWith
        mergeData
        (entityKey a)
        (a, [b])
        accumulator
    ) Map.empty res

mergeData :: (a, [b]) -> (a, [b]) -> (a, [b])
mergeData (a, b) (a', b') =
  (a, b ++ b')
```
In order to apply `groupData` to solve our problem, we would have to massage our data and apply the function recursively, since we have multiple levels at which the data must be grouped. This approach works, but has shortcomings. The function above only helps us deal with a single level of parent-child relationships. Depending on the specific results of a query, we may end up writing several variations of functions that look like `groupData` and `mergeData`. We end up with a lot of similar-but-slightly-different code when we need to apply this grouping operation to the results of various queries. We think that we can improve on this approach.
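For instance, one way to group our three-entity result two levels deep is to massage each row into nested pairs and then apply `groupData` at both levels (a sketch; `groupNested` is our own illustrative name):

```haskell
groupNested
  :: (Ord (Key a), Ord (Key b))
  => [(Entity a, Entity b, Entity c)]
  -> [(Entity a, [(Entity b, [Entity c])])]
groupNested rows =
  -- group the inner (Entity b, Entity c) pairs under each outer parent
  fmap (fmap groupData) $
    -- group the outer level by the parent's key
    groupData [ (a, (b, c)) | (a, b, c) <- rows ]
```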
Approach 2: Grouping child data by writing our own Semigroup and Monoid instances
If we look at the `mergeData` function, we'll notice that it looks like the `mappend` or `<>` operation defined by the `Semigroup` typeclass, which we can apply to any values whose types have a `Semigroup` instance defined. For example, we can apply this operation to tuples so long as each member is also an instance of `Semigroup`:
```haskell
genericMergeData :: (Semigroup a, Semigroup b) => (a, b) -> (a, b) -> (a, b)
genericMergeData (a, b) (a', b') =
  (a, b) <> (a', b')
```
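For example, since `String`s and lists both have `Semigroup` instances, appending a pair of them in GHCi combines the values at each position:

```haskell
λ> ("How to ", [1]) <> ("write SQL", [2, 3])
("How to write SQL",[1,2,3])
```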
It should be noted that we don't even need a named function like the one above at all; we can modify `groupData` to replace our call to `mergeData` with `(<>)`. With this observation in mind, we can try to leverage the `Semigroup` typeclass and write our own instances.
```haskell
data GroupedPostResult = GroupedPostResult
  { groupedResultUser :: Entity User
  , groupedResultPosts :: [Entity Post]
  }

instance Semigroup GroupedPostResult where
  GroupedPostResult a b <> GroupedPostResult _ b' = GroupedPostResult a (b ++ b')
```
What we have now is a single operator instead of a custom `mergeData` function. This by itself doesn't help us a lot, but we can also observe that applying `foldr` using `(<>)` is very similar to `mconcat`. However, since `mconcat` is an operation in the `Monoid` typeclass, we will have to choose a type that has such an instance as well. In order to accumulate the list of values for each of our children, we want a `Map (Key User) GroupedPostResult`, using the `Map` from the `Data.Map.Strict` module in the `containers` package.
But there's a problem. The `Semigroup` instance for `Map` is a union operation that discards the right-hand side:

> The `Semigroup` operation for `Map` is `union`, which prefers values from the left operand. If `m1` maps a key `k` to a value `a1`, and `m2` maps the same key to a different value `a2`, then their union `m1 <> m2` maps `k` to `a1`.
We don't want to discard data; we want to merge it. In order to do this, we can write our own `newtype` for `Map` that implements a `Semigroup` instance with the behavior we want.
```haskell
newtype GroupedPostResultMap = GroupedPostResultMap
  { unGroupedPostResultMap :: Map (Key User) GroupedPostResult
  }

instance Semigroup GroupedPostResultMap where
  GroupedPostResultMap lhs <> GroupedPostResultMap rhs =
    GroupedPostResultMap $ Map.unionWith (<>) lhs rhs

instance Monoid GroupedPostResultMap where
  mempty = GroupedPostResultMap Map.empty
  mappend = (<>)

groupData :: [(Entity User, Entity Post)] -> [(Entity User, [Entity Post])]
groupData =
  fmap (\(GroupedPostResult a b) -> (a, b)) .
  Map.elems .
  unGroupedPostResultMap .
  mconcat .
  fmap (\(a, b) ->
    GroupedPostResultMap $
      Map.singleton (entityKey a) (GroupedPostResult a [b]))
```
While this works, it's a lot of boilerplate code to write just to get the correct `Map` behavior. Moreover, if the data we're working with has more than one level of nesting, the final lambda function that is passed into `fmap` will become even more cumbersome than it is in our example above.
Approach 3: Grouping child data using ad-hoc product types
Instead of making a new custom record type each time we want to create groupings of children under a parent, we can take advantage of what we saw earlier when we applied the `<>` operator to tuples: the product of multiple `Semigroup` values is itself a `Semigroup`. In other words, we can apply `<>` to a tuple of any size so long as each member of the tuple supports `<>`.
For the code in the rest of this post, we’ll need the following imports and language extension:
```haskell
{-# LANGUAGE FlexibleContexts #-}

import Data.Coerce
import Data.Map.Strict as Map
import Data.Map.Append
import Data.Semigroup
import Database.Persist
```
We can start by thinking of the general ad-hoc structure we want (notice that this is the generalized version of the concrete type described at the beginning of this post):

```haskell
type GroupedListStructure a b c =
  [(Entity a, [(Entity b, [Entity c])])]
```
And we can make an equivalent structure out of `Map`s instead of lists:

```haskell
type GroupedMapStructure a b c =
  Map (Key a)
      ( Entity a
      , Map (Key b)
            ( Entity b
            , Map (Key c) (Entity c)
            )
      )
```
This structure will ensure that applying the `unionWith` operation will still work as intended. However, we still have the issue of `Map` not having the necessary `Semigroup` behavior and `Entity` not having any instance at all. In the previous section we solved this issue by defining the `GroupedPostResultMap`.
Looking at the implementation, we observe that `GroupedPostResultMap` does not use any information about its key or value type other than the values in the `Monoid` instance. We can extract this pattern by creating a `newtype` like the following:

```haskell
newtype AppendMap k v = AppendMap
  { unAppendMap :: Map k v
  }
```
With this, we can define a `Semigroup` instance whenever `v` is a `Semigroup`. Fortunately, there's already a package that solves this problem: appendmap is a tiny library that depends only on `base` and `containers`, the latter of which you will already be using if you're using `Map`. It exposes an `AppendMap` data type that delegates its `Monoid` and `Semigroup` instances to its elements.
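The instances it provides look roughly like the ones we wrote by hand for `GroupedPostResultMap` (a sketch of the idea, not the library's verbatim source):

```haskell
instance (Ord k, Semigroup v) => Semigroup (AppendMap k v) where
  AppendMap lhs <> AppendMap rhs = AppendMap (Map.unionWith (<>) lhs rhs)

instance (Ord k, Semigroup v) => Monoid (AppendMap k v) where
  mempty = AppendMap Map.empty
```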
We can now change our desired structure to use `AppendMap` instead of `Map`:
```haskell
type GroupedAppendMapStructure a b c =
  AppendMap (Key a)
            ( Entity a
            , AppendMap (Key b)
                        ( Entity b
                        , AppendMap (Key c) (Entity c)
                        )
            )
```
There's one last problem: we still don't have a valid `Semigroup` without a `Semigroup` instance for `Entity`. Since the key in our structure is the `Key` within our `Entity`, the `Semigroup` instance only needs the first value, and can ignore subsequent occurrences of the `Entity`.
Conveniently, the `Data.Semigroup` module provides us with a `newtype` wrapper named `First` that does exactly what we need.
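In GHCi, we can see that `First` keeps the left value and discards the right:

```haskell
λ> First 1 <> First 2
First {getFirst = 1}
```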
So, we can update our ad-hoc structure to use `First` and finally arrive at:

```haskell
type GroupedAppendMapSemigroupStructure a b c =
  AppendMap (Key a)
            ( First (Entity a)
            , AppendMap (Key b)
                        ( First (Entity b)
                        , AppendMap (Key c) (First (Entity c))
                        )
            )
```
We can write a function that will give us this structure:

```haskell
makeGroupedStructure
  :: (Entity a, Entity b, Entity c)
  -> GroupedAppendMapSemigroupStructure a b c
makeGroupedStructure (a, b, c) =
  AppendMap $ Map.singleton
    (entityKey a)
    ( First a
    , AppendMap $ Map.singleton
        (entityKey b)
        ( First b
        , AppendMap $ Map.singleton (entityKey c) (First c)
        )
    )
```
We can now update the `groupData` function we created during our first approach to use this implementation instead:

```haskell
groupWith :: (Monoid m, Coercible m b) => (r -> m) -> [r] -> b
groupWith fn =
  coerce . mconcat . fmap fn

groupData
  :: (Ord (Key a), Ord (Key b), Ord (Key c))
  => [(Entity a, Entity b, Entity c)]
  -> Map (Key a) (Entity a, Map (Key b) (Entity b, Map (Key c) (Entity c)))
groupData =
  groupWith makeGroupedStructure
```
If necessary, we can transform the resulting `Map` structure into the list structure we described earlier by applying `Map.elems` at each level. We can write a helper function to do this:
```haskell
transformMap
  :: Map (Key a) (Entity a, Map (Key b) (Entity b, Map (Key c) (Entity c)))
  -> [(Entity a, [(Entity b, [Entity c])])]
transformMap m =
  fmap (\(parentA, childrenA) ->
    ( parentA
    , fmap (\(parentB, childrenB) ->
        ( parentB
        , Map.elems childrenB
        )
      ) (Map.elems childrenA)
    )
  ) (Map.elems m)
```
Finally, we can compose the two:
```haskell
groupQueryResults
  :: (Ord (Key a), Ord (Key b), Ord (Key c))
  => [(Entity a, Entity b, Entity c)]
  -> [(Entity a, [(Entity b, [Entity c])])]
groupQueryResults =
  transformMap . groupData
```
With this, we've successfully grouped the children at each level and structured the data in the way we set out to at the beginning.
Putting it all together
Let’s apply this approach to the schema we looked at earlier.
```haskell
-- The type representing each row from our Esqueleto query
type PostModel = (Entity User, Entity Post, Entity PostComment)

-- The resulting grouped type we want to achieve
type GroupedPostModel =
  Map UserId
      ( Entity User
      , Map PostId
            ( Entity Post
            , Map PostCommentId (Entity PostComment)
            )
      )

-- The intermediary semigroup type that defines grouping behavior
-- This corresponds directly to GroupedPostModel above
type GroupedPostSemigroup =
  AppendMap UserId
            ( First (Entity User)
            , AppendMap PostId
                        ( First (Entity Post)
                        , AppendMap PostCommentId (First (Entity PostComment))
                        )
            )
```
```haskell
-- Takes our query result and produces values in the semigroup shape above
makeSinglePostGroup :: PostModel -> GroupedPostSemigroup
makeSinglePostGroup (user, post, postComment) =
  AppendMap $ Map.singleton
    (entityKey user)
    ( First user
    , AppendMap $ Map.singleton
        (entityKey post)
        ( First post
        , AppendMap $ Map.singleton (entityKey postComment) (First postComment)
        )
    )

-- Copied directly from the previous section
groupWith :: (Monoid m, Coercible m b) => (r -> m) -> [r] -> b
groupWith fn =
  coerce . mconcat . fmap fn

groupData :: [PostModel] -> GroupedPostModel
groupData =
  groupWith makeSinglePostGroup
```
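If we want the nested list form for rendering, the `transformMap` helper from earlier composes directly with this `groupData`, since `UserId`, `PostId`, and `PostCommentId` are persistent's aliases for the corresponding `Key` types:

```haskell
groupedPosts :: [PostModel] -> [(Entity User, [(Entity Post, [Entity PostComment])])]
groupedPosts = transformMap . groupData
```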
Applying this approach is straightforward once we decide how the data should be grouped. Once the grouping is defined in our `GroupedPostModel` type, writing the rest is purely mechanical. What's more, this approach gives us the flexibility to define our groups however we'd like; we can add additional levels of nesting or more items at each level without increasing the complexity of our `groupData` function.
Conclusion
Writing the most performant SQL comes with a trade-off: the data our application receives from the database will have fanout if there are any one-to-many relationships in our data. The typical way to solve this problem involves defining a function that specifies merge behavior and using the `Map` data type to build our parent-child data, which is the first approach we explored. This works, but has shortcomings.
We set out to find a better solution to this problem, noticing that the `Semigroup` typeclass gives us the append operation we need to generalize the grouping behavior. After a few iterations, the approach we settled on relies on creating an ad-hoc data type to define our intended structure, along with instances of `Semigroup` and `Monoid` to perform the grouping as necessary. This new approach is an improvement in that it allows us to focus primarily on the type of the grouped data, with the in-between being a straightforward mechanical translation of this type. It saves us from writing cumbersome merge operations and instead relies on Haskell's type system to achieve the intended results.
Looking for help with something you're working on? We'd love to hear from you. At Foxhound Systems, we focus on using Haskell to create custom-built software your business can depend on. Reach out to us at info@foxhound.systems.