
Reducing the pain of grouping SQL query results using Haskell

haskell sql performance-optimization

Relational databases allow us to model the associations between different types of data in our system domain. Most application database schemas rely on normalization to avoid data duplication. We use SQL to retrieve this data from a database, but SQL has limitations. When we need data from several tables, we’re forced to make trade-offs in how we query our data, and our query results often do not contain an ideal representation of the relationships between our data entities.

In order to mitigate this limitation of SQL, we typically transform the data we retrieve via our queries in our application layer. With a system written in Haskell, we can use the Semigroup typeclass and the append operation it exposes (<>) to transform the data into the shape we need by defining our desired custom data types and simple transformation functions. In this post we’ll explore this method of solving this problem in more detail.

Limitations of SQL

SQL is an excellent tool for querying data, with queries allowing us to retrieve data from one or several tables in our database in many ways. However, the data that a query can produce has limitations. Namely, a query can only produce two-dimensional data consisting of columns and rows. We can write JOIN queries to retrieve data from several tables at once, but are still limited to producing two-dimensional data.

This is not a problem when the data between joined tables has a one-to-one relationship, but when data has one-to-many or many-to-many relationships, we’ll see repetition in our query results (e.g. if a user has authored many posts, we’ll see that user repeat in any query that joins the tables together). With this in mind, we have several strategies we can apply in how we query our data.

Query strategy 1: N+1 queries

When we have one-to-many relationships in our data, we can eschew using JOINs altogether and instead run many simple queries, the number of which scales with the number of related pieces of data. Typically this results in running successive queries in a loop to get each related piece of data:

authors = queryAuthorsByCountry('USA')          // 1
foreach author <- authors:                      // N
    authors[author.id]['books'] = queryBooksByAuthorId(author.id)

The benefit of this approach is that it is simple and allows us to easily write application code that structures the data in a format that accurately models the relationships between entities. But this approach has a major performance trade-off: the number of queries grows linearly (the “N” in N+1) with the number of items we’re working with. The workload grows multiplicatively when we are working with data that is nested across multiple levels, and the N+1 can become (M*N)+N+1.

authors = queryAuthorsByCountry('USA')          // 1
foreach author <- authors:                      // N
    books = queryBooksByAuthorId(author.id)
    foreach book <- books:                      // N * M
        book_genre_tags = queryBookGenresByBookId(book.id)
        // ...

In practice, this multiplicative growth in the number of queries executed can quickly result in significant slowdowns in an application. An application database schema often has half a dozen or more tables that require querying in response to a single request, and it’s not uncommon to see this approach result in hundreds or thousands of queries being executed per request. What’s more, it’s easy for a developer to accidentally write code that produces queries like this: many ORM libraries in object-oriented languages produce queries using this pattern when accessing related data entities.

Query strategy 2: Query per table

An improvement over N+1 queries is to run one query per table. There are several variations in how this can be done in practice, but the most common approach is to use the results from each query to construct a list of keys to constrain each successive query by.

authors = queryAuthorsByCountry('USA')
author_id_list = []
foreach author <- authors:
    author_id_list.push(author.id)

books = queryBooksByAuthorIds(author_id_list)
book_id_list = []
foreach book <- books:
    book_id_list.push(book.id)

book_genre_tags = queryBookGenresByBookIds(book_id_list)

This approach results in significantly better performance than the N+1 approach in many situations. It’s easy to see that by not running the queries on the related table in a loop, the number of queries executed remains relatively small. The downside, however, is that although we have dramatically reduced the number of queries we’re executing, the data we now have in memory is in distinct lists grouped by type, which does not capture any of the parent-child relationships between the data entities.

When rendering a user interface or returning a JSON API response, it’s often necessary to structure the data so that child entities are nested within their parents (e.g. each author contains a list of books, and each book contains a list of genre tags). To do this, we have to make multiple passes through the data to create the required nested structure. This requires writing additional code, the complexity of which scales both with how deeply nested the relationships are and with how many related types of entities are retrieved at each level.

Query strategy 3: Single query with joins

In order to cut down the amount of querying we’re doing, we can write a single query that uses JOIN operations to retrieve data from each of the related tables. This approach typically results in the best performance, as long as queries are written correctly and columns are indexed where necessary. The downside is that our data is subject to fanout: whenever a one-to-many relationship exists, the joined table contributes more rows than the primary table, and the primary table’s data repeats across those rows.

If we had a schema like the following, it would be possible to write several queries in which a single user’s data repeats over several rows. This would occur when joining users and posts for any user that had written more than one post. The same would be true if we retrieved posts with post_comments, for posts with more than one comment.

   users
   +---------+-------+
+->| id      | int   |<----+      posts
|  | name    | text  |     |      +-------------+------------+
|  | email   | text  |     |      | id          | int        |<-+
|  +---------+-------+     +------+ user_id     | int        |  |
|                                 | title       | text       |  |
|                                 | body        | text       |  |
|                                 | created_at  | timestamp  |  |
|  post_comments                  +-------------+------------+  |
|  +-------------+------------+                                 |
|  | id          | int        |                                 |
|  | post_id     | int        +---------------------------------+
+--+ user_id     | int        |
   | body        | text       |
   | created_at  | timestamp  |
   +-------------+------------+

We can illustrate the fanout problem for the above schema by running the following query:

SELECT
    users.id AS user_id,
    users.name,
    posts.id AS post_id,
    posts.title,
    posts.created_at
FROM users
LEFT JOIN posts
ON users.id = posts.user_id;

This query might yield results like the following:

user_id | name       | post_id | title                   | created_at
--------+------------+---------+-------------------------+--------------------
      1 | John Smith |      10 | How to write SQL        | 2020-11-18 13:06:02
      1 | John Smith |      11 | How to write SQL pt. 2  | 2020-11-21 18:17:34
      1 | John Smith |      12 | How to write SQL pt. 3  | 2020-12-01 15:26:22
      2 | Karen Doe  |      17 | How to really write SQL | 2021-01-18 11:19:44

We can see that the users table data containing John Smith’s information is repeated multiple times in our query results, with several rows containing his primary key (a user_id value of 1). This is because a query like this will always produce a denormalized form, with each row’s data repeated as many times as necessary to match the number of matching rows in the joined table.

This fanout grows as we join in more tables. If we added a LEFT JOIN on post_comments to our query, and there were multiple comments for each post, we would see John Smith’s information duplicated in our query results even more times. We’d also see information from the posts table duplicated.

In order to structure the data in a way that models parent-child relationships, we can write application code that will transform the data into the necessary format. We’ll apply several different strategies for doing this in Haskell.



Transforming denormalized data to model parent-child relationships

We’re big fans of the Esqueleto library, since it allows us to write what looks and feels like SQL in a Haskell EDSL, giving us the benefits of type safety and compile-time checking of our SQL queries. Esqueleto uses a data type that looks roughly like the following to model entities that have been saved in the database, with entityKey being the primary key of a record.

data Entity a = Entity
    { entityKey :: Key a
    , entityVal :: a
    }
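
For example, the John Smith row from our earlier query results might be represented like this (a sketch; we’re assuming a persistent-generated User type with name and email fields, and using toSqlKey from Database.Persist.Sql to construct the key):

import Database.Persist.Sql ( toSqlKey )

-- Hypothetical value; the User constructor shown here is assumed,
-- not taken from this post's schema definitions.
johnSmith :: Entity User
johnSmith = Entity
    { entityKey = toSqlKey 1                           -- user_id = 1
    , entityVal = User "John Smith" "john@example.com"
    }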

So a post that is saved in the database would be represented by Entity Post. The result of our query that joins users, posts, and post comments together would look like:

[(Entity User, Entity Post, Entity PostComment)]

But we want to transform this into:

[(Entity User, [(Entity Post, [Entity PostComment])])]
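
Concretely, if John Smith has two posts and his first post has two comments, the reshaping we want looks like this (a sketch, with informal names standing in for full Entity values):

-- Query result rows:
--   [ (john, post10, comment1)
--   , (john, post10, comment2)
--   , (john, post11, comment3) ]
--
-- Grouped structure we want:
--   [ (john, [ (post10, [comment1, comment2])
--            , (post11, [comment3]) ]) ]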

Approach 1: Grouping child data with a custom merge function

A common approach to transforming the former into the latter is to write a pair of functions like the following:

{-# LANGUAGE FlexibleContexts #-}

import qualified Data.Map         as Map
import           Database.Persist ( Entity (..), Key )

-- Build a Map keyed on each parent's primary key, merging the
-- children of repeated parents into a single list.
groupData :: (Ord (Key a)) => [(Entity a, b)] -> [(Entity a, [b])]
groupData res =
    Map.elems $ foldr
        (\(a, b) accumulator ->
            Map.insertWith
                mergeData
                (entityKey a)
                (a, [b])
                accumulator
        ) Map.empty res

-- Keep one copy of the parent and concatenate the child lists.
mergeData :: (a, [b]) -> (a, [b]) -> (a, [b])
mergeData (a, b) (_, b') =
    (a, b ++ b')

In order to apply groupData to solve our problem, we would have to massage our data and apply the function recursively, since we have multiple levels at which the data must be grouped (sketched below). This approach works, but it has shortcomings. The function above only handles a single level of parent-child relationships, so depending on the specific results of a query, we may end up writing several variations of groupData and mergeData. We end up with a lot of similar-but-slightly-different code whenever we need to apply this grouping operation to the results of various queries. We think we can improve on this approach.
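
For reference, here is a sketch of that recursive application for our three-level result: each row is massaged into nested pairs, then grouped one level at a time.

groupTwoLevels :: (Ord (Key a), Ord (Key b))
               => [(Entity a, Entity b, Entity c)]
               -> [(Entity a, [(Entity b, [Entity c])])]
groupTwoLevels rows =
    fmap (fmap groupData)                             -- group the c's under each b
        (groupData                                    -- group the (b, c) pairs under each a
            (fmap (\(a, b, c) -> (a, (b, c))) rows))  -- massage rows into nested pairs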

Approach 2: Grouping child data by writing our own Semigroup and Monoid instances

If we look at the mergeData function, we’ll notice that it looks like mappend, or the <> operator defined by the Semigroup typeclass, which we can apply to any values whose types have a Semigroup instance defined. For example, we can apply this operation to tuples so long as each member is also an instance of Semigroup:

genericMergeData :: (Semigroup a, Semigroup b) => (a, b) -> (a, b) -> (a, b)
genericMergeData (a, b) (a', b') =
    (a, b) <> (a', b')

It should be noted that we don’t need a named function like the one above at all: we can simply modify groupData to replace the call to mergeData with (<>). With this observation in mind, we can try to leverage the Semigroup typeclass and write our own instances.
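
As a sketch, here is the general shape of that modified groupData, with merging delegated entirely to the value type’s Semigroup instance (a hypothetical generalization, not the code we’ll end up with):

-- Group key-value pairs, merging values for duplicate keys
-- with the value type's Semigroup instance.
groupSemigroup :: (Ord k, Semigroup v) => [(k, v)] -> Map.Map k v
groupSemigroup =
    foldr (\(k, v) acc -> Map.insertWith (<>) k v acc) Map.empty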

data GroupedPostResult = GroupedPostResult
    { groupedResultUser :: Entity User
    , groupedResultPosts :: [Entity Post]
    }

instance Semigroup GroupedPostResult where
    GroupedPostResult a b <> GroupedPostResult _ b' = GroupedPostResult a (b ++ b')

What we have now is a single operator instead of a custom mergeData function. This by itself doesn’t help us much, but we can also observe that applying foldr with (<>) is very similar to mconcat. However, since mconcat is an operation on the Monoid typeclass, we need to choose a type that has such an instance as well. To accumulate the grouped values for each parent, we want a Map (Key User) GroupedPostResult, using Map from the Data.Map.Strict module in the containers package.

But there’s a problem. The Semigroup instance for Map is a left-biased union that discards values from the right-hand side. From the containers documentation:

The Semigroup operation for Map is union, which prefers values from the left operand. If m1 maps a key k to a value a1, and m2 maps the same key to a different value a2, then their union m1 <> m2 maps k to a1.
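
We can see the difference between the default behavior and the behavior we want in GHCi:

ghci> Map.fromList [(1, "a")] <> Map.fromList [(1, "b")]
fromList [(1,"a")]
ghci> Map.unionWith (<>) (Map.fromList [(1, "a")]) (Map.fromList [(1, "b")])
fromList [(1,"ab")]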

We don’t want to discard data, we want to merge it. To get this behavior, we can write our own newtype around Map with a Semigroup instance that does what we want.

newtype GroupedPostResultMap = GroupedPostResultMap
    { unGroupedPostResultMap :: Map (Key User) GroupedPostResult
    }

instance Semigroup GroupedPostResultMap where
    GroupedPostResultMap lhs <> GroupedPostResultMap rhs = GroupedPostResultMap $ Map.unionWith (<>) lhs rhs

instance Monoid GroupedPostResultMap where
    mempty = GroupedPostResultMap Map.empty
    mappend = (<>)

groupData :: [(Entity User, Entity Post)] -> [(Entity User, [Entity Post])]
groupData =
    fmap (\(GroupedPostResult a b) -> (a, b)) .
        Map.elems .
        unGroupedPostResultMap .
        mconcat .
        fmap (\(a, b) -> GroupedPostResultMap $
                            Map.singleton (entityKey a) (GroupedPostResult a [b])
             )

While this works, it’s a lot of boilerplate code to write just to get the correct Map behavior. Moreover, if the data we’re working with has more than one level of nesting, the final lambda function that is passed into fmap will become even more cumbersome than it is in our example above.

Approach 3: Grouping child data using ad-hoc product types

Instead of creating a new custom record type each time we want to group children under a parent, we can take advantage of what we saw earlier when we applied the <> operator to tuples: the product of multiple Semigroup values is itself a Semigroup. In other words, we can apply <> to a tuple so long as each member of the tuple supports <>.
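
For example:

ghci> ("a", [1]) <> ("b", [2])
("ab",[1,2])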

For the code in the rest of this post, we’ll need the following imports and language extension:

{-# LANGUAGE FlexibleContexts #-}

import           Data.Coerce
import           Data.Map.Strict  ( Map )
import qualified Data.Map.Strict  as Map
import           Data.Map.Append
import           Data.Semigroup
import           Database.Persist

We can start by thinking of the general ad-hoc structure we want (notice that this is the generalized version of the concrete type described at the beginning of this post):

type GroupedListStructure a b c =
    [(Entity a, [(Entity b, [Entity c])])]

And we can make an equivalent structure out of Map instead of lists:

type GroupedMapStructure a b c =
    Map ( Key a )
        ( Entity a
        , Map ( Key b )
              ( Entity b
              , Map (Key c) (Entity c)
              )
        )

This structure ensures that the unionWith operation still works as intended. However, we still have the issue of Map not having the Semigroup behavior we need and Entity not having an instance at all. In the previous section we solved this by defining GroupedPostResultMap.

Looking at that implementation, we can observe that GroupedPostResultMap doesn’t depend on anything specific to its key or value types: it only needs an orderable key and a value that is a Semigroup. We can extract this pattern by creating a newtype like the following:

newtype AppendMap k v = AppendMap
    { unAppendMap :: Map k v
    }

With this, we can define a Semigroup instance whenever v is a Semigroup. Fortunately, there’s already a package that solves this problem: appendmap is a tiny library that depends only on base and containers, the latter of which you’ll already be using if you’re using Map. It exposes an AppendMap data type that delegates its Monoid and Semigroup instances to its values.
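
Its instances look roughly like this (paraphrased; see the package source for the real definitions):

instance (Ord k, Semigroup v) => Semigroup (AppendMap k v) where
    AppendMap lhs <> AppendMap rhs = AppendMap (Map.unionWith (<>) lhs rhs)

instance (Ord k, Semigroup v) => Monoid (AppendMap k v) where
    mempty = AppendMap Map.empty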

We can now change our desired structure to use AppendMap instead of Map:

type GroupedAppendMapStructure a b c =
    AppendMap ( Key a )
              ( Entity a
              , AppendMap ( Key b )
                          ( Entity b
                          , AppendMap (Key c) (Entity c)
                          )
              )

There’s one last problem: we still don’t have a valid Semigroup without a Semigroup instance for Entity. Since the key in our structure is the Key within our Entity, the Semigroup instance only needs to keep the first value and can ignore subsequent occurrences of the same Entity.

Conveniently, the Data.Semigroup module provides a newtype wrapper named First that does exactly what we need.
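
First keeps the leftmost of any two values it combines:

ghci> First "john" <> First "duplicate john"
First {getFirst = "john"}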

So we can update our ad-hoc structure to use First, finally arriving at:

type GroupedAppendMapSemigroupStructure a b c =
    AppendMap ( Key a )
              ( First (Entity a)
              , AppendMap ( Key b )
                          ( First (Entity b)
                          , AppendMap (Key c) (First (Entity c))
                          )
              )

We can write a function that will give us this structure:

makeGroupedStructure :: (Entity a, Entity b, Entity c)
                     -> GroupedAppendMapSemigroupStructure a b c
makeGroupedStructure (a, b, c) =
    AppendMap $ Map.singleton
            ( entityKey a )
            ( First a
            , AppendMap $ Map.singleton
                ( entityKey b )
                ( First b
                , AppendMap $ Map.singleton (entityKey c) (First c)
                )
            )

We can now update the groupData function we created during our first approach to use this implementation instead:

-- Wrap each row in the semigroup shape, merge all of the rows with
-- mconcat, then strip away the newtype wrappers with coerce.
groupWith :: (Monoid m, Coercible m b) => (r -> m) -> [r] -> b
groupWith fn =
    coerce . mconcat . fmap fn


groupData :: (Ord (Key a), Ord (Key b), Ord (Key c))
          => [(Entity a, Entity b, Entity c)]
          -> Map (Key a) (Entity a, Map (Key b) (Entity b, Map (Key c) (Entity c)))
groupData =
    groupWith makeGroupedStructure

If necessary, we can transform the resulting Map structure into the list structure we described earlier by applying Map.elems at each level. We can write a helper function to do this:

transformMap :: Map (Key a) (Entity a, Map (Key b) (Entity b, Map (Key c) (Entity c)))
             -> [(Entity a, [(Entity b, [Entity c ])])]
transformMap m =
    fmap (\(parentA, childrenA) ->
             ( parentA
             , fmap (\(parentB, childrenB) ->
                        ( parentB
                        , Map.elems childrenB
                        )
                    ) (Map.elems childrenA)
             )
         ) (Map.elems m)

Finally, we can compose the two:

groupQueryResults :: (Ord (Key a), Ord (Key b), Ord (Key c))
                  => [(Entity a, Entity b, Entity c)]
                  -> [(Entity a, [(Entity b, [Entity c])])]
groupQueryResults =
    transformMap . groupData

With this, we’ve successfully grouped the children at each level and structured the data in the way we set out to in the beginning.

Putting it all together

Let’s apply this approach to the schema we looked at earlier.

-- The type representing each row from our Esqueleto query
type PostModel = (Entity User, Entity Post, Entity PostComment)

-- The resulting grouped type we want to achieve
type GroupedPostModel =
    Map UserId
        ( Entity User
        , Map PostId
              ( Entity Post
              , Map PostCommentId (Entity PostComment)
              )
        )

-- The intermediary semigroup type that defines grouping behavior
-- This corresponds directly to GroupedPostModel above
type GroupedPostSemigroup =
    AppendMap ( UserId )
              ( First (Entity User)
              , AppendMap ( PostId )
                          ( First (Entity Post)
                          , AppendMap ( PostCommentId )
                                      ( First (Entity PostComment) )
                          )
              )

-- Takes our query result and produces values in the semigroup shape above
makeSinglePostGroup :: PostModel -> GroupedPostSemigroup
makeSinglePostGroup (user, post, postComment) =
    AppendMap $ Map.singleton
            ( entityKey user )
            ( First user
            , AppendMap $ Map.singleton
                ( entityKey post )
                ( First post
                , AppendMap $ Map.singleton (entityKey postComment) (First postComment)
                )
            )

-- Copied directly from the previous section
groupWith :: (Monoid m, Coercible m b) => (r -> m) -> [r] -> b
groupWith fn =
    coerce . mconcat . fmap fn

groupData :: [PostModel] -> GroupedPostModel
groupData =
    groupWith makeSinglePostGroup

Applying this approach is straightforward once we decide how the data should be grouped. After the grouping is defined in our GroupedPostModel type, writing the rest is purely mechanical. What’s more, this approach gives us the flexibility to define our groups however we’d like; we can add additional levels of nesting or add more items at each level without increasing the complexity of our groupData function.
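
As a final sketch, usage might look like the following, where fetchPostModels stands in for a hypothetical Esqueleto query returning [PostModel]:

import Database.Persist.Sql ( SqlPersistT )

-- Hypothetical handler; fetchPostModels is assumed, not defined in this post.
getGroupedPosts :: SqlPersistT IO GroupedPostModel
getGroupedPosts = do
    rows <- fetchPostModels        -- run the joined query
    pure (groupData rows)          -- collapse the fanout in memory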

Conclusion

Writing the most performant SQL comes with a trade-off: the data that our application receives from the database will have fanout if there are any one-to-many relationships in our data. The typical way to solve this problem involves defining a function that specifies merge behavior and using the Map data type to build up our parent-child data, which is the first approach we explored. This works, but has shortcomings.

We set out to find a better solution to this problem, noticing that the Semigroup typeclass gives us the append operation we need to generalize the grouping behavior. After a few iterations, the approach we settled on relies on creating an ad-hoc data type to define our intended structure, with instances of Semigroup and Monoid performing the grouping as necessary. This new approach is an improvement in that it allows us to focus primarily on the type of the grouped data, with the in-between being a straightforward mechanical translation of that type. This saves us from writing cumbersome merge operations and instead relies on Haskell’s type system to achieve the intended results.


Looking for help with something you’re working on? We’d love to hear from you. At Foxhound Systems, we focus on using Haskell to create custom-built software your business can depend on. Reach out to us at info@foxhound.systems.