
Trials and Tribulations of Incremental Models

Vincey is best known for wading through hard technical problems, which has seen her lead Canva’s standards for elegant and efficient code. She is a huge Python and dbt fan, loves patting dogs, and has recently developed an appreciation for picnics.

Incremental models are core to dbt, and are a key solution to how data teams balance the need for computationally efficient and timely data.

This talk will discuss some challenges with building incremental models at scale, and how our approach has evolved over time in a high growth company with over 30 third party sources of data. On a technical level, this will cover the is_incremental() SQL block logic, as well as custom materializations that make building models simpler and more efficient.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

Amada Echeverría: [00:00:00] Welcome everyone, and thank you for joining us at Coalesce 2021. This is our last session and I don’t know about you, but I’m feeling very excited. My name is Amada Echeverría and I’m on the community team at dbt Labs. I’ll be the host of this session, Trials and Tribulations of Incremental Models, presented by two-time Coalesce speaker Vincey Au.

Vincey is currently a data analyst at Canva, where she leads analytics for the video product. Fun fact: she is known for having laser eyes in PR reviews and prides herself on lifting standards. She can detect style guide deviations, or any time folks are not adhering to the style guide, in the blink of an eye. On a more personal note, Vincey is a huge Python and dbt fan (dbt more so, obviously). She loves petting dogs and going on picnics, and last but definitely not least, Vincey has been an active contributor to the community [00:01:00] Slack for almost two years and has been an active participant at the Sydney dbt meetup since before that.

Incremental models are core to dbt and are a key solution to how data teams balance the need for computationally efficient and timely data. This 30 minute talk will discuss some challenges with building incremental models at scale and how Vincey’s team’s approach has evolved over time in a high growth company with over 30 third-party sources of data.

Before we pass it over to Vincey, I want to share some pro tips for making the most out of this session. All chat conversation is taking place in the coalesce-trials-tribulations-models channel in dbt Slack. If you’re not yet a part of the dbt Slack community, you have time to join now. Seriously, go do it: visit getdbt.com/community and search for coalesce-trials-tribulations-models when you arrive. We encourage you to set up Slack and your browser side-by-side. And [00:02:00] in Slack, you’ll have a great experience if you ask questions of our speaker and other attendees, make comments, share memes, or react in the channel at any point during Vincey’s session. And if you’re new to the dbt community, or this is your first time at Coalesce, please chime in if you like; everyone here loves hearing from you.

To kick us off, our chat champion, Pat Kearns, a senior analytics engineer at dbt Labs has started a thread to have you introduce yourself. Let us know where you’re calling in from and inform us on a very important preference: ice cream flavor. After the session, Vincey will be available in the Slack channel to answer questions. Let’s get started. Over to you.

Vincey Au: Hi, everyone. So my name is Vincey and yes, up until last week I was a data analyst at Canva. But today, what I want to talk to you about is one of the most life-changing things about dbt: incremental models, and all the trials and tribulations we [00:03:00] have been through with them at Canva.

There’s a soft spot in my heart for them. So a little bit about Canva to set the context. So we’re on a mission to empower the world to design, and we do this by building design software that enables ordinary people like you and me to create beautiful Facebook posts, shoots, videos, and presentations, just like this one.

We’re a software as a service platform that operates on a freemium model, and we also operate on multiple platforms, including your web browser and mobile. And we’ve grown pretty quickly. So back in 2019, we had 15 million monthly active users. When I spoke at Coalesce this time last year, we had doubled to 30 million monthly active users.

And now, just one year later, we’ve grown over 2x again, so we now have 65 million monthly active users. So we’re working with quite a few users, and our data has grown just as fast. We’ve also gone from 4 to [00:04:00] over 80 analysts in just three years, and when we crunched some numbers, we have over 2PB of data.

And we have over a hundred thousand events per second generated, and more than half our data was generated in the last six months, which is absolutely terrifying. We use the modern data stack, that classic combination of Snowflake, Fivetran, dbt, and Looker as our BI tool. And so obviously Canva has grown an incredible amount.

We have more users generating more data, and we’ve launched more products and more features, which are also generating more data. But alongside this, we’ve also evolved our data maturity. So two years ago, our stakeholders were still asking fairly straightforward questions, like how many signups and MAUs do we have?

But these days they’re asking more complex questions: what are the actions [00:05:00] that drive a user to retain? And so these complex questions need more complex modeling, and of course your stakeholders would like the answers to these questions yesterday. And so the real problem is that our data has grown so much that the processes we were using a year ago really don’t work with the amounts of data we have today.

[00:05:23] More data, more problems #

Vincey Au: I think this meme really summarizes our situation: more data, more problems. dbt has given many of us a lot of joy, and for me, being able to solve some of our big data problems with incremental models was one of those joys. And so today, what you’re going to learn in this talk is, firstly, some problems that you might run into when you’re trying to use incremental models in high growth organizations, and secondly, some solutions that we have used at Canva to solve these problems.

So hopefully, you can learn [00:06:00] something to take back to your job. And thirdly, we’re going to discuss the trade-offs between impact and complexity of these solutions so you can decide whether you should use them. One thing to note about everything I’m talking about today: all of the solutions I’ll tell you about gradually came out of multiple different problems over the last two years.

So they can be a little bit distinct, and they don’t necessarily all need to be combined together. But let’s talk about incremental models. For those of you who don’t know what an incremental model is, here is a quick summary. Let’s say we have a signup completed event that we’re currently materializing as a table every single day.

So we load every row in this table every day. And this is what the SQL might look like: it’s a simple select star from our events source. Over time this might creep up to over an hour, at which you [00:07:00] might just suddenly decide, hey, this is a bit much. And one way we can make this faster is to materialize it incrementally.

And so there are two main components to this. The first is to change the dbt config at the top of your file; here we just have materialized as incremental. The second is to add a piece of logic that only runs incrementally. The first time you run this model, the table doesn’t exist in your data warehouse, so the is_incremental() condition is going to return false, and dbt is going to leave your incremental logic out. When you run the model, it acts like it always has: it’s going to process the whole data set and store it into a table, which we’re calling fct_user_signup. But if the table already exists, then on subsequent runs dbt will recognize this and it’s going to include [00:08:00] the is_incremental() logic inside the compiled SQL.
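To make this concrete, here is a minimal sketch of what such a model might look like; the table and column names are illustrative rather than Canva’s actual code.

```sql
-- models/fct_user_signup.sql (a sketch; names are illustrative)
{{ config(materialized='incremental') }}

select
    event_id,
    user_id,
    event_time
from {{ source('events', 'signup_completed') }}

{% if is_incremental() %}
-- on incremental runs, only pick up events newer than what is already in the table
where event_time > (select max(event_time) from {{ this }})
{% endif %}
```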

So the logic here is pretty straightforward. All we’re trying to do is query all the new events that have come in since we last updated our fct_user_signup table, so we’re just getting the max event time from the existing table. dbt stores the results of this in a little temporary table, as you can tell from the dbt suffix in its name,

and then this is the generated query that merges all this new data into the existing table. And with this, you can expect that the model now takes substantially less time to run, let’s say five minutes per day, and there are quite a few advantages here. Firstly, we’re saving the time taken to run the model, which means fresher data for your favorite stakeholder.

And the second is that because it’s using less compute, we’re also saving money. So what’s the catch? It turns out that incremental [00:09:00] models introduce a lot of mental overhead that otherwise wouldn’t occur, which leads us to our first problem: what if the data arrives late? What do you do if you have some late-arriving events?

The first thing to do is to actually size up how big of a problem it is with your data. In our events, we get both the timestamp of the event and the time we received it, and from this we found that for most events, we receive 99% of the data within one day. And sometimes this was dependent on the event.

[00:09:33] Figure out the lookback window #

Vincey Au: So for example, we found that mobile events have an even longer lag. But ultimately what you’re doing here is making a trade-off between perfectly complete data and computation: the smaller your lookback window, the less you have to compute on an incremental run.

Now, once we know our lookback window, we can add it to our logic. [00:10:00] So here we just add a where clause that takes the max event time from the existing table and subtracts this one-day lookback window. Now at Canva, we apply this to all our data, so we decided to standardize this with a macro very early on.
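Written out by hand in a model, that filter might look roughly like this (a sketch, with a one-day lookback):

```sql
{% if is_incremental() %}
-- go back one extra day past the latest loaded event to catch late-arriving data
where event_time > (
    select dateadd('day', -1, max(event_time)) from {{ this }}
)
{% endif %}
```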

So we created a macro called get_incremental_start_time, which takes in the table, the time column, and also the lookback window. And there are a couple of things I want to point out here. The first is that we define a default lookback variable in the dbt_project.yml file. And this is really useful because if we introduce a change that somehow affects all models, then we can change it all in one place.

So for example, if users could use the product offline, then the events might take days to be sent, and we might want to increase this. And the second thing is that instead of actually placing the subquery in the SQL, we run this query at compile [00:11:00] time and insert the calculated datetime into the SQL.

And the reason we did this was because it ended up being something like five times faster on some really complex models which have a lot of subqueries and CTEs, so it just helps the warehouse engine work a little bit better. And thirdly, I don’t know why, but I feel like I always experience getting event data from, like, the 1970s, before the product ever existed.

So the third thing we did was to make sure that we don’t pick up any data before Canva was launched. And this is really nice when you just want to plot all your signups over time in a downstream dashboard tool. Nice and clean. And so here’s how we use it. We simply call the macro at the top of the model on line two here,

and then we use it in our SQL on line seven, and notice here that we don’t need to add the is_incremental() condition into [00:12:00] the model SQL itself, because it’s baked into the macro. And this makes the core business logic of the model easier to read. And so the macro is largely the same as doing it the long way, where you have a subquery that takes the max event time minus the lookback window.

But in our case, our macro actually compiles to just the datetime, so this is what it looks like if you look at your data warehouse query history.
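Here is a rough sketch of what a macro like this could look like. The macro name comes from the talk, but the argument names, the var name, and the launch date are assumptions rather than Canva’s actual implementation.

```sql
-- macros/get_incremental_start_time.sql (a sketch)
{% macro get_incremental_start_time(relation, time_column, lookback_days=none) %}

    {# fall back to a project-wide default defined in dbt_project.yml (var name assumed) #}
    {%- set lookback_days = lookback_days if lookback_days is not none
                            else var('default_lookback_days', 1) -%}
    {# never scan data from before the product launched (date is illustrative) #}
    {%- set launch_date = "'2013-01-01'::timestamp" -%}

    {%- if is_incremental() and execute -%}
        {# run the cutoff query while the model is compiling, so the warehouse
           sees a literal timestamp instead of a subquery #}
        {%- set cutoff_query -%}
            select greatest(
                dateadd('day', -{{ lookback_days }}, max({{ time_column }})),
                {{ launch_date }}
            )
            from {{ relation }}
        {%- endset -%}
        {%- set start_time = run_query(cutoff_query).columns[0].values()[0] -%}
        {{ return("'" ~ start_time ~ "'::timestamp") }}
    {%- else -%}
        {{ return(launch_date) }}
    {%- endif -%}

{% endmacro %}
```

And in a model it would be called something like this, with the is_incremental() branching hidden inside the macro:

```sql
-- models/fct_user_signup.sql (usage sketch)
{{ config(materialized='incremental') }}

{% set start_time = get_incremental_start_time(this, 'event_time', lookback_days=1) %}

select
    event_id,
    user_id,
    event_time
from {{ source('events', 'signup_completed') }}
where event_time >= {{ start_time }}
```

Because the macro returns a literal, the compiled SQL in the warehouse query history shows just a datetime, as described above.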

We’ve been using this macro for about two years now. It’s really great that it’s easy to change the default lookback window project-wide, and it’s also more efficient to compute the cutoff during compile time. And something that is both a pro and a con is that it abstracts away all the logic: it makes the business logic simpler, but it also makes it more difficult for an analyst to understand what is happening, although maybe they don’t really need to. [00:13:00] So our second problem is then: what if you’re combining multiple sources with varying freshness in an incremental model? And so let’s consider a scenario where we have an e-commerce website, and their definition of an active user is someone who authenticated.

So they signed up or they logged in, or they made a purchase on the website. So now let’s consider where these events come from. Our authentication events might come directly from the browser, whereas purchases might be stored in a production database first, for example MySQL, which is then copied into the data warehouse every day.

So they come from three different sources that are loaded into the warehouse independently of each other. And the way we would generally count active users using SQL would be to first create a table of all the active events for every user. So in this case, we’re just going to union the user authentication events from the front end [00:14:00] with the purchase data from a backend system.

And this is what it might look like if the table was built incrementally. Notice here that both incremental logic blocks are using the exact same piece of SQL: they’re just trying to retrieve the latest event that already exists in our table.
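As a sketch (table and column names are illustrative), the combined model might look something like this:

```sql
-- models/fct_active_events.sql -- the naive version (a sketch)
{{ config(materialized='incremental') }}

with authentications as (

    select user_id, event_time, 'authentication' as event_source
    from {{ ref('stg_authentication_events') }}
    {% if is_incremental() %}
    -- cutoff based on the latest event across *all* sources already in the table
    where event_time > (select max(event_time) from {{ this }})
    {% endif %}

),

purchases as (

    select user_id, event_time, 'purchase' as event_source
    from {{ ref('stg_purchases') }}
    {% if is_incremental() %}
    -- the exact same cutoff again
    where event_time > (select max(event_time) from {{ this }})
    {% endif %}

)

select * from authentications
union all
select * from purchases
```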

And this latest max event time could come from either the user authentication events or the purchases. And so this works really well when the sources are up to date. So here we have both tables up to date until Sunday the 5th of December, so five days ago, and let’s say the sixth is okay as well. And then the next day is when you might start to get some issues.

So here, our authentication events have the latest data, but something’s gone a little wrong with loading purchases into our warehouse. The [00:15:00] same thing happens on the eighth, and it keeps going on the ninth as well. So let’s consider what happens to our model now. The max timestamp for the combined model is now the 9th of December, but our one-day lookback window is only going to go back to between the seventh and the eighth.

That’s where the dotted line is, and it’s not even going to go back as far as when purchases were last loaded. And so if the purchase data suddenly catches up, then we’re only going to load in data from the dotted line onwards, and we’d completely miss out on loading the data from the 7th of December into our table.

And so we lose data and our model is incorrect. And the issue here is really that we’re not taking into account every source that we’re dependent on. When we’re doing the incremental logic here, we’re just saying, hey, give me the latest event time from the [00:16:00] combined events. And one really nice way to fix this is to use a min-max algorithm.

So for each source of data, so authentication and purchases, you want to fetch the latest event date from it, and then you take the minimum of those, and that’s the date that you build the incremental model from. So in the previous example, we got to the ninth and then our purchase data suddenly caught up.

Sorry, in this case, when we were on the ninth, purchases were last loaded three days ago, and because the latest purchase event date is the smaller one, we rebuild from one day before that. So we rebuild from before the failure, and when we next load the table, all our data will get refreshed.[00:17:00]
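In SQL, the min-max cutoff might look roughly like this, substituted into each source’s incremental filter. It is a sketch, and it assumes the table keeps an event_source column so the per-source maximums can still be computed:

```sql
{% if is_incremental() %}
-- for each source, find the latest event already loaded, then rebuild from
-- one day before the *least* fresh of them
where event_time > (
    select dateadd('day', -1, min(latest_event_time))
    from (
        select event_source, max(event_time) as latest_event_time
        from {{ this }}
        group by event_source
    ) as per_source
)
{% endif %}
```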

[00:17:00] Pros and cons #

Vincey Au: And so this is really nice because it works, and it works really nicely if you don’t have to worry about your downstream models being incremental. So it works great if they are tables or views. The con is that it’s really complicated to repeat downstream. It’s actually impossible to do once you aggregate the data, because you lose the differentiator between what comes from authentication and what comes from purchases. And so we went back and forth on this a couple of times, knowing that the fix stops at that layer of the warehouse. And we brought this problem to our data engineers, who made a suggestion.

They were like, what about if we use the warehouse load timestamp?

So let me first define what the warehouse load timestamp is. When we load the data into the warehouse, we also store the exact time that it was loaded in, in a column called inserted_at, and we would consider this metadata. And this is quite different from our event time.

And so [00:18:00] here’s what our previous active events table might look like. You’ll notice here that the logic is much simpler. We no longer need a lookback window, because we simply insert any new rows that have come in since the last time we inserted. So the logic is so much simpler.
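A sketch of what each source’s filter becomes with the load timestamp (assuming the inserted_at column is carried through into the model):

```sql
{% if is_incremental() %}
-- compare load time to load time: no lookback window needed
where inserted_at > (select max(inserted_at) from {{ this }})
{% endif %}
```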

Pro: it works. Con: for an analyst, it seems really silly and maybe even a little illogical to not use the event time, since that is what your fact and report tables need to use. And so internally, this kind of led to a fairly poor uptake of this solution. And the other thing is that this doesn’t really solve the problem, it just pushes the problem down another layer in your data warehouse models.

[00:18:51] Use freshness checks #

Vincey Au: So that actually might be good enough in a lot of cases. And so something we discussed occasionally was using [00:19:00] freshness checks on our sources and then not running models with stale sources. Now, we haven’t implemented freshness checks at Canva yet, and there are a few blockers as to why. The first is we had a lot of mixed opinions on whether failing fast was the right thing to do for our warehouse. Does it really make sense for the whole warehouse to be stale if some unimportant source is really stale? We have a lot of marketing people making decisions on millions of advertising dollars, so having fresh data is very important for that. The second thing was we needed better workflows: we don’t use an orchestration tool, and we felt like we really needed a proper orchestration tool to handle this.

So maybe there’s a small feature request in here, to build freshness checks into dbt runs. But anyway, freshness checks are probably where we’re heading now, because going back to fix stale tables is a total [00:20:00] nightmare. And we’re probably now at a scale where stale but correct data is more valuable than super fresh, sometimes correct, but potentially wrong data.
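For reference, declaring source freshness in dbt looks roughly like this in a sources YAML file, which the dbt source freshness command then checks; the source names, columns, and thresholds here are illustrative.

```yaml
# models/sources.yml (illustrative)
version: 2

sources:
  - name: events
    loaded_at_field: inserted_at
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: signup_completed
      - name: authentication_events   # inherits the source-level freshness config
```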

What if your incremental model run still takes too long? Or what if your database runs out of memory? It’s usually pretty sensible to set a query timeout limit on your data warehouse, as it helps stop any completely awful SQL scripts, which may or may not have been written by analysts, from burning a lot of money.

And so usually what we’ll do next is look into the query plan and see where the warehouse optimizer is struggling. So here’s an example of a query plan from Snowflake. There are a few things worth pointing out when you look at one of these. First, on the top right here, we can check which operation most of the time is being spent on.

So in this case, the most expensive node is doing a table scan, so 57% [00:21:00] of the time is spent there. And we can also check how efficiently the table scan is doing by checking how much it’s pruning in the sidebar. Pruning refers to how many partitions of data the warehouse optimizer actually needs to read from the source table to execute your query.

So in this case, we have about 37 million partitions and we’re only scanning about 40,000 of them, so it’s pruning pretty well. So it’s probably not a problem with how this particular source table is partitioned. The other thing worth checking is the memory spillage here in the sidebar.

So spillage means that the machine doing the compute doesn’t have enough memory, so it might have to spill to storage to finish the computation, which makes it a little slower. In this case the amount of spillage, and we’re really just looking at remote spillage here, is pretty small, like less than five gigabytes, so it’s not really worth worrying about.

[00:21:59] Incremental materialization #

Vincey Au: So in [00:22:00] this particular case, it’s not really a problem with our SQL at all. Now, remember that an incremental materialization has two core steps: first it creates a temporary table, and then it merges this temporary table into our existing table. And it turns out that sometimes it’s actually the merge that takes a really long time.

So in this case, we have a table that took an hour and a half to build the incremental temporary table, which then took an additional hour and a half to merge into the existing table. And the reason it takes so long is because for each new row we want to insert, we also have to check the existing table to see if it already exists, and only then do we add it.
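For context, the merge that an incremental run issues when a unique key is configured looks roughly like this. It is a simplified sketch: the real statement is generated by the adapter, and the table and column names here are illustrative.

```sql
merge into analytics.fct_active_events as dest
using analytics.fct_active_events__dbt_tmp as src
    -- every incoming row has to be matched against the existing table first
    on dest.event_id = src.event_id
when matched then update set
    user_id = src.user_id,
    event_time = src.event_time
when not matched then insert (event_id, user_id, event_time)
    values (src.event_id, src.user_id, src.event_time);
```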

And this matching process takes a really long time, especially when your existing table is humongous. And beyond partitioning your table on the unique key of the incremental build, there’s not really that much you can do once you [00:23:00] have that much data. Which leads me to the brand new materialization we built specifically to avoid needing to merge. We call it insert, and all it really does is run the SQL and directly insert the results into the existing table.

So for those of you who are seasoned dbt users, you might be wondering how this is different from the delete and insert strategy. There are two key differences. Firstly, delete and insert will still create a temporary table and then insert the results of that into the existing target table, whereas our insert materialization writes straight into the existing table. The second is that delete and insert still works with the unique key, which means it can delete data from the existing table and avoid having duplicate data, whereas with insert you could end up duplicating data. The trade-off, though, is that you don’t need to search your existing data to see if a row already exists.[00:24:00]
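A sketch of what a run of this custom insert materialization effectively executes; it is a reconstruction from the talk rather than one of dbt’s built-in strategies, and the names and cutoff are illustrative.

```sql
-- no temporary table and no merge: the compiled model SQL, with its own
-- incremental cutoff, is appended straight onto the existing table
insert into analytics.fct_active_events (event_id, user_id, event_time, event_source)
select
    event_id,
    user_id,
    event_time,
    event_source
from analytics.stg_events
where event_time > '2021-12-08'::timestamp;  -- cutoff computed by the model's incremental logic
```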

And so, pros and cons again: it was really great for our really big source tables, so we got to reduce the build time from something impossible to something we can actually put into our daily build, because there is no need for a merge. The cons are that you can’t update existing records, because there is no longer any merge condition, and you’re heavily relying on your incremental logic being correct.

And so we talked through a lot of little incremental strategies in this talk. First, we talked about how to handle late-arriving data using the incremental materialization, and we looked at how to calculate the optimal lookback window and standardize this logic in a macro.

The second thing we talked about was the tricky situation where you have a model that relies on multiple sources which may not be equally fresh. We learned about the min-max solution, and also about using the inserted_at metadata instead of the event time. And both of these are really just small fixes that push the problem down another [00:25:00] layer, which may be enough.

And we also said that using source freshness checks is probably the ideal way to handle this. And finally, when your data is just too big, you can make the trade-off of dropping the unique key checks and instead directly inserting data into your existing table.

[00:25:23] When to use these strategies #

Vincey Au: And so when do you use these strategies? There are three situations where you might want to. The first is when you decide your warehouse costs way too much to do what it’s already doing. The second is when you’re growing so much that your data is so big that you have to. And thirdly, even if your organization is much smaller, if your organization is comfortable with having close-enough data rather than perfectly correct data, then you might also want to use these.

So we’ve solved a lot of problems that [00:26:00] have come up because of incremental models, but I’m sure more and more new problems are guaranteed to come up, so it’s a bit of a never-ending struggle. But hopefully, knowing some of these strategies that you can use and understanding their trade-offs, you’ll have a better idea of how to approach your own problems.
