
Modeling event data at scale

Paul is a software engineer, delivery lead, and engineering coach. You can usually find him building and designing SDKs and libraries for developers. He’s currently studying for a Master’s degree in Data Science, Technology, and Innovation at the University of Edinburgh.

Data modeling is key to leveraging your user behavioural data to serve many data use cases across your business, such as marketing attribution, product analytics or recommendations.

However, this type of data’s scale poses challenges to modeling it effectively. This talk will explore the common challenges companies face when modeling hundreds of millions or even billions of events, and how to solve them. We’ll discuss the differences between the cloud data warehouses as well as incrementalization, performance tuning and testing.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

[00:00:00] Amada Echeverria: Welcome everyone, and thank you for joining day four at Coalesce 2021. My name is Amada Echeverria and I’m on the community team at dbt Labs. I’m thrilled to be hosting today’s session, Modeling event data at scale, presented by Paul Boocock. Paul is the head of engineering for Data Engine at Snowplow. Paul spends a lot of time thinking about OSS, data, ethics, and privacy.

In his spare time, he can be found watching Formula 1 on the weekend or out on long walks with his dog in the English countryside. In this 30 minute session, Paul will explore the common challenges companies face when modeling hundreds of millions or even billions of events and how to solve them. He’ll discuss the differences between the cloud data warehouses, as well as incrementalization, performance tuning and testing.

Before we jump into things, some recommendations for making the best out of this session. [00:01:00] All chat conversation is taking place in the #coalesce-snowplow channel of dbt Slack. If you’re not yet a part of the dbt community on Slack, you have time to join now. Seriously, go do it: visit getdbt.com/community and search for #coalesce-snowplow.

When you arrive, we encourage you to set up Slack and your browser side-by-side. In Slack, I think you’ll have a great experience if you ask other attendees questions, make comments, share memes, or react in the channel at any point during Paul’s session. And if you’re new to the dbt community, or this is your first time at Coalesce, please don’t be too shy.

It’s important to hear from new voices, so please chime in if you feel comfortable doing so. To kick us off, our chat champion, David Burden, a senior software engineer at dbt Labs, has started a thread to break the ice. So let us know who you are, where you’re calling in from, what scale of data you’re currently operating with, and what it will look like in the future.

After this session Paul will be [00:02:00] available in Slack to answer questions. Let’s get started. Over to you, Paul.

[00:02:06] Paul Boocock: Thank you very much. It’s nice to be here today, and thank you very much for inviting me. So I might not be the face that you were all perhaps expecting to see today. You may have signed up to a talk by one of my colleagues; unfortunately Will hasn’t been able to make it today.

He was in a little bit of an accident yesterday, so I’m filling in at the last minute. So please bear with me; this material is pretty fresh to me. But I really still want to shout out Will for the amazing work he’s done in putting these slides together. I’m just going to reap the reward now, and hopefully I do them justice as I present them to you.

So I’m Paul Boocock. As I was introduced, I’m head of engineering; that’s within the Data Engine function at Snowplow, and that’s really quite a broad part of the Snowplow pipeline, all the way from trackers [00:03:00] through to the core pipeline itself, which does all the data enrichment and validation, and including the dbt models that we build on the other side of the pipeline to activate on those use cases.

And when you’ve done that collection: 50 million. That’s a reasonable number, it’s quite a big number, and it’s a pretty average size, I’d say, of daily events that we see being collected in Snowplow pipelines. But when we start getting into these numbers of millions, potentially even billions, we start to hit some of the implications, or issues, that we typically don’t see when we’re dealing with lower volume data collection.

So today we’re going to look at some of these implications of seeing these large volumes in our warehouses, and we’re going to look at how we want to adapt our dbt modeling strategies in response to these ever-growing event volumes. [00:04:00] So we’ll dig straight in. Now I think it’s important for us to start by really going back to the core of what we’re tracking here.

[00:04:11] What is behavioral event data? #

[00:04:11] Paul Boocock: And we call this behavioral event data. The events that we’re talking about today are really events that are captured when a user is on a website, typically, but not always a website: it could be a mobile device, it could even potentially be happening server-side with some sort of server-triggered action, maybe a transaction completing or something.

Each of these actions that are done by a user or service happens at a particular time, and they have corresponding state as well when the event occurs. So there’s usually an event name, it might have some properties, and often there’s additional state with each of those events as well. So in the examples here, for instance, you may open an email; that’s a user action.

So the event is "open email", and maybe there’s some extra context information. So we capture events, and we [00:05:00] call this extra context on an event; we say we capture extra entities for every event. So for instance, an entity might be whether an email has an attachment. If you do a page view, a classic thing on a website, we might be capturing things like the browser version, for instance, and additional browser information.

If we’re tracking on a smart TV, maybe a play or a pause or a new video loaded, and maybe navigating around the UI of the app, we’re going to be capturing things like the TV’s firmware, maybe the operating system, maybe it’s a Roku device or something; we’d be capturing extra metadata about that.

This will help us understand more deeply what’s going on, not just with what the user’s doing, but with the extra context around what they’re doing. And then maybe things on mobile, like logging in with apps, and we might have some geolocation information too in that example.

So not only do we have all of our core events building up to that number, we’ve also got all of this extra context information that’s [00:06:00] attached to all of these events. That causes us to not only have an entity for the event itself, but all these extra attached pieces of information, which just increases the scale of what we end up seeing in the data warehouse.

So we have to consider scale even more. When one event could have 10 entities attached, that’s something like 11 pieces of information we need to think about, and if you’re doing 50 million events a day, all of a sudden maybe we’re at 500 million rows in our data warehouse. So that’s really where data modeling comes in.

[00:06:37] What is data modeling? #

[00:06:37] Paul Boocock: We have this raw, atomic data: all of those events that are arriving into our warehouse. And the data modeling process is about applying business logic to that raw data, to take it to a place where it’s easier for analysts and all the other users of the data in the warehouse to extract business meaning from that data.

Having the raw [00:07:00] data is incredibly powerful. It allows you to trust your data, and it often allows you to quickly run ad hoc queries, but it quickly, as we’ll see shortly, hits limitations. So this idea of data modeling allows us to wrap up our business logic, as well as having a number of other benefits that we’ll see in a minute, to take that raw data and make it more useful.

[00:07:25] Case study #

[00:07:25] Paul Boocock: So we’re going to imagine a case study today with our friends, Plant Partners. We can imagine here we’ve got an online retailer specializing in selling house plants, founded in 2021, so this year. It’s an e-commerce site, but it also has a blog.

On the blog, they’re also producing content about plants, so articles about plant care, things like that. And then they’ve also got the e-commerce side, selling plants, seeds, and gardening supplies. [00:08:00] So we’re doing 2 million gross revenue in 2021, but we’ve got these lofty growth ambitions. Currently we’ve got 50 staff within our organization, so a reasonable staff size, and, probably quite classically, a reasonably small data team.

But actually, maybe impressively, even with just 50 staff we’ve got some data team involvement there. So we’ve got a small but dedicated data team. We’ve implemented some sort of event tracking on the website, which is allowing us to better understand our customers’ behaviors.

So it’s web tracking: things like page views and clicks, interactions on a page, purchases or e-commerce transaction information, and we’re capturing all of that as well. Currently the integration on our site is generating 20,000 events. But that’s not where we might want to stay.

So as the years progress, we want to be growing our daily event volume significantly, because we’ve got these lofty growth ambitions. If there’s one thing I’m [00:09:00] going to criticize Will’s slides about, it’s this graph; I apologize for the slightly logarithmic scale of some description. But hopefully you can see that we get up to 50 million events a day in a few years.

So we’re going to have to start thinking more and more about our strategy for how we want to query this sort of volume as our company grows in maturity when it comes to our data. A very typical first use case, one that people would tackle on any e-commerce website, would be something like marketing attribution.

[00:09:34] Marketing attribution #

[00:09:34] Paul Boocock: So we’ve got this budget in our plant company to drive customer acquisition, and the data team has been tasked with optimizing it. What we could do, [00:10:00] and what many companies would start with doing, is writing ad hoc queries to transform this event data into some sort of marketing attribution model.

It’s quite a simple sort of example here, but we can see we’ve got these raw events coming in. So we’ve got page views, we’ve got people clicking on links, and we can see an event ID generated on each separate event, we’ve got different timestamps for when they’re happening, and we’ve got the same user ID.

So we can stitch these together on the user, and we’ve also got the same session ID. This might be something slightly more complicated in reality, typically, but you get the idea here. And then we’ve also got the marketing fields as well, so we can see where they were initially attributed from.

We then might aggregate this up to something like this. This has now been done at the session level, so we can take this session ID and easily see all of the events which occurred for every session ID. We can take the first timestamp, the min timestamp, and the max timestamp, and we can find the user ID that all of these sessions had. We can then take the marketing parameters, and we can aggregate [00:11:00] up how many page view events there were in the session.

And we can see, was there a conversion. Then we know that this user converted, because that conversion event has been fired. So we can take that and do that sort of thing, and that’s a pretty classic use case. So maybe every time we want to run the reports, we have this ad hoc query, we run it, we generate this output, and we use that to derive some sort of meaning for our marketing.
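The query on the slides isn’t reproduced in this transcript, but a session-level roll-up along the lines Paul describes might look something like the sketch below. All table and column names are illustrative assumptions rather than Snowplow’s actual schema.

```sql
-- Illustrative ad hoc query: roll raw events up to one row per session.
-- Table and column names are assumptions for the sketch, not a real schema.
select
    session_id,
    user_id,
    min(collector_tstamp)                                        as session_start,
    max(collector_tstamp)                                        as session_end,
    min(mkt_source)                                              as mkt_source,
    min(mkt_medium)                                              as mkt_medium,
    count(case when event_name = 'page_view' then 1 end)         as page_views,
    max(case when event_name = 'transaction' then 1 else 0 end)  as converted
from atomic.events
group by session_id, user_id
```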

However, as we continue on that journey, our use cases are going to keep increasing. Marketing attribution is often the first, but it’s rarely the last use case that a data team will come across. And that event data is really powerful for building things across all of these different use cases.

[00:11:43] Increasing use cases #

[00:11:43] Paul Boocock: These are three really popular use cases we see. Funnel analytics: has a user gone through certain stages when they’ve been visiting the website, maybe within a session or within the lifetime of the user. Content analytics: [00:12:00] we have that blog I mentioned where they’re publishing material.

They want to know scroll depth: how many people are reading the entire blog post before they leave the page and maybe do the next page view. And then another really cool one, and one we see quite a lot of people thinking about with Snowplow in particular because of its real-time nature, is cart abandonment.

So we have people putting things into their cart and then not completing the transaction. You may have visited a website where they send you an email, maybe 10 or 15 minutes after you’ve left their website, saying you’ve left some stuff in your basket, would you like to come back and finish your purchase later?

And you can model that, potentially every 10 minutes: you could look at the last window of events, look for people that have added things to their basket but then never completed the transaction, and if so, send them an email.
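As a rough sketch of that kind of windowed check (event names, table names, and the ten-minute window are assumptions; the date functions follow Snowflake/Redshift-style syntax):

```sql
-- Find users who added to their cart in the last window but never purchased.
-- Event names, table names, and the window size are illustrative assumptions.
with recent_events as (
    select user_id, event_name, collector_tstamp
    from atomic.events
    where collector_tstamp > dateadd(minute, -10, current_timestamp)
)
select
    user_id,
    max(collector_tstamp) as last_seen
from recent_events
group by user_id
having count(case when event_name = 'add_to_cart' then 1 end) > 0
   and count(case when event_name = 'transaction' then 1 end) = 0
```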

[00:12:47] Issues scaling #

[00:12:47] Paul Boocock: So as we grow into all of these use cases, these ad hoc queries that we’re writing, maybe to solve some of those use cases, become harder and harder for us to stay on top of. The biggest issue usually appears as our use [00:13:00] cases become more complicated and as our warehouse tracks more data; remember, a couple of years down the line we’re in the millions of events.

We’re going to start seeing reduced performance. So our query execution times might increase. Our queue times, as in, we’ve only got a certain number of queries we can run in parallel on our warehouse, may increase; we’re waiting longer to get a slot to run our queries. And equally, we might then start seeing increased query failure rates.

Those failure rates could be caused by memory limitations of our data warehouse, a timeout occurring because we’re in the queue for too long, or maybe there’s a timeout limit or something, or it just takes too long to run them, so the query times out before it succeeds.

Another thing that often happens is that these queries are going to get bigger and bigger. What’s interesting here is that these are two different problems that exhibit themselves depending on the data warehouse technology you’re using. So reduced performance is something I would typically consider a Redshift sort of [00:14:00] consideration.

We’re running on a cluster. That’s changing a little bit now with Redshift’s new serverless offering and RA3 nodes, getting a little bit technical there, but we still have to concern ourselves with the performance of the cluster, and less so about cost. We can scale up the cluster to get more performance, and we understand the cost implications that will have.

On warehouses like Snowflake and BigQuery, as our table scans get bigger, as our queries get more complex and we’re looking at or scanning over more data, and as we track more data, our costs are going to increase almost per query, particularly in something like BigQuery. And we’ve really got to keep an eye on what’s going on with our costs.

[00:14:39] Consolidation #

[00:14:39] Paul Boocock: So our ad hoc queries get more and more complicated, and our datasets get larger and larger. The answer here is to consolidate. A lot of those four use cases I described have very similar core things that they all need to leverage. Sessions are a great example: a lot of them are going to be based on sessions.

[00:15:00] So cart abandonment: when a session expires and the user has not made a purchase, then we’re going to send that email. We looked at attribution that was aggregated at the session level. Again, you might want to do content analytics at the session level. All of these things have the same sort of session aggregation as part of what they need to do to output their results.

So if we can aggregate those raw events up into different levels, we can significantly reduce the amount of data that we need to scan to actually serve our use cases. Page views is 20% here, as a typical size when going from raw events into an aggregated page views table, 6% if we then aggregate into sessions, and even less, maybe just 3%, when we get to the users table.
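The point of those derived tables is that downstream use cases then query them instead of the raw events. A hedged illustration, with made-up table and column names:

```sql
-- A marketing report reading the much smaller derived sessions table rather
-- than scanning atomic.events. All names are illustrative; dateadd() is
-- Snowflake/Redshift-style syntax.
select
    mkt_source,
    count(*)        as sessions,
    sum(converted)  as conversions
from derived.sessions
where session_start >= dateadd(day, -30, current_timestamp)
group by mkt_source
```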

[00:15:46] Incremental models #

[00:15:46] Paul Boocock: Now we can take that one step further. Creating those base aggregations can save a huge amount of cost, and it can also save a huge amount of time. But how do we actually build those tables? As that raw events table grows, [00:16:00] the aggregation into page views, sessions and users also grows and gets bigger, and that query takes longer to run and gets more expensive.

There are two ways to do this: we either drop and recompute, or we build an incremental model. If we drop and recompute, we’re throwing away those three tables every time, and we then end up rebuilding those three tables every single time. That’s going to become slower and slower as we get more and more data in that atomic raw table at the bottom.

However, if we instead think of an incremental model, where we only try to add the data that arrived since the last time we ran it, then we can basically solve all of those problems. The size of our queries becomes relatively stable; it only grows in line with how many new page views we have for each run.

We’re not having to recompute all of history every single time. [00:17:00] So hopefully you can see this code; I’ve tried to make it as big as I could fit on the slide, and please tell me if you can’t and I’ll pop my screen share off to help make it a little bit bigger. But in a drop and recompute model,

every time we just drop the table and then we do our computation. So in this really simple example, we select the page view and a lot of other properties, and we’re going to do an aggregation looking at the page view ID, looking at, sorry, the index of the page view based on the session.

So this is basically the index of the page view within the session. We do it as a count of the page view ID and when it happened, and we order by the timestamp. So we’re getting the page view’s in-session index from our events table where the event name is page view, and we can just drop and recompute it every single time.
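The slide itself isn’t in the transcript, but a minimal dbt version of a drop-and-recompute page views model, with illustrative column and source names, might look like this:

```sql
-- models/page_views_drop_recompute.sql
-- Minimal sketch: rebuild the whole page views table from the raw events
-- on every run. Column and source names are assumptions for the example.
{{ config(materialized='table') }}

select
    page_view_id,
    session_id,
    user_id,
    collector_tstamp,
    row_number() over (
        partition by session_id
        order by collector_tstamp
    ) as page_view_in_session_index
from {{ source('atomic', 'events') }}
where event_name = 'page_view'
```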

Alternatively, we can build something a little bit more complicated. We can look at an incremental model, and the big difference is, well, this bit’s relatively similar down here, [00:18:00] but in the middle we’ve got this CTE, sessions with new events, and the name gives it away. What we’re looking for here are any sessions that have seen a new page view since the max, the last timestamp, that we’ve seen. So we look in this table and we check the largest timestamp that’s currently in this table, and then we only include sessions that have had page views after the last time we ran our model. Hopefully that makes sense.

We can then join that into our model, and it limits down the number of page views that we need to reconsider.
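The exact slide code isn’t reproduced here, but the pattern Paul describes maps onto a standard dbt incremental model. A hedged sketch, assuming illustrative column and source names:

```sql
-- models/page_views.sql
-- Sketch of the incremental approach: find sessions that have received new
-- page views since the last run, re-aggregate only those sessions, and let
-- dbt upsert the result on the unique key.
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (

    select distinct session_id
    from {{ source('atomic', 'events') }}
    where event_name = 'page_view'
    {% if is_incremental() %}
      -- only sessions with events newer than anything already in this table
      and collector_tstamp > (select max(collector_tstamp) from {{ this }})
    {% endif %}

)

select
    e.page_view_id,
    e.session_id,
    e.user_id,
    e.collector_tstamp,
    row_number() over (
        partition by e.session_id
        order by e.collector_tstamp
    ) as page_view_in_session_index
from {{ source('atomic', 'events') }} as e
inner join sessions_with_new_events as s
    on s.session_id = e.session_id
where e.event_name = 'page_view'
```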

[00:18:46] Process the least data possible #

[00:18:46] Paul Boocock: There are a few things to watch out for here, and we’ll take a look at the CTE, sessions with new events, that I showed you a second ago. We want to ensure that this filter, this incremental filter, is actually [00:19:00] on the partition or the sort keys of your source. So in BigQuery, we have partitions.

In Redshift, we’ll have sort keys. Now, a lot of Snowplow users use derived timestamp as the timestamp for that sort key; potentially a better timestamp is actually to use collector timestamp, because that’s guaranteed to be in a better order. Derived timestamp can change a little bit.

I won’t go into what the derived timestamp is, but it’s not as consistent as the collector timestamp. So switching that out for collector timestamp is often a safer bet; when it comes to the crunch, you will miss fewer events in your incremental model by using collector timestamp.

We also want to consider restricting table scans. In a warehouse like BigQuery, we’re going to get more and more data in our scan as time goes on, because we’re still querying from the events table, and we [00:20:00] don’t want to scan all of it all the time. So instead we try to limit our table scans; that’s this section at the bottom here.

So in this example here, we’re just looking back three days. We’re only going to include sessions that have had page views within the last three days. This limits the amount of data we need to scan back over to check for new page views within a session before we then reprocess it.

This is a huge cost saving. And this number here is something you can configure depending on your website: how often do people revisit, and things like that. Some people will want that as a week, some people want just a day. It’s a small decision to make, but it can have big implications on the cost of the models that you’re running.
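In dbt terms, that lookback is just an extra predicate on the partition or sort key inside the CTE. A sketch, where the three-day window and the collector_tstamp column are assumptions you would tune per site:

```sql
-- Bound the scan of the events table with a lookback window on the
-- partition / sort key. dateadd() is Snowflake/Redshift-style syntax.
select distinct session_id
from {{ source('atomic', 'events') }}
where event_name = 'page_view'
  -- never scan further back than the configured lookback window
  and collector_tstamp > dateadd(day, -3, current_timestamp)
  {% if is_incremental() %}
  -- and only pick up sessions with events newer than the last run
  and collector_tstamp > (select max(collector_tstamp) from {{ this }})
  {% endif %}
```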

And maybe even the performance. And lastly, and I’ve touched on this all the way through: understanding your warehouse. Understanding exactly how you set up your sort keys, what your tables are partitioned by, and understanding the cost model and the [00:21:00] performance model of your warehouse. Redshift, BigQuery and Snowflake are typically the three that we see, and maybe even PostgreSQL, and they all have very particular performance characteristics.

PostgreSQL is often useful for low event volumes. Really understanding what’s going on there is a little bit out of scope for this talk, and for me to go into in too much detail, but getting to grips with the intricacies of your warehouse is really important. And we’ll see that the Snowplow web model for dbt considers the different warehouses in terms of how it wants to run and execute across them.

[00:21:38] Performance tuning #

[00:21:38] Paul Boocock: Now there are little things you can do in terms of performance tuning. Some of these might be obvious to those of you that have been writing SQL queries for a long time, maybe less so if you’re newer to this. Not joining a table to itself is a classic anti-pattern that we want to try and avoid, and ensuring

you’re picking the right partition and sort keys. I talked about derived [00:22:00] versus collector timestamp; that can be important if you want to make sure your incremental model doesn’t need to be dropped and recomputed every now and again, and that you can trust it on an ongoing basis. And remember that the lookback window is potentially going to miss some data, but it’s hopefully going to capture 99.9% of those page views; it might need recomputing occasionally. And be aware of the cost of what you’re running. So this is an example: the first query is really accurate, but very expensive, and often a slow query, because it’s looking back

over the whole window. It’s a particularly expensive way to find out the first value of the user ID. Maybe less accurate, but a lot cheaper, is just to look for the max user ID. So we’re always thinking about how much we’re scanning and what the performance characteristics of this query are.
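A sketch of the trade-off being described, with illustrative table and column names:

```sql
-- Accurate but expensive: a window function that orders every row in each
-- session to find the first user_id seen.
select distinct
    session_id,
    first_value(user_id) over (
        partition by session_id
        order by collector_tstamp
        rows between unbounded preceding and unbounded following
    ) as user_id
from atomic.events;

-- Cheaper and usually close enough: a plain aggregate over the same rows.
select
    session_id,
    max(user_id) as user_id
from atomic.events
group by session_id;
```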

Now that’s sometimes hard to spot, and a lot of that comes with experience, [00:23:00] and a lot of that comes with using the debugging tools that come with your warehouse, looking at the cost profile. In BigQuery in particular, you can see the price of your query. So considering that, and then optimizing your query to try and bring that down, maybe by being less accurate but approximate enough, can often be one step towards achieving improved performance or cost performance.

[00:23:27] Choosing an upsert strategy #

[00:23:27] Paul Boocock: Another thing you can do is choose the right upsert strategy. I won’t spend too much time on this, but ultimately you’ve got your target table and you’ve got your source table, and we’re going to try and end up with a new resulting table when we’ve had this incremental run. So we’re picking the right strategy to ensure you’re upserting your table with this new information, and you’ll see that they both have C here,

but we want C1AA to be what ends up in our resulting table. Picking the right strategy to achieve that can also yield additional performance gains. [00:24:00] So in dbt, there are several native strategies for you to pick from. We’ve got the merge strategy, which is very easy to implement and updates and then inserts,

but it’s a reasonably unperformant upsert. I think that’s pretty typical for most things: easy to implement, a little bit less performant. Or you can have the insert overwrite technique, a very performant upsert, but it replaces entire partitions. Or perhaps we could even have the best of both worlds.

So if we look into the Snowplow model, we’ll see that we’ve tried to balance this with a new type of incrementalization based on some of the principles that I’ve briefly spoken about today. We’ve tried to limit the scans on the destination table and sit somewhere between the insert overwrite and the update-and-insert techniques, to ensure that we only process the least required data.[00:25:00]
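In dbt, these choices map onto the incremental_strategy config. A hedged sketch: which strategies are available depends on the warehouse adapter, and the model body here references a hypothetical staging model.

```sql
-- Sketch: choosing an upsert strategy for an incremental model.
-- 'merge' is easy to implement, but the merge statement may scan a lot of
-- the destination table. On BigQuery, 'insert_overwrite' plus a partition_by
-- config replaces whole partitions instead, keeping destination scans small.
{{ config(
    materialized='incremental',
    unique_key='page_view_id',
    incremental_strategy='merge'
) }}

select *
from {{ ref('page_views_this_run') }}   -- hypothetical model for this run's rows
```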

[00:25:01] Current incremental architecture #

[00:25:01] Paul Boocock: So what does this look like in the Snowplow packages? There’s now a Snowplow web model for dbt, and we’ve taken a lot of the learnings that we’ve had over the years and brought those into our new incremental model. So we’ve got this architecture now that leverages all of those learnings and allows us to build something that performs well and considers the cost as well

when we’re running in different warehouses. So, some benefits here: this incremental model that we built, and I’ll get into it in a little bit more detail on the next slide, is easy to work with, it’s fully incremental, it takes all of those learnings into it, and it isn’t going to throw away data beyond that sort of scan limit that I’ve described.

And it’s all done in native dbt. There are a few cons: we are going to scan some of the same data again, so we’ve got to identify what that is, which can be a little bit [00:26:00] intensive, and in this case we’re only looking at page view events. And it is still very tricky. I mentioned that you might want to build adjacent use cases off of some of these tables.

It’s quite tricky to build additional incremental models off of these derived tables, just because of some of the complexity that’s happening around here. Not impossible, but you really have to understand what’s going on under the hood. Hopefully I’ve touched on some of that today.

But diving into the Snowplow web dbt package is, I hope, a great learning opportunity for a lot of people. It’s obviously all open source, so you can go in there, take a look, and see how we solve some of these problems in this relatively complex space where we’re dealing with millions of events.

[00:26:50] Redesigned incremental architecture #

[00:26:50] Paul Boocock: So, that’s what we had before, and that’s still the core, right? We’ve still got this, and it was all incremental. But the extra [00:27:00] complexity now comes in when we start looking under the hood a little bit further, and we see that we’ve got this raw events table. What we now do is we calculate the events for this run.

So these are all the events that we need to reprocess. We store the latest timestamp that we saw, and we use that on basically every run to check; I mentioned the max timestamp thing earlier, and that’s basically how we do this. We’ve got this metadata table over here, and then we take these events that we need to process.

We drop and recompute these yellow tables. So we then look for the page views that were a part of this run, and then we use this upsert strategy to upsert into page views, rather than trying to figure out all of the incremental logic like we previously did here. So it’s the same fundamental stuff, but we’ve introduced these additional tables to try and simplify some of this, reduce the querying, and make it more performant by generating these [00:28:00] drop-and-recompute "this run" tables.

So we’re not reprocessing too much information. We then take the page views, we generate sessions, and we upsert that. And then we actually use the session information to build a users-this-run table; we use that derived table. So there’s an example here of using one of the derived tables to build another derived table.

So we end up with page views, sessions and users, the same tables, but this hopefully makes it a little bit easier to take part in this architecture. There are points of extensibility now in here that maybe were previously harder to spot with the earlier incremental architecture.
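A very rough sketch of that flow, with hypothetical model names (the real snowplow-web package is considerably more involved): an events-this-run staging model is aggregated into a page-views-this-run table with a plain drop and recompute, and the derived page views table is then just an upsert of that small table.

```sql
-- models/page_views_this_run.sql (hypothetical name)
-- Drop and recompute over only the events identified for this run.
{{ config(materialized='table') }}

select
    page_view_id,
    session_id,
    user_id,
    min(collector_tstamp) as start_tstamp,
    max(collector_tstamp) as end_tstamp
from {{ ref('events_this_run') }}   -- hypothetical model holding this run's events
where event_name = 'page_view'
group by page_view_id, session_id, user_id

-- The derived page_views model is then simply an incremental model with
-- unique_key='page_view_id' that selects * from page_views_this_run.
```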

However, this is pretty complicated now. So whilst there are these extensibility points in it, from our point of view, to maintain, this is actually pretty tricky. Luckily for you, you just get to reap the rewards of this architecture. [00:29:00] But it’s worth knowing if you want to contribute or get involved.

[00:29:04] Summary #

[00:29:04] Paul Boocock: So this section is pretty gnarly, but if you’re a bit of a dbt power user, do go in and have a look at this model; I hope you’ll find it particularly interesting. So, in summary, and I’m mindful of time: this event data can solve a huge array of business needs. Those were just four use cases.

We did a little exercise a while back and identified around 39 different use cases that were pretty popular and common uses of behavioral data. A lot of them do require this aggregation, and a lot of them have commonality, so building out this set of derived tables is really useful.

We’ve got separate models for web and mobile; the mobile one should be landing in dbt pretty soon. And then beyond there, we can maybe think of more specific use cases as well: content analytics, e-commerce, video, things like [00:30:00] that. And hopefully you now have an idea that at some point drop and recompute just doesn’t really fly anymore.

You’re going to get to a place, as your company grows, where incremental processing is something you need to think about, something you definitely need to do. Things to always strive for, always in the back of your mind, whatever you build: minimize the number of queries against your events table, process the least amount of data possible, and optimize your queries wherever you can. Get peer reviews.

It’s amazing: just because the experience we each have is a little bit different, we all know different little tips and tricks. So yeah, always get those peer reviews. Make sure you pick the right upsert strategy for your needs, for what you want to achieve, and for where you’re at in terms of your data maturity, and then consider what incremental architecture is best suited for you.

There are a number of ways of doing this. We’ve gone for quite a complex approach, but there are simpler approaches as well in terms of the way you can build an incremental model. It [00:31:00] really is a bit of a maturity curve, right? You’re probably going to get to a complicated place eventually, but you don’t have to start there.

You can just go on that journey.

And that is everything from me today. So I’d like to thank you all for listening; I hope that was okay. I only found out this morning that I was going to step in and do this, so I hope you enjoyed it. I can see lots of questions in Slack, so I’m going to head over there now and start answering some of those.

Thank you very much.
