Table of Contents

Introducing the activity schema: data modeling with a single table

Discover more about author and dbt enthusiast Ahmed Elsamadisi.

Ahmed will present a new data modeling approach called the activity schema. It can answer any data question using a single time series table (only 11 columns and no JSON).

Instead of facts and dimensions, data is modeled as a customer doing an activity over time. This approach works for any business data used for BI.

Follow along in the slides here.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

[00:00:00] Amada Echeverria: Hello everyone. Take two here. My name is Amada Echeverría and I’m on the community team at dbt Labs, and I want to welcome everyone and thank you for joining us at Coalesce 2021 live and un-muted I’ll be your host for today’s session.. Introducing the activity, schema data modeling with a single table presented by Ahmed Elsamadisi. Ahmed is the CEO of Narrator and has spent his life obsessing over how people and machines make decisions from building AI algorithms to help block nuclear missiles for Raytheon, to getting Wework sales org to understand that month over month metrics are terrible.

He is here to share with you his solution to data modeling and is very well aware that claiming that a single 11 column table to answer any questions, it sounds like BS his words, not mine. And more on that head. Recently, Ahmed was recognized as one of Forbes, 30 under 30 in the enterprise technology category.

And this 30 minute session Ahmed will [00:01:00] present a new data modeling approach called the activity schema. He will cover structure how it works in practice, how it makes your warehouse cheaper and much faster, how it really can answer any possible data question. The list goes on.

Let’s get started over to you, Ahmed.

[00:01:16] Ahmed Elsamadisi: Awesome. Hey everybody, I’m in here. And I’m really excited to share with you guys about the activity schema, single table approach to answer any. Just as Amada mentioned before this, I built that we were data team and infrastructure. So there’ll be some stories from WeWork and before the sword from the Raytheon as clergy.

Yeah, I engineer on missiles. So I’m really excited to share the activities and how to answer any question or really push everyone to start posting the questions that they have answered in the dbt Slack, because. W I’m going to try to make sure that I touch upon a lot of these questions and I’ve had a lot of experience proving that the activities give them.

I can answer questions since I’ve been doing this for years. So let’s like [00:02:00] make those questions really hard because I know that this audience is skeptical and are going to put us to a test. So let’s get started. So activities really started with the idea of a single source of truth because every company has gotten in a circle and said, let’s build that single source of truth.

This has happened many times and we all love it. And building a social truth is actually easy, right? Just make sure that every single piece of data is in one data model and not duplicated across all the data models, put your BI and visualization tools on top of those single source of truth, data models.

[00:02:35] Reality is not so simple #

[00:02:35] Ahmed Elsamadisi: And you’re done. And this approach, every work talk would be done now, but this approach doesn’t and want to share a story about why it doesn’t work, because the reality is really complicated. I remember at WeWork you probably might know him because he’s like now as TV star from all the crazy TV shows and documentaries about WeWork’s stories.

But Adam once brought me into the [00:03:00] office and he was really frustrated. Marketing was reporting data from sales. I knew he was having a total sales numbers be a number, and he would show me the report from the sales team. And it had a different number for total sales in February. Which one is correct.

Then he goes into panic mode. Guys, the numbers are all wrong. Data’s all wrong. Everything is all wrong. Everyone here has already experienced a case where one number not matching in one of the thousands of dashboards that we’ve built causing a lot of panic and stakeholders to go. Everything is wrong.

So why does this happen? And why is this question? Really? Actually a lot more nuanced than we think. Let’s take an example of how it, what it takes to build a single table. I want it to do a marketing visitors to sales. This is an example that everyone has experienced because these classic e-commerce situation, I want to know every time when somebody visits the site, whether they came from an email paid organic, and I want to know from that site visit, did they actually order seems very [00:04:00] common and all kind of tackle it.

Very similar. How do we do this? The first thing we do is we have our web session data and we have our sales data sessions coming from your snowplow or your segment and your sales coming from Salesforce or your Shopify. And you really got to combine it. So we’re going to use some nutrient parameters, hope, flea, that UTM parameters don’t duplicate, and you’re unique and they’re maintained.

Then we’re going to look at our emails and we’re going to bring the emails. How do I touch an email to what session? Maybe I’ll use 30 minutes. Activity 30 minutes of opening, more than 30 minutes, and I’ll dedupe it using not run number. So I can start creating this table. I’m sure that we’ve all kind of created these tables that have joined and done these really fuzzy joins.

And you can imagine these SQL queries get really complicated and really messy, and this is just one table in your warehouse. You were probably doing this thing like hundreds of time, and most tables have dependencies and layers of dependencies and layers and layers. [00:05:00] So this gets incredibly difficult to manage and it continues to grow our proportion.

What if we can make it simpler? What if actually the tables that we used and the joints that we used were actually the same. I know this sounds a little weird, but let’s go along with me for a second. What if each table that we had look the same, they just all have the same exact columns and the same exact structure.

And instead of these joints being very unique per single question and how we want to combine it, what if all the doors were just based on the customer? This would be really easy. You grab sessions, we’ll know when they happened and who happened to bring in the sales? If it happened the same person, whether the same session, bring an email.

If it happened right before it, then we can recreate this table. This approach is a lot faster and seems to be a lot simpler. Let’s take this further. What if we. All these tables, all [00:06:00] these little similar tables and make them into one table. Then we have one table and we’re just self joining based on customer time.

And then we can create any table we need. When the new question comes in, we can actually do this again and create another table. When you need a visitors to sales with emails, that single table that has your emails, session, cost data all together, self join it in that table. If you want to add, if the customer called us after the session, you can add that right away.

If you want to just do quarterly sales, you can do that right away. It’s all just a single layer of dependency. If you’ve done this before, please tell me the different tables you’ve created. Why you think that a single data layer will not actually work. So we call this single table, the activity. And we call each one of these time-based operators relationships and together, combining the activity, [00:07:00] schema and relationships, you can actually create any table you want sounds magical, but there’s a lot of value in that.

[00:07:08] Separate models and querying #

[00:07:08] Ahmed Elsamadisi: So the benefit of the value of this activity schema is that all the data that you care about is in one table, each one of these activities, you’re going to see it in a second. You can define independently because everything is coming from the activity schema. You’d probably have to worry about numbers, not matching many complex dependencies.

Everything’s really easy. And because you’re really just self joining to create the table when you need it, you will never need to go back and change the activities. You can have a defined them independently of how you use them. And this creates a really nice world. It creates single source of truth.

Actually everything is coming from one table. Numbers will always match. It’s super easy to create this table. And it just makes a lot of things faster. It’s a single [00:08:00] tables that have hundreds. There is a lot of challenges with the quality. You have the right to use it. We’ll dive into that later. This is activist chemo.

[00:08:07] Activity schema - a single table for all your data #

[00:08:07] Ahmed Elsamadisi: It’s a table of such a fund customer action. And time you can take your raw data, bring into this table and put all your materializations, your plus everything you want to do on top of it. So for example, here you completed order might come from your Shopify data and your email might come from SendGrid and MailChimp, and it’s super easy to just union and get altogether into this single.

And we often get this question just does it have to be a person? It doesn’t seem customers do company streams. These activity schema for scooters at WeWork. We had a property to see everything that’s happening at time. It’s really resembled how we all did people on not difficult. Talk about what’s going on in our business.

We talk about our core entity, the person, and they’re coming in ordering the viewing pages are acting they’re behaving, and we can slowly see how the. Evolve [00:09:00] now writing these transformations is really simple. You have to take the raw data from the sources, bring it into its unique. 11 comms, simply map it to those 11 columns and then you have activity schema.

And if you notice here, these activity, commerce are always coming from just very simple quirks. I know. What about the common identifier you’re telling me you want to get everyone to have a common identifier? No, that paid views often don’t have emails. They should have cookie and tickets might have submitted or IDs and emails might have emails, but all these sorts of it.

And we’ve thought a lot about this problem. You can read a lot about it, but the activities schema and their reader use a really key identity resolution approach with. It’s a whole talk of its own, to be honest, but it does multi-user multi-device in time recursive, consistent identity stitching to create a really unified customer.

So for now, just assume that we will able to do it. And if you [00:10:00] have any questions, it will be perfect there. So we’ll dive into how those how that customer is stitched on the dbt Slack. So feel free to post questions on that. We’d like to transformation. So we like to think about these transformations are very different.

It’s still sequel, but they’re very different kinds of sequel. It’s super simple. Most of our customers actually write these transformations in 14 minutes. And the average around 25 minutes, we’re not talking about thousands of lines and dependent on tables and tables on tables and tables.

We’re talking about very big. Query and all they have to do is define what something is. You often see that the definition is so simple. Most people just use that as a judiciary. You want to know exactly what the completed orders looking at 29 SQL query that represents the mapping of the data.

And because these things are often simple, there’s no conflict zone. You’re not really bringing multiple systems, just using this data. And combining it into activity, schema ends up being really fast. So you’re getting this really simple approach in a big speed. So hopefully so far, we’ve seen that you read the small, tiny [00:11:00] snippets.

They map into these building blocks. All these building blocks get union into this activity schema table. How do we use it? So crane activities is pretty unique. I like to think of it as like fancy pivots or like clever self domains. So you have this long table and you’re going down the long table and you’re using that to answer questions.

For example, the question we tried to answer earlier, which is want to know everyone visited the site. Did they have an order? And what the conversion rate from the site is, we want to note that by month, it’s actually pretty simple. We go down the activity schema and grab every single one visit for every single visit.

We just look did that same complaint order before the visit it again. If yes, I’m going to put a one, if no. So now we have table every visit and whether the person ordered or not, and simply I can count the number of rows in there to get the total number of visits. I can send the total number of been ordered and get the total number of orders.

I can just average it in order and [00:12:00] get the conversion rate you’re simply going down and it’s super easy. The good news is way of going down. The data is what we call relationships. It’s a time-based point. And Jay said in the slide. Everything ends up being 11. So we have 11 relationships.

So the 11 relationships are really reflecting on the way stakeholders and people talk about what’s happening. We’ve seen these relationships happen. Naturally still you’re looking at the first time ever a customer visited. Like I just want to know when they first started. You might want to look at when someone’s first after or first between or the last time before they ordered the last time they updated their contract.

And I want to know what that value of the contract was. So these little building blocks of these relationships can be used with this activity schema to really generate any table. And I would encourage people to go through and put some questions that they think can be answered to some tables that they need that can not be replaced [00:13:00] with time-based customer operators because we’ve been doing this for awhile and it turns out you can replace almost all the forties just by using these time-based operators. But let’s see how it actually works in real life. Like click. I want to go through a bunch of a couple of examples to show you how simple questions that complex, simple questions that as data people we know are complicated can easily be answered in this approach if you want to email attribution.

So what’s the best campaign that gets people to order for your email. So we have email in the campaign that this email’s been part of and did that email it to. This is a complex question, because you can imagine having to deal with clicks and UTMs and dah, and a whole lot of things, but a narrator, a user activity schema, actually super simple.

You just say, give me every open email and give me the first time in between those open email, the customer actually completed an order. That’s it. You instantly get a table that has every single email, the campaign with the order information, right on top. [00:14:00] One another, the best paywall to get subscribers, easy, grab every person’s subscription and see what the last, before paywall, they viewed just by looking at that customer behavior, we can easily understand how these things can generate these tables and from these tables and we can aggregate them, analyze them and do everything we need.

Another example is how does it take for someone to first come to a site to become. Here. We just got the first time they ever viewed the website and give me the first time after, if you do a pet that they submitted a lead and I can see how long it took to get there when it did and how many people actually submitted a lead and the conversion rates notice that it’s all these questions that I was answering.

We never thought about, Hey, where’s that viewed website data comes from, Hey, what is submitted lead? Do you see more building blocks that you’re just starts learning and using? And the language that you speak in. You don’t care about data source anymore view [00:15:00] website can come from seven different data sources to cover multiple.

We’ve seen actually many customers transition from like HubSpot to Salesforce and everything they’ve been doing is still submitted a lead. It doesn’t matter that it came from two sources, one or many different changing sources. It keeps it really simple. And we’re constantly speaking about things in time.

We never have to deal with these unique question of how do I, these two things together. This is a nice UI that shows you my name’s data, but how does it actually work? It turns out when you go to write the query to do this approach, which is why this approach isn’t really very common and why a lot of people don’t use activity schema is because writing is actual craze to do these time-based joins is extremely complicated.

You can imagine having to, if you want to do this first in between relationship. With equal, you would have to figure out when the next feed pay happens, then figure out training the complete order. Then did you pick and choose the first one instead of the, if there’s multiple count, them [00:16:00] all deal with what happens if there’s no deal with put a bunch of window functions to make sure that you don’t dupe or you choose the right one.

And if there’s no, you handle it. And it becomes a really messy query. Luckily for you, you don’t have. The whole point of this UI that I was showing earlier is that we can generate those questions for you. Those queries for you, you ask the question, you assemble those things, using the activities and the relationships.

And we will, auto-generate a very efficient query that runs on your warehouse that is super optimized and super fast. And when I say fast, it really means fast because there’s really. So summarizing kind of everything we’ve learned so far is that with this activity schema, you can bring all your data into a single table.

[00:16:47] What we learnt #

[00:16:47] Ahmed Elsamadisi: Really represented the way that us humans talk about ask questions and answer them where every row is a single activity happening to customer time. These things are super easy to write and super easy to debug because you just see this customer journey [00:17:00] and you can put everything you want on top of that activity, schema.

But by being able to combine the activities with these time-based operators relationships, you are able to generate any table, the guarantees, the numbers of that match. Cause it’s the same table. It makes it really easy to add more columns from different activities like dealing with slow changing dimensions or all these different aspects.

You can easily pull in feeds from other activities and everything can be broken down into this human readable format that is these building blocks, how they relate. Top of this whole thing, but having a single table, it is a very small table. Like a billion rows. There is like the gigabytes on your warehouse.

So you can scan this table really quickly. You can process the vertically. Everything is incremental. There’s a whole lot of additional values on your warehouse and time and costs. When you say for my district used to be initially implemented as like a time and money saving on warehouses, because they’re super optimized for this long tables.

So not only are you getting that, but you’re also [00:18:00] getting just a lot factor experience.

So then what, like when this whole thing happens you get a lot of. Changes in data and it makes it really exciting. If everyone used activity schema, you’re going to get a lot of benefits because every company is using the exact same table where they expect same structure.

You standardized all of data to give common guarantees and assumptions, and that allows you to share and review. For example, like analyses and algorithms, top of activity, schema can be used for multiple companies because every company has the same guarantees integrations you can. Now everyone can integrate with activity schema, which is a standard structure and really generate anything you need for automatic integrations with any of the common data sources.

So you really can create this world where things are shared a Salesforce, different sales. You can do this, or you can do, you can activity. Schema is doing it for. And I’m sure all the data scientists listen to stock are excited. Cause we spent, we write [00:19:00] algorithms are like bane of our existence is what was the assumption that guarantees that we’re getting to make sure that we understand the assumption that guarantees of the input so that the output behaves this is what activist activity would be in a, is this com input, which allows you to really create and reuse analysis.

And honestly, we see that daily with our product. Our customers are actually able to generate analysis in minutes by using automatic templates because of the activity schema standardize. We’re able to actually answer all these complex questions instantly just by, because of the standard structure. And every one of the worlds was to hopefully be able to do the same thing as more people adopt the activity, schema simpler data modeling incident analysis is just a better way of doing data.

[00:19:42] Q&A #

[00:19:42] Ahmed Elsamadisi: I hope really I got to, it really touched upon what makes that gives him a really special and how it works and all the building blocks. So if you’re really interested more into the activity schema and narrator, you can check out a website, you can reach out to me on dbt Slack. I’ll be here and I’ll dive into these questions and let’s get started with the [00:20:00] questions because a couple of the most common questions that I’m going to try to answer first is what happens if I have more than I can use. So activity schema is really relatively simple in querying, as you remember, it really just maps with three core features. So this is a very common question is two people have really just haven’t used activity in a lab. You can definitely imagine a world where you will. We call enrich the activity schema by creating a join on the activity ID and adding as to the features as you want.

The emergency case, but what we see is 95% of our activities used at, they don’t need three features because you can only pull in the features related to that activity. So for example, if you look at any of your tables, you might have like first time they came to the site last year, 10 parameters when the person converted all these different actions, you can just pull them in when you need to using these kind of building blocks and pull in [00:21:00] the first time they signed up.

Easily, because you can combine all the activity, you you can borrow the features from different activities and it turns out then you actually don’t have that many unique features per activity. So we see this always happening where you might want a bunch of dimensions, but it turns out because you can borrow the dimensions from other activities.

You’re just assembling the table you want instantly and fast. I hope that kind of addresses. The question of what happens when you want more than majors and most of the people that actually try it, you end up realizing that often you should only have three features, to be honest, it’s just because you get so pat in the habit of kind of keeping these activities simple, you can really just pull in anything as you need.

The activity schema is an open source project that you can learn. Activities can come in there. We talk about the how you write these kind of quirks that we see here. So we really introduced the way you query the activity schema and what these relationships [00:22:00] are and how they actually conceptually work.

The actual tool that we built on top of the open-source project is not open. The UI is what we is just to make that process faster. So we actually share a lot of a lot of our core approach and how you deal with activity, how you create activity, schema, how you use it, and how do you query it all in our open-source spec in get hub.

But when it comes to the nice UI, we’ve built that in make that process simpler. We found that a lot of our customers that want use activity team find that narrative. I just didn’t motivate a tool to make using it, creating a using activity schema just a lot faster.

So the. The question about activity stitching. So how do we understand, how do we deal with the fact that we get information later? Also the part of that could be schema is you’re not just inserting data you actually doing after you insert all the new data, you run a couple of [00:23:00] updates that processes the data in the past.

So there is a whole multi-user multi-device identity stitching that happens where you can actually go back and update the schema to fill in. Customer identifier as you learn it later. So there’s a lot of those unique features. Like we’ll talk about that in, I think a narrative doc site, it has a lot of these kinds of special case activities that you can use with the activity schema as well.

Like what happens in suppose fraud would happen with someone later gets identified. You just simply have a special case activities, and then you can go back and be activities and constantly be stitching that user to create a really reliable global identifier. Because activity schema is always incremental and the way it updates, which we talk a little a lot about in our spec, it is super optimized to now be fast.

You’re constantly inserting new data, which is only based on the most recent data. You’re not really just regenerating the whole thing and you can constantly stitch that identity and update those set roles. So [00:24:00] that’s ended up becoming very cheap, very efficient, and it keeps it all within one table.

So not really just reprocessing everything. You’re just working off this one.

Yeah, yeah. Customers doing this thing on that. Yeah. We see customers creating beauty building blocks for each transformation and putting them into active schema. We’ve seen customers. Really, I think Fivetran did it because it was project to tackle also making sure. Entirely integration of certain data using dbt .

So there’s this really nice just harmonious world where you can define these building blocks in dbt and combine them to create activity schema. The, we haven’t been able to actually do the full auto generated queries in dbt, but as needed to learn how it works, you get the habit of how to use it.

So we’ve seen that happen a lot. I identity stitching question. I will put a link in there to help you see how it [00:25:00] instituting works. It is the core part of two. The activities, what works, because if you’re going to build a table, has that’s been around customer and time. You better make sure that your activities scheme is as detailed as possible.

Are your customers as complete as possible? So it’s a lot more nuance was actually one of our more complicated pieces of a thinking process. We’ll actually add that in how to help really dive these things to.

Another thing, just to mention, because activity schema is all happening in time and you can use the flight. Last, before we did, this is really also very helpful for slowly changing dimensions as well or changing facts. So you can imagine you’re a bank and you have like total assets under management for each customer, and you want to know.

Someone opened the app. What was their asset? Central management. You can just say, give me every time the customer opened the app. I can give me the last time before they updated their assets under management [00:26:00] and get that exact total revenue. So if we just slowly changing dimensions, slowly changing facts that are pulling them in because of the time series nature, you can literally not work.

Normally think about what, how, when this happened. You can literally pull in the last time, something like that occurred. This really lasts before. First of all, I think dimension is a lot of times where buy data, people really love narrator is just and that just came out because these kinds of questions take a lot of time to do SQL and they’re really confusing, but you can just define them independently and bring them into the table as you need them.

Going through some questions. Yeah.

So as we go through more and more of these questions really just hope everyone gets a chance to try it. I think that getting experienced activity schema more often changes how people think about [00:27:00] data. You start realizing that so much of what we do is so repetitive and so much of what is exciting about data.

It can be like abstracted away and we can iterate. It’s really fun also to see customers that have transitioned from one company to another, that they use activity, schema, where you can go into a new company and see all the dates. In the structure of the data being the same from one company to another, and you can instantly pick it up.

You just got to learn what activities mean and then makes asking and answering questions. It means it makes answering a question with data just as easy as asking it. And that’s the beauty. Like we tell our customers, if it takes you more than 15 minutes to answer a question with narrator, you’re probably doing something wrong.

Everything should be that simple because activity schema is in a way of thinking. We put all these constraints in place. So that the only you’re thinking about is how do I think about the customer journey and what am I looking for? And then I can easily assemble it. And the more people that do that, the faster, we’re all able to answer questions.

The more we can make correct [00:28:00] decisions. And I think. The world. I, if anyone has any more questions, feel free to post it, but like the proof is in the bank. I haven’t been saying this thing forever. I think that if you’re ever skeptical activity schema, please try it. And let’s see Let’s just see an action, because if you can find questions that we can answer with activity schema, like proven by contradiction then.

Great. But in the last five years, we haven’t really seen any question that activities schema cannot answer. And I’m excited to have you guys all experience that.

Last modified on:

dbt Learn on-demand

A free intro course to transforming data with dbt