Coalesce 2021 Frameworks & best practices The Operational Data Warehouse: Reverse ETL, CDPs, and the future of data activation

The Operational Data Warehouse: Reverse ETL, CDPs, and the future of data activation

Arjun Narayan, Ashley Van Name, Calvin French-Owen, Rachel Bradley-Haas, and Tejas Manohar

Arjun was previously an engineer at Cockroach Labs. He holds a Ph.D. in Computer Science from the University of Pennsylvania.

Ashley and her team have been working with dbt for the past several months, and have used it to give their enterprise data warehouse a complete makeover. She loves working with dbt and believes it is an essential ingredient for building a modern data warehouse.

Formerly a co-founder of Segment, Calvin is currently digging into energy generation, data visualization, education, AI, and Tools for Thought. In his free time, he runs various trail races and hacks on a few side projects.

Discover more about author and dbt enthusiast Rachel Bradley-Haas.

Tejas is the co-founder of Hightouch, the leading Reverse ETL platform, which syncs customer data from your warehouse into the tools your business teams rely on.

For the last few decades, the data warehouse has mostly been seen as a platform for reporting and analytics. But as the center of gravity for data shifts towards the data warehouse, data warehouses are increasingly being used for operations as well.

Operational use cases for the data warehouse range from predicting flight delays in near real-time, to equipping sales and customer success teams with contextual customer data, to powering in-app personalization.

In this panel, we’ll bring on an assortment of thought leaders and data practitioners to discuss how and why companies are using warehouses for operations today. We’ll also touch on the challenges in operationalizing warehouses, and discuss future technology advancements that could unlock the warehouse to become the de facto data platform for operations.

Browse this talk’s Slack archives

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript

Amy Chen: [00:00:00] Awesome. Looks like we’re live now. Welcome everybody. I hope you have all been enjoying your first day of Coalesce. Now it’s time for The Operational Data Warehouse: Reverse ETL, CDPs, and the Future of Data Activation. My name is Amy Chen. I’m a senior partner engineer at dbt Labs. I use they/them pronouns, and I will be your host for the session.

You’ve joined us for an amazing panel discussing what is necessary to make the data warehouse the right data platform for operations. I know Arjun is planning to introduce everyone properly, so I will not steal his thunder. Before then, I have some quick housekeeping. All chat conversations will be taking place in the #coalesce-hightouch channel of dbt Slack.

If you’re not already part of the Slack, you have time to join right now. Visit community at getdbt.com and search for Coalesce Hightouch when you enter the space. In the Slack channel, we welcome you to ask questions, comment, and react at any point. I’m challenging you to end Day 1 of Coalesce with great [00:01:00] questions and maybe even better memes. I’ve told all of my friends that Coalesce often feels like an AOL chat room.

So please support me here. After the session, our speakers are going to be available in the Slack channel to answer all of your questions. However, we really encourage you to ask questions at any point during the session. Now let’s get started.

Arjun, over to you.

Arjun Narayan: Thank you very much, Amy. I’m delighted to be here. More importantly, delighted to be here with an amazing panel today. We’re going to dig into the operational data warehouse and what even is Reverse ETL. But before I get into that, I’d love to introduce everyone with me today. We’ve got Ashley Van Name, who is the manager of data engineering at JetBlue.

We have Calvin French-Owen, co-founder and former CTO of Segment. We have Tejas Manohar, the co-founder of Hightouch and Rachel Bradley-Haas, the co-founder of Big Time Data. And thank you so much, Amy, for the introductions. And I guess last there’s also [00:02:00] me. Unfortunately, you have to deal with me. I’m the co-founder and CEO of Materialize.

[00:02:05] What is reverse ETL and the operational data warehouse?

Arjun Narayan: All right, so I’m excited to kick it off today by starting with what even is Reverse ETL and the operational data warehouse. Tejas, I’d love to get your take on this.

Tejas Manohar: Cool. I can take this one. So I’m Tejas, one of the founders of Hightouch, and basically how we think of Reverse ETL or operational analytics is the idea of actually using the data warehouse for more than analytics. So the idea is, today most companies have a data warehouse and they’re starting to use that information in BI tools to answer questions and do reporting and analytics for their business. But there’s so much opportunity around an organization to actually serve teams that want to use all this rich data in the data warehouse closer to their actual day-to-day operational workload. So the idea of operational analytics is just that: it’s taking all the work you’ve done in your analytics stack and operationalizing it by bringing it closer to all the business teams. And Reverse [00:03:00] ETL is a process of doing that, which is what we do at Hightouch, where we basically just move data from the data warehouse into different systems across sales, marketing, ads, etc., like Salesforce, Marketo, Facebook Ads, anything like that. It’s taking data from the warehouse and moving it to these systems, basically the opposite

Arjun Narayan: Maybe to ground ourselves with that definition, I’d love to understand some of the use cases that this is used for, and in particular I very much want to hear from the practitioner side of the house. Ashley, I’d love to ask: what’s your favorite operational use case for data and data analytics coming from data warehouses?

Ashley Van Name: Yeah, thanks for the question, Arjun. We haven’t really dipped our feet all that much into the Reverse ETL.

Although I know it is a really new space and I’m interested to see where we could potentially take that. In terms of what JetBlue has done so far with operational analytics, we’re really focused on making sure that we’re able to pull as much value as we can out of the data that we [00:04:00] have stored and maintained in our data warehouse on a daily basis.

Now we get data from all different sources, batch feeds, real-time feeds. And the great thing about the warehouse is that you’re able to put all those things together and produce reports that maybe you wouldn’t necessarily be able to do if you were just pulling them out of the individual systems themselves.

You asked for an example. One of my favorite examples of this is one of the first projects I worked on at JetBlue, where we built out a tool to help our operations center predict and better handle situations where customers were misconnecting from their flights. Being able to actually see data being used to help drive a better customer experience is something that really motivates me personally.

This is just an example of something that you can achieve if you’re able to make available the data that you need to satisfy those types of use cases.

Arjun Narayan: Excellent. Rachel, I’d love to hear from you as well. Given your experience building and implementing a lot of these across maybe a [00:05:00] variety of companies as well.

Rachel Bradley-Haas: Yeah, definitely. So I think one of the most amazing things that we’ve been able to do is bridge that gap between product data and your CRM or third-party tools.

So, like Ashley mentioned, you bring it all into the warehouse. You have a lot of product data that doesn’t necessarily make sense modeled in the same way as this historical account model that’s in, say, Salesforce or whatnot. And so one of the things that we like to think about is how do you support this freemium online usage all the way through enterprise customers by replicating the way data is modeled in the production database, whether it’s organizations, workspaces, all that stuff.

How does it relate to, say, an account-level thing like the Salesforces of the world or Fidelity? How do all these individual users or organizations map to it in a cohesive model that makes sense to sales teams or to marketing or to support? And so a lot of what we do is build these very intricate, very specific data models that support each individual use case, and then we’re able to systematically sync [00:06:00] that data up into the CRM tool of choice and make sure that it makes sense for how sales or marketing wants to engage with these different customers.

Arjun Narayan: Excellent. Calvin, I’d love to hear your favorite use case for operational data.

Calvin French-Owen: Yeah. I guess I’ll share a couple, and maybe before I dive in, I’ll just give some very quick background on Segment, just so everyone knows what it is.

Segment essentially gives developers an API to collect data about product actions from their website, their mobile apps, and different internal tools, and send that to any downstream destination that they want, one of the most popular ones being a data warehouse. And so we use Segment a lot internally to pull in data, like people signing up, creating new sources, inviting teammates to their workspace, that kind of thing. And it all ended up in the data warehouse. But in a lot of cases, you want data that’s maybe not [00:07:00] happening on the website. For instance, we would pull in a lot of things around, like, hey, we got an email that signed up.

How much traffic does that domain get per week? How many users already exist from that company? If it’s a big account, like IBM, what’s the time between when that person signed up and when they actually started sending data, which was a totally out-of-band process? Are they signed up for a self-service Stripe plan, that kind of thing.

And essentially we’d have all these different sources of data that we needed to end up analyzing in the same place. And so that’s why we put them in our warehouse, and having them in the warehouse allowed different teams to take action on them, depending on what they needed to do. So for example, our sales team got a report saying, hey, if you respond to a lead within the first hour of them signing up, their potential to buy goes up by like 5x.

The problem is that we were getting 3000 leads a week, right? And so we had no way of being able to reach out to all of them. So we had to take this data about whether they’d be a good lead, which was populated in our warehouse, join that together, and then alert the sales team to [00:08:00] engage a live chat via Chili Piper.

So that’s just one example of the more complicated things you can do that provide a huge business impact if you have all the data in one place.
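To make the pattern above concrete, here is a minimal Python sketch of joining signups to enrichment data and keeping only the fresh, high-value leads to alert on. Everything in it is an assumption for illustration: the field names, the scoring thresholds, and the one-hour window default are hypothetical and are not Segment's actual logic, and the rows stand in for what a warehouse query might return.

```python
from datetime import datetime, timedelta

# Hypothetical signup events, as they might be read from a warehouse table.
SIGNUPS = [
    {"email": "ana@bigco.example", "domain": "bigco.example",
     "signed_up_at": datetime(2021, 12, 6, 9, 30)},
    {"email": "bo@tiny.example", "domain": "tiny.example",
     "signed_up_at": datetime(2021, 12, 6, 9, 45)},
    {"email": "cy@bigco.example", "domain": "bigco.example",
     "signed_up_at": datetime(2021, 12, 6, 7, 0)},
]

# Hypothetical enrichment data keyed by domain (traffic, existing users).
DOMAIN_STATS = {
    "bigco.example": {"weekly_traffic": 2_000_000, "existing_users": 40},
    "tiny.example": {"weekly_traffic": 300, "existing_users": 0},
}


def score_lead(stats):
    """Toy score: one point per signal that suggests a larger account."""
    score = 0
    if stats["weekly_traffic"] >= 100_000:
        score += 1
    if stats["existing_users"] >= 10:
        score += 1
    return score


def leads_to_alert(signups, domain_stats, now,
                   window=timedelta(hours=1), min_score=2):
    """Join signups to domain stats and keep fresh, high-scoring leads."""
    alerts = []
    for signup in signups:
        stats = domain_stats.get(signup["domain"])
        if stats is None:
            continue  # no enrichment data, so this lead cannot be scored
        is_fresh = now - signup["signed_up_at"] <= window
        if is_fresh and score_lead(stats) >= min_score:
            alerts.append(signup["email"])
    return alerts


ALERTS = leads_to_alert(SIGNUPS, DOMAIN_STATS, now=datetime(2021, 12, 6, 10, 0))
print(ALERTS)
```

In a real setup the join and scoring would more likely live in a warehouse model, with a Reverse ETL sync handling the alert; the sketch only shows why the join needs all the data in one place.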

Arjun Narayan: Awesome. And I’m finally going to ask Tejas for your favorite use case.

Tejas Manohar: Yeah, so a lot of the popular use cases have definitely been taken at this point, but I would say that my favorite use cases are really things that you don’t expect to be powered by the data warehouse. One of our customers is Blend, which is a company that provides software to help banks with mortgages and loans and stuff like that.

They actually use Hightouch not only to power syncs into their CRM and marketing solutions, but to actually power real-time alerts and workflow automations across Slack and some of the project management tools used by the customer success team, like Asana. It actually pings the right people at the right time when there’s something interesting going on in their customer record or in their in-app product usage that those folks should know about or take action on.

So I think that’s super interesting because usually you imagine these use [00:09:00] cases being handled by something like Zapier or a little Slack API call here and there on a webhook. But by being able to use all the data in the data warehouse, and the source of truth you’ve built from your customer information there, to actually power these workflow automations, it’s just a lot easier and a lot more powerful.

[00:09:19] Why is the warehouse involved?

Arjun Narayan: You’ve somewhat anticipated my next question, which was going to be: why is the warehouse involved? We often think of the warehouse as this end destination for all of the data, which then gets stared at by these sort of questioning philosophers, deep in thought, staring at some complex dashboard.

Why is the warehouse a central feature of this architecture, and why is the data even being used downstream of the warehouse, as opposed to all of this stuff happening upstream? I’d like to throw this to you, Rachel. What would you do if you were forbidden from using a warehouse? That is maybe a good way to analyze it. What are the alternatives? Why the warehouse?

Rachel Bradley-Haas: I [00:10:00] don’t want to think of a world without a warehouse. So I don’t even know how to answer that question. It’s like saying, how would your nervous system react if you didn’t have a brain?

I don’t know. It wouldn’t work, or it would be sending very incomplete information. And I promise I did not prep that analogy, but it’s really what I was thinking about. Why do you send everything to your brain? Because your brain is built to process and handle that heavy workload, and it can bring together information from different areas of your body and say, okay, we know this about this, we know this about that; now, where should my arm be moving?

If I see a soccer ball flying at my face and my brain doesn’t tell my arm it needs to block it, that’s not going to be pretty. So I think that’s how we should think about the warehouse. All of these things are coming in, and only your brain, or only the warehouse and the people that know how to scale or understand all the business caveats, are in there working to understand what you need to do with it and where you need to send this information. There’s a lot of built-in knowledge that goes into tools like dbt or all of these different transforms that say, [00:11:00] when this data point comes in here and it matches with this other thing from a different source, you know X needs to happen, so now we need to send X.

And so I think it just allows everyone, finally, to invest in one place where the knowledge is, and it’s like a huge brain dump of how all of this connects. Then you send it to different places, which makes it a lot easier than if you’re rebuilding that logic. If my hand had a brain over here and my other hand had a brain over here, you’d have to be sending so much information back and forth.

It would just be like very overwhelming. So I don’t want to imagine a world without a warehouse. And at this point I definitely don’t want to imagine a world without Reverse ETL. It would be redundant, all over the place in my mind. So hopefully that answers your question.

Arjun Narayan: That’s a great analogy, because sometimes when interacting with some organizations you really have this frustrating feeling where you’ve just talked to somebody on this side of the organization and they’ve told you one thing, then you talk to another one and they’re like, hello, who are you?

Rachel Bradley-Haas: I swear, it’s one of those [00:12:00] things. You reverse engineer things, way more than you ever wanted to when you’re trying to compare metrics and it’s one of the most frustrating things cause you don’t feel like you’re moving forward. You’re just trying to correct what’s already been done. So it’s like the more time you spend on that, the less time you can actually go and understand what does that number really even mean?

[00:12:17] What are the challenges in using a data warehouse for operational analytics?

Arjun Narayan: Yeah. On the flip side of this, the data warehouse is a great place to centralize information, and it’s capable of storing vast quantities of information, which you need to have on hand. I’m struggling to get away from these phrases with your analogy. But what are the challenges in using a data warehouse for operational analytics?

It’s not a free lunch, right? It’s not perfect. There’s probably things that frustrate you on a day-to-day basis. And Ashley, given that you use a lot of Snowflake for data analytics, what could be better with operational analytics?

Ashley Van Name: Yeah, it’s a great question. And before I start, I will say I put my bets on Rachel being one of the most quoted [00:13:00] people from today’s sessions on her great analogy of the brain and the warehouse. I love that. Okay. Yeah. So in terms of what bottlenecks have we been seeing? Of course we have our own set of challenges when you try to take the warehouse to be as real time as possible.

A lot of things happen when you’re at scale, right? When you’ve got millions of messages coming into your warehouse for a given table on a single day it gets really hard to deliver and serve up that information to the receiving applications, in a timely manner. We saw this happen in our own implementation.

We were one of the first groups to maintain one of the first projects that has dbt lambda views. I think Amy Chen wrote an article about that a couple of years back. So lambda views are great. But when you have a million messages that you’re trying to flatten in one second in Snowflake, it’s a little tough.

And even when you scale up your warehouse, you’re, number one, spending a lot more money than you probably want to. And number two, you still might not get the data back in a reasonable amount of time, depending on, what your SLAs [00:14:00] are for the downstream applications or reporting tools that you’re maintaining.

I will say, we’ve had a lot of progress recently in trying to change our ways a bit to do a little bit more transformation on ingestion versus just a pure ETL load. We found that inserting raw JSON messages into a raw table and flattening them with dbt really was just too slow.

And so what we ended up doing was picking out a few key fields, flattening those on ingestion, and putting them in their own columns so that Snowflake can leverage its own pretty smart partitioning logic behind the scenes to help increase the speed at which you can get results back. You know what I’m trying to say. You end up getting your results faster once you do that pre-transformation.

Arjun Narayan: Yeah. It sounds like speed is maybe the biggest cost of using the data warehouse today.
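As a rough illustration of the "flatten a few key fields on ingestion" idea, here is a minimal Python sketch. The message shape and the field names (`flight_id`, `status`, `event_time`) are assumptions made up for the example, not JetBlue's schema, and the function only builds a row; loading it into a warehouse table is left out.

```python
import json

# Hypothetical key fields worth promoting to their own columns.
KEY_FIELDS = ("flight_id", "status", "event_time")


def flatten_on_ingest(raw_message):
    """Turn one raw JSON message into a row with key fields as columns.

    The full payload is kept in a `raw` column so nothing is lost and
    less common fields can still be flattened later if needed.
    """
    payload = json.loads(raw_message)
    row = {field: payload.get(field) for field in KEY_FIELDS}
    row["raw"] = raw_message
    return row


MESSAGE = (
    '{"flight_id": "B6123", "status": "DELAYED", '
    '"event_time": "2021-12-06T10:00:00Z", "gate": "C7"}'
)
ROW = flatten_on_ingest(MESSAGE)
print(ROW["flight_id"], ROW["status"], ROW["event_time"])
```

The design choice is a trade-off: queries that filter on the promoted columns no longer need to parse JSON at read time, while the raw payload stays available for everything else.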

There’s an excellent question in the chat, and I want to thank Luke for it. Maybe we have the perfect audience here, and maybe we can have [00:15:00] Calvin and Tejas duel on this one. I’m going to throw it to you first, Calvin: what’s really the difference between using a CDP and using your own warehouse? Both tools can sync data to the SaaS providers. Why would you use a warehouse in the flow versus just use a CDP?

[00:15:18] Why would you use a warehouse in the flow versus just use a CDP?

Calvin French-Owen: So when I think about a CDP, traditionally I think of systems that are primarily real-time and primarily focused on marketing automation use cases. So if you think about the way Segment works, for example, the data you’re collecting from your website is transferred within 1 to 1.5 seconds to a bunch of different tools that you might be using, whether that’s an email tool like Customer.io, an ads tool like Facebook Ads, or a help desk like Zendesk or Front, something like that. I think where a warehouse really shines is when you don’t care as much about latency, but instead you care more about data completeness, in particular where you care about joining that data together.

So remember, when you’re [00:16:00] streaming these little bits of data through, you’re probably not able to say, oh, okay, how does this user compare to others, for instance? Are they in the 10th percentile or the 90th percentile? You can’t get that from a real-time data system like a CDP. And so that’s really where I think a warehouse can shine, where you can run any query and pull that information together.

And then you use Reverse ETL to sync it to those other systems. It’s sort of a latency-amount of data trade-off.
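The percentile example above can be sketched in a few lines of Python. This is only an illustration with made-up usage numbers: the point is that a user's rank depends on every other user's value, so it needs the complete data set rather than a single event in a stream. The definition used here (share of users with strictly lower usage) is one of several common ones and is an assumption of the sketch.

```python
from bisect import bisect_left


def percentile_ranks(usage_by_user):
    """Map each user to the percentage of users with strictly lower usage."""
    values = sorted(usage_by_user.values())
    total = len(values)
    ranks = {}
    for user, usage in usage_by_user.items():
        lower = bisect_left(values, usage)  # count of values below this one
        ranks[user] = 100.0 * lower / total
    return ranks


# Hypothetical per-user event counts, as a warehouse query might return them.
USAGE = {"u1": 5, "u2": 50, "u3": 500, "u4": 5000}
RANKS = percentile_ranks(USAGE)
print(RANKS)
```

Once a value like this is computed in the warehouse, it can be synced to downstream tools as an ordinary attribute.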

Arjun Narayan: It sounds like, in many ways, the thing that rules out data warehouses is where low latency is such an absolute requirement that you can’t bring a data warehouse into the loop. Although you might want to bring it in if you really need to join against different data sources, so it’s more this trade-off of comprehensiveness versus speed of action. Is that a fair characterization?

Calvin French-Owen: Yeah, I think so. I mean, data warehouses are also getting better and better at the fast use cases, too.

But when you think [00:17:00] about how the systems are set up and designed, a system that’s designed for streaming inserts, where you just take a little bit of data and go, like Kafka, has fundamentally different properties and a different cost envelope than one which is designed to store that data and analyze it for all time.

Tejas Manohar: One thing I would add here, which is obviously biased coming from the Reverse ETL and Hightouch perspective (and for context, I used to work at Segment), is really what we see in our clients. A big reason that people are starting to turn their data warehouse into a CDP using something like Reverse ETL, so basically taking the data warehouse and turning it from this kind of cold storage where they just ask questions into something that’s live, serving all these use cases around the company, isn’t because it’s necessarily the fastest database in the market, right? That’s definitely not the reason.

The real reason is because it actually has all the data and it has all the data models that companies and business teams need for those use cases. So oftentimes I tell clients that the [00:18:00] reason people are turning the data warehouse into a CDP is because it is your customer data platform.

It is where most of the customer data exists around your organization, where the most data models and the latest data models are being developed first. And a lot of other things follow in the organization. So what we find is that time to value is honestly just one of the biggest reasons people are jumping on Reverse ETL.

There are all these problems around an organization that can be solved by adding a little bit of data for this tool or that tool, or structuring it in a different way, or running a join or an aggregation, a bit of a transformation in something like dbt or SQL. And with the data warehouse, where you can easily get all the data (you have an infinite number of tools like Fivetran and all those sorts of providers, as well as tools like dbt to transform all the data), you really can easily build a source of truth there.

And with Reverse ETL, you can make it actionable as well, by moving it into the rest of your systems. So for us, we find that it’s just the fastest path to solve some of the problems that business teams are facing when it comes to using data.

Arjun Narayan: This is a [00:19:00] great point, which is that the time to value of using data warehouses and Reverse ETL is very short. I am also pretty biased here. I think a lot of that is because of the power of SQL and the fact that you can get so much done with so little SQL as opposed to writing custom programs. But if someone in the audience is interested in this (and I want to thank Janessa for this question): there are a lot of use cases and a lot of potential, but what would be your recommendation? What are the top two or three use cases people could use to get started, if, say, they are trying to start their journey of making their data warehouse operational?

[00:19:42] What are the top two or three use cases people could use to get started?

Rachel Bradley-Haas: I can take this. A very simple one, for example: as I mentioned, or think I mentioned, I work a lot with B2B companies. And so what we end up finding is that there’s not a lot of product usage available in your CRM tool, or what is in there is manually entered and whatnot. So if you have a very easy way to tie what’s going on [00:20:00] in the product with how you’ve mapped it in Salesforce, it’s very easy to say, hey, we have a table of what the product usage is in our data warehouse.

Even just starting with a single column, for example, one that says how many API calls this client had in the previous 30 days, or how many active users in the previous 30 days, and just writing that to a custom field on an object in Salesforce: it’s so simple, but it is just the easiest POC.

Next thing you’re going to have custom objects that you’re going to want to be filling out that then trigger flows downstream. It’s just like starting with one very basic thing to show. This is how powerful it can be. And then it’s just going to explode from there in such an amazing way. It’s been really fun to watch a lot of our clients go that way.

You set up one thing going to Salesforce. Next thing you know, like Tejas was saying, it’s like, can we do this to Slack? What about Intercom? Can we send all these different things? And so they say, oh, we love this data in Salesforce. It would be so great if it was easy to now send it to a different tool.

I just went and implemented the dbt data model. All you have to [00:21:00] do is add a new destination and sync the exact same field. And they’re like, "Oh, so it won’t take the same amount of time?" And I’m like, "No, it’ll probably take me five minutes." And so they love that, and they just realize... I guess the one thing I would say is, as programmers, we all hate repeated code, right?

That’s like a no-no. And so that’s what Reverse ETL does. You build these cohesive models and then you don’t repeat your code. You just point to the function and say, go run this. And then you point it to a new destination. So yeah, that’s probably the easiest way I would say to get started.
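The "build the model once, then point it at a new destination" idea described above can be sketched as follows. This is a hedged illustration, not how Hightouch or dbt work internally: it uses an in-memory SQLite database as a stand-in for the warehouse, and the table name, the `API_Calls_Last_30_Days__c` custom field, and the payload shapes are all hypothetical. No external API is called; the sketch only builds the payloads each destination would receive.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE api_calls (account_id TEXT, called_on TEXT)")
conn.executemany(
    "INSERT INTO api_calls VALUES (?, ?)",
    [
        ("acme", "2021-12-01"),
        ("acme", "2021-11-20"),
        ("acme", "2021-09-01"),  # older than 30 days, so excluded
        ("globex", "2021-12-05"),
    ],
)

# The "model": one query, written once, that every destination reuses.
MODEL_SQL = """
    SELECT account_id, COUNT(*) AS api_calls_30d
    FROM api_calls
    WHERE called_on >= date(?, '-30 days')
    GROUP BY account_id
    ORDER BY account_id
"""


def run_model(as_of):
    """Run the shared model and return one dict per account."""
    rows = conn.execute(MODEL_SQL, (as_of,)).fetchall()
    return [{"account_id": a, "api_calls_30d": n} for a, n in rows]


def to_crm_payload(row):
    """Shape a model row for a hypothetical CRM custom field."""
    return {"AccountId": row["account_id"],
            "API_Calls_Last_30_Days__c": row["api_calls_30d"]}


def to_chat_payload(row):
    """Shape the same row as a hypothetical chat message."""
    return {"text": f"{row['account_id']}: {row['api_calls_30d']} "
                    "API calls in the last 30 days"}


# Adding a destination means adding one small mapping, not a new model.
DESTINATIONS = {"crm": to_crm_payload, "chat": to_chat_payload}


def build_syncs(as_of):
    """Run the model once and fan the rows out to every destination."""
    rows = run_model(as_of)
    return {name: [shape(r) for r in rows]
            for name, shape in DESTINATIONS.items()}


SYNCS = build_syncs("2021-12-06")
print(SYNCS["crm"])
print(SYNCS["chat"])
```

The logic that defines "API calls in the previous 30 days" lives in exactly one place, so a new destination reuses it instead of re-implementing it.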

Tejas Manohar: Yeah, I think that’s spot on. I think the ability to just start really quickly and tackle whatever use case is top of mind, and then incrementally add more and more, is where the whole data warehouse and Reverse ETL approach shines, assuming you’re at a company that’s already invested in some of those prerequisite data stack steps first. I think the example of a CRM, a sales CRM or customer success CRM, on the B2B side of the house is super spot on. That’s one of the top use cases that I see with people doing Reverse ETL. Another one [00:22:00] that I would mention is that most consumer companies, or even B2B companies for that matter, pretty much all have demands to run more and more personalized and targeted marketing initiatives, whether that’s on email or push or Facebook ads, Google ads, whatever it is. They just want to leverage all the data they know about a customer to make that marketing initiative more relevant.

Typically, one of the biggest bottlenecks to doing this is not that they don’t have a tool that can slice and dice data. It’s just that they don’t actually have the data in those systems. So Reverse ETL is a great solution to that, because you just take a SQL query and say, I want this data to show up in this system or that system, in a really flexible way, without changing anything about how the data is collected or how it’s modeled or anything like that. You can use it in the form it’s in right now.

And that’s probably the second most common use case that we see at Hightouch.

Arjun Narayan: Awesome. Thank you both of you. One question is, how do you set up your data platform to actually be used by the end users in [00:23:00] the rest of your company? If you were in a team that’s providing this sort of maybe as a service, what are some sort of practices that you would adopt? I guess I’ll start this with Ashley.

[00:23:09] How do you set up your data platform to actually be used by the end users in the rest of your company?

Ashley Van Name: Yeah. This is always an exciting question to answer. I think the biggest thing and the biggest principle that I try to follow is to treat your data warehouse as a product, right? Imagine that you are running a startup and you’re building a product that you want customers to buy and use and talk about and promote.

You want your data warehouse to really be looked at in that light. I just hired a data product manager who works on my team, and his main role is to go out and set up meetings with internal stakeholders who are power users of our data and ask them: How’s it going? Are you having issues with data?

What other data sets do you need added into the data warehouse so that you can do your job? I don’t know how it is at everyone else’s organizations, but previously we operated under a model where we pretty much just went through request tickets, right? Folks would set up time on our calendar and talk [00:24:00] about a need that they had.

And then we would act on that. We’re turning the way that we work sort of on its head. We are now being proactive and reaching out to folks to get their feedback. And then we’re trying to bake that directly into the product, the data warehouse. It gives people a better sense of trust in the data team.

It makes people excited. And again, it really just helps to promote the product so that it’s used by more people. Like I said at the very beginning, we haven’t done a whole bunch of this Reverse ETL stuff, but I could imagine that the more people you have in your platform, the more of these kinds of insights they’ll be able to think about, and they may come to you down the road and say, hey, I also use Salesforce and I see you’ve got 10 tables in Snowflake, and I think that I can really use them inside of this other software that I use. The more data in the warehouse and the more users, the better value you’re going to get for your business.

Arjun Narayan: There’s a lot of talk about how you do one thing and then other teams, autonomously, see it and say: I want that. I want a copy of that, and I [00:25:00] want to use that data. I want to build on top of that, maybe merge it. And this requires a certain ability of the warehouse to support this sort of multiplying of use cases. Do you have best practices, like creating a new warehouse for each team? How do you structure this specifically under the hood? Rachel, you laughed a little too much. You’re on the hook now.

Rachel Bradley-Haas: I think, no, you don’t, because I think the whole point of Reverse ETL is a warehouse-first approach, right? Where you centralize everything. And I think the biggest thing is you can have the hub-and-spoke model, but you still need to have the hub. So you can have people that specialize in bringing in information and saying, here are the caveats of the data, but you still need to have a centralized data team that knows how everything interconnects.

No, I wouldn’t create a different warehouse or a different database. Obviously you have your production database, but you have your data warehouse, which, once again, we’ll call the brain. You have one brain for a reason. I guess you have your left and your right brain, but we’ll just say that one’s the production database and one’s the go-to-market [00:26:00] side.

And so you really need it centralized to be able to build this data governance layer. Otherwise you just, once again, have so many different places that have different logic, different filters everywhere. No one has time for that.

Arjun Narayan: Would you build multiple tables?

Rachel Bradley-Haas: Multiple tables. And as Ashley said, one thing that’s been really nice is I almost feel like with these new tools, data people get to be the hero, finally. We’re not the ones that are just thrown stuff. You can go and ask, “What’s your biggest pain point?”

And then they think you’re a magician, because you can all of a sudden do these super easy things, writing SQL, and then you can automate things in third-party tools. They think it’s magic, and you’re thinking, finally, they don’t think I’m just a report monkey. And so it’s just really nice to finally be able to be a hero and have the tools that we need to do what we need to do in our natural language of how we process data, and then send it to different tools.

Thank goodness I don’t have to make API calls, because I would never do that. I would rather quit before I started writing a bunch of API calls. So thank you, Tejas, for creating a tool that makes it so I don’t have to do [00:27:00] it and I can just scale my abilities that way.

So yeah, different tables for sure, based off of the functional buckets of work. Schemas are way more important in my mind than different warehouses. I think schemas, and the organization of them, and the file structure of how you’re submitting code, all of that stuff is so important.

But definitely keeping it very organized is really relevant, especially when you’re going to start sending data to third-party tools. Because when something goes sideways, you want to know exactly where you need to look to figure out what went wrong.

Arjun Narayan: And Calvin, I don’t know if you have an answer, how do you set up your data platform to actually be used by the end users across your company?

Calvin French-Owen: Yeah, I wouldn’t say we ever a hundred percent nailed this at Segment, though our data engineering team was really incredible. I think the way that we structured it is that individual product teams would be responsible for inputting metrics.

Let’s say you’re a product team that’s focused on the setup flow. You’re tracking things like people signing up, inviting [00:28:00] teammates, setting up their source, getting data flowing, that kind of thing. And all of that would get dumped to the warehouse. Because again, to Tejas’s point, data has gravity, and it’s better to put everything in one place so that when someone’s reaching for a report or a metric or a flow that they want to create, the data is all there.

And then what we did is our data eng team basically had a bunch of Airflow jobs that they built on top of the warehouse, where they would create what we call these golden reports. The golden reports are a cleaned-up version of a table that almost everyone at the company who wants to answer a question knows about.

It’s, okay, I need to get a list of customers. That’s a view that’s published by the golden reports. Or, okay, I need to see how many signups we got in the last week. That is a golden report, that kind of thing. And so essentially we had centralization around the metrics that were company-level metrics, stuff like revenue, churn, customers, that sort of thing.

And then individual teams could still reach for those product level metrics, which would be separate [00:29:00] tables and join across them. And that was a little bit more decentralized, but the stuff that needed to happen for like Q1 reporting to the board that would all be golden reports. And so that way we had a little bit of give and take where it’s okay, make the stuff that everyone needs easy.

And then if you want to do your own exploration, you can. But you may have to do a little bit more work to figure out where the data is.
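
The golden reports pattern described above can be sketched in a few lines. This is only an illustration, not Segment’s actual Airflow setup: it uses an in-memory SQLite database in place of a warehouse, and the table `raw_signups`, the view `golden_customers`, and the `is_test` flag are hypothetical names chosen for the example.

```python
import sqlite3


def create_golden_customers(conn: sqlite3.Connection) -> None:
    """Publish a cleaned-up, company-wide view on top of a raw product-team table."""
    conn.executescript(
        """
        -- Raw events dumped into the warehouse by a product team (hypothetical schema).
        CREATE TABLE IF NOT EXISTS raw_signups (
            user_id      TEXT,
            account_id   TEXT,
            signed_up_at TEXT,
            is_test      INTEGER
        );

        -- The "golden" view: one row per real customer account, test data filtered out.
        DROP VIEW IF EXISTS golden_customers;
        CREATE VIEW golden_customers AS
        SELECT account_id,
               MIN(signed_up_at)       AS first_signup_at,
               COUNT(DISTINCT user_id) AS user_count
        FROM raw_signups
        WHERE is_test = 0
        GROUP BY account_id;
        """
    )


def demo() -> list:
    """Load a few sample rows and read them back through the golden view."""
    conn = sqlite3.connect(":memory:")
    create_golden_customers(conn)
    conn.executemany(
        "INSERT INTO raw_signups VALUES (?, ?, ?, ?)",
        [
            ("u1", "acme", "2021-11-01", 0),
            ("u2", "acme", "2021-11-03", 0),
            ("u3", "test-co", "2021-11-02", 1),  # internal test account, excluded
            ("u4", "globex", "2021-11-05", 0),
        ],
    )
    return conn.execute(
        "SELECT account_id, first_signup_at, user_count "
        "FROM golden_customers ORDER BY account_id"
    ).fetchall()
```

In this sketch, anyone who needs “a list of customers” queries `golden_customers`, while the raw table stays available for team-level exploration.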

Tejas Manohar: Yeah. I think one really interesting thing here is the handoffs between the data professionals on the data team and the rest of the organization. And that’s where I think there’s actually a lot of room for improvement and a lot of room for innovation in the space, whether it’s BI or Reverse ETL or CDP, whichever category we’re looking at in data here.

One thing we found in our application is there are folks in the organization who want to define the core data models, like the golden reports Calvin mentioned, in dbt or in a platform like that, and write those back to the warehouse so that they’re accessible to everyone.

Sometimes those folks also want to be the ones configuring [00:30:00] how that data looks in a Salesforce or NetSuite, or how the Slack alerts are fired to various different people on various different conditions in the data. But sometimes they want to hand off that last mile of deciding what to do with the different data points to the actual line-of-business team, someone in a sales ops role, a rev ops role, a marketing tech role, something like that.

And we find that’s where the idea of Reverse ETL becomes super helpful, because you can have the data folks define the core models in the models feature of something like Hightouch, and then have line-of-business folks actually come in and build syncs, using the terms they’re familiar with in Salesforce or Marketo or NetSuite or Slack to just talk about where they want the data to go and how they want it to look, without figuring out how to define things like lifetime value or key events of a user. And that’s where I think Reverse ETL has a lot of unique advantages over CDPs, which try to combine all of these marketing and data needs into one platform in a lot of ways.

Arjun Narayan: A fun question [00:31:00] for you, Tejas. Why is it called Reverse ETL? What is the reverse? The process just sounds like regular old ETL.

[00:31:09] Why is it called Reverse ETL?

Tejas Manohar: Yeah. So I think you don’t always choose your category name. I think that the category chooses you. So we went with the term that our customers would always ask us, or prospects or whoever we’re talking to.

They would always be like, “Oh, is it the reverse of those ETL products?” Instead of those ETL products like a Fivetran, which move the data from all these source systems into the warehouse, you’re turning the source system, like a Salesforce, into a destination, and the warehouse into the source. You’re moving data from the warehouse into these different systems around the company.

So honestly, people just kept saying, “Oh, it’s like a reverse of a Fivetran or reverse of a Stitch, Reverse ETL.” And we just decided that if people were going to say that, we may as well latch onto it and start using it everywhere, since it’s what people are naturally coming up with. Yeah, there are the 10% of people who are like, “This doesn’t make any sense. ETL doesn’t have a direction associated.” But in [00:32:00] reality, it just sticks. So we go with that.

Arjun Narayan: Yeah. I think I am one of those who think it’s just ETL, but I think I’m in the minority, and the people have spoken.

Tejas Manohar: Yeah, to be fair, when it comes to ETL, it’s just getting the data in. But with Reverse ETL, you might actually want to define more workflows, or define more use-case-specific stuff, and the requirements go farther until the business team actually sees value than just dumping the data in the warehouse, as in an ETL world.

Arjun Narayan: How would you go about documenting all of these syncs in a world where there’s a proliferation of Reverse ETL?

And I want to thank around for the question in the #coalesce-hightouch channel. There’s something self-documenting about a warehouse. Not really, but at least it’s all in one place. How do you go about understanding all the various integrations, all the various syncs? What sort of data catalog is there for Reverse ETL?

[00:32:53] How would you go about documenting all of these syncs in a world where there’s a proliferation of Reverse ETL?

Tejas Manohar: Yeah. So I would say one thing we’re hearing more and more demand for from customers is an easier way to get a bird’s-eye view of [00:33:00] all the pipelines and all the moving pieces when it comes to a data stack, especially as they adopt more and more SaaS solutions, disparate ones for each part of that stack.

So I think dbt is putting out a lot of interesting stuff in the core here, like exposures. At Hightouch, we natively integrate our platform with dbt in a way that pulls models from your Git repo. And we can also sync all of the config of Hightouch syncs back into a Git repo and stuff like that.

So we have explored things like auto-generating exposures for your dbt projects. When you look at dbt’s DAG, you can see not only how the models are being produced, but also where they go after that: this one is going to Salesforce via this Hightouch sync, this one’s going to NetSuite via this other Hightouch sync.

To be honest, there’s not one way of doing it. I would say different teams are doing different things, but that’s really something that we’re looking to do more and more of. And I think exposures are just a step in the right direction.
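
To make the exposure idea concrete, here is a rough sketch of turning sync definitions into dbt exposure entries. It is not Hightouch’s integration: the input dictionaries (`name`, `model`, `destination`, `owner_name`, `owner_email`) are a hypothetical sync config format invented for this example, and the output keys (`name`, `type`, `owner`, `depends_on`, `description`) follow dbt’s exposure properties as commonly documented, so they should be checked against the dbt version in use.

```python
import json
from typing import Dict, List


def sync_to_exposure(sync: Dict[str, str]) -> dict:
    """Map one (hypothetical) Reverse ETL sync definition to a dbt exposure entry."""
    return {
        "name": sync["name"],
        # "application" is one of dbt's exposure types; it is an assumption that
        # it is the best fit for a Reverse ETL destination.
        "type": "application",
        "owner": {"name": sync["owner_name"], "email": sync["owner_email"]},
        # The dbt model the sync reads from, so the sync shows up downstream in the DAG.
        "depends_on": [f"ref('{sync['model']}')"],
        "description": f"Synced to {sync['destination']} by a Reverse ETL sync.",
    }


def exposures_document(syncs: List[Dict[str, str]]) -> dict:
    """Build the structure of a dbt properties file containing one exposure per sync."""
    return {"version": 2, "exposures": [sync_to_exposure(s) for s in syncs]}


example_syncs = [
    {
        "name": "salesforce_account_sync",
        "model": "customers",
        "destination": "Salesforce",
        "owner_name": "Data Team",
        "owner_email": "data@example.com",
    }
]

# Render the structure for inspection; a real integration would write YAML
# into the dbt project instead.
rendered = json.dumps(exposures_document(example_syncs), indent=2)
```

The point of the sketch is only that each sync becomes a node that depends on a model, which is what lets the DAG show where data goes after the warehouse.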

Arjun Narayan: One question that I’m thinking about is the difference between, say, doing Reverse [00:34:00] ETL on top of your warehouse versus building applications that sit directly on top of the data warehouse. So the applications issue SQL queries directly to the warehouse and you keep it running.

In which cases would you recommend that folks choose the former or the latter? And I’ll throw this open to anybody who has an opinion on building applications.

Tejas Manohar: I’ll take this one. It’s definitely an interesting idea. We’re seeing a bit of new stuff coming on in that space. A couple of things. One, there are just so many applications across companies. I would guess a 500-person company, if they’re a tech company, probably uses over 50 SaaS tools.

These SaaS tools are basically built around this primitive of an API. Will they be built around this primitive of a data warehouse or a SQL query? We’ll have to see over time. The other thing I would say is that companies want a central way to manage how data is integrated across all the systems.

They want to manage things like GDPR and deletions and access control in one place. Given the level of scrutiny we already see as a Reverse ETL company like Hightouch, with all the [00:35:00] compliance measures we’ve taken, just to connect to a customer’s data warehouse, I would imagine that only gets worse if every single vendor wants to connect to the company’s core data warehouse, versus having one central place to define all the models, regulate everything, and see observability like alerts and metadata.

So personally, I’m of the belief that most teams will focus on their core competency and just have a way to get data into them, like an API, like what we’ve seen over the last decade. And then there will be platforms that help companies use things like Reverse ETL to actually execute on that, get the data into the right places, and abstract between tools and models and all that stuff.

Arjun Narayan: Yeah. Has anybody ever created a cycle? I think we’ve talked a lot about DAGs, but one problem with Reverse ETL is that we’re putting data back into these systems. You have the potential for cycles now. [00:36:00]

Tejas Manohar: Yeah. So people always ask about cycles, but we haven’t actually seen one in practice. I think one thing that we do, and that all the Reverse ETL platforms have probably invested in as well, is this idea of diffing. It’s the idea that we only send changes to downstream tools like Salesforce if there’s actually been a change in the source data set, which you can compute using the source system, something like a Snowflake or a BigQuery. Or a system like Materialize might even give that to us out of the box.

And that prevents cycles in a lot of ways, because you don’t end up sending every change from a source system back into it. You only send it when there’s actually a change in the model, and the data warehouse or the Reverse ETL brain sort of reconciles that to figure out if there is a change and if a signal needs to be sent over.

So people are always fascinated with the idea of, “Oh, if you’re doing Reverse ETL, you’re going to have these cycles,” but in practice, we haven’t seen them, at least.
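
The idea of only sending changes can be illustrated with a small sketch. This is an assumption-laden toy, not how Hightouch or any particular platform implements it: rows are held in memory as dictionaries keyed by a primary key, and `diff_rows`, `previous`, and `current` are hypothetical names. In practice the comparison would more likely be computed inside the warehouse.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def diff_rows(
    previous: Dict[str, Row], current: Dict[str, Row]
) -> Tuple[List[Tuple[str, Row]], List[str]]:
    """Compare the last-synced snapshot with the current model output.

    Returns (upserts, deletes): only rows that are new or changed are sent
    downstream, and keys that disappeared are reported for deletion.
    """
    upserts = [(key, row) for key, row in current.items() if previous.get(key) != row]
    deletes = [key for key in previous if key not in current]
    return upserts, deletes


# First run: one row changed and one row is new, so two records are sent.
previous = {"a": {"ltv": 10}, "b": {"ltv": 5}}
current = {"a": {"ltv": 10}, "b": {"ltv": 7}, "c": {"ltv": 1}}
first_upserts, first_deletes = diff_rows(previous, current)

# Second run with no changes in the model: nothing is sent, which is why a
# write that echoes back unchanged data does not keep propagating.
second_upserts, second_deletes = diff_rows(current, current)
```

Because an unchanged row produces no outbound write, a value that round-trips through a downstream tool and lands back in the warehouse unchanged does not trigger another sync.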

Arjun Narayan: Yeah. Sounds like the data warehouse is great, and if there’s one gripe that people have, it’s that maybe it [00:37:00] could be faster. Sounds like the world is really crying out for a real-time data warehouse. So much to gain on that.

Tejas Manohar: Everyone should check out Materialize.

Arjun Narayan: I couldn’t resist that final quip.

Tejas Manohar: If data warehouses had truly real-time SQL... I know there’s some almost real-time SQL; people are doing streaming inserts in BigQuery, stuff like that. But truly real-time SQL, processing those queries as the data comes in, would be very exciting for a lot of our customers, at least, and definitely for the whole.

Arjun Narayan: Yes. I think the one big learning for me in paying close attention to this area for many years is that SQL is not going away. If anything, it’s unleashing an order-of-magnitude increase in the kinds of things folks can build, because you can build pretty much anything now with just SQL.

And that, I think, has empowered a lot of folks.

Rachel Bradley-Haas: Oh, I was just going to say one last thing. The only other gripe I think people have against the data warehouse, [00:38:00] besides the speed thing, is solved a lot by Reverse ETL: the fact that not everybody has the SQL skill set to get their hands dirty.

And so the last thing I would say is that Reverse ETL solves it for those people who can’t go and query it. They’re able to reuse these things. So many people are like, “I don’t know how to get data. I don’t want to get data.” Now it’s possible to send data to them in the way they want to consume it. So I think those are the two major things that I’ve seen a lot of.

Arjun Narayan: Yes. That’s an excellent point. We’ve assumed that it’s a given.

Rachel Bradley-Haas: Probably not on a high note for Reverse ETL versus speed.
