
Build It Once & Build It Right: Prototyping for Data Teams

Discover more about the author and dbt enthusiast Alex Viana.

This talk is about how modern data teams working in the “data as product” and “analytics engineering” models can benefit from using prototyping practices common to product and engineering functions — and why it can be hard for them to do so.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

Barr Yaron: [00:00:00] Hi everyone, and thank you for joining us at Coalesce. My name is Barr and I work on product here at dbt Labs. I'll be the host of this session. The title of this session is Build It Once & Build It Right: Prototyping for Data Teams. It'll be led by Alex. First, some housekeeping: all the chat conversation is taking place in the #coalesce-prototyping channel of dbt Slack.

If you're not part of the chat, you have time to join right now: visit our Slack and search for #coalesce-prototyping when you enter the space. Okay, back to Alex. Alex is the VP of data at HealthJoy, a Chicago-based healthcare platform. Alex started his career in astronomy, leading analytics projects and writing data pipelines for the Hubble Space Telescope. So cool.

He then spent five years working at an early [00:01:00] stage startup in the information security industry, focusing on data breaches on the dark web.

After holding leadership positions in engineering and in product, he moved to HealthJoy, where he leads a data team. I am so excited for this. As someone who's worked both in product and in data, I'm much more familiar with the concept of prototyping in product versus in data.

This talk is about how modern data teams working in the data-as-a-product and analytics engineering models can benefit from using prototyping practices that are more common to product and engineering functions, and why it can be really hard for them to do so. After the session, Alex will be available in the Slack channel to answer your questions.

However, we encourage you to ask questions at any point during the session and he’ll get back to them at the end. Let’s get started and I’ll pass it over to you. Okay.

Alex Viana: Hi, thanks for the intro, Barr. An [00:02:00] important piece of bookkeeping before we start: if you read Lauren's email this morning, this talk is represented in potato world as a tater tot. So if you are not here for the tater tot presentation, you are in the wrong room. Not to mix food metaphors here, but hopefully tater tots are your jam, and hopefully this talk will be as well. So let's dive into it and let's talk about prototyping for data teams. Okay, one more thing just before getting into the core problems here. I mentioned in my bio coming from a couple of different areas, a couple of very different industries, and having worked in engineering, product, and analytics.

And I wanted to bring that up because I want to try and pull some of those different viewpoints and perspectives into the problem we're going to talk about today. And so what exactly is that? I like to try and put the punchline at the beginning, as it were, and give the top-level [00:03:00] context.

So everything I'm saying should roll up to this one point. What we're going to be talking about today is why I think you should ensure that you're building the right thing by creating cheap prototypes before you do the real thing. That's it. It seems simple on paper, but actually, in my opinion, it takes a very senior set of communication and data reasoning skills that I think are deceptively hard to nail down, but that I've personally found to be extremely productive once you get used to applying them.

[00:03:29] The Problem #

Alex Viana: So that's the overall arc of where we're going. We want simple, cheap things that prove out, before we start going down the road of building for real, that what we build is going to add value. So let's start our journey on this presentation the same way we would approach building a product: making sure that we all understand the problem we're trying to solve before we get any deeper.

So this is, at its core to [00:04:00] me, a talk about building the wrong thing. Partially that's because I'm just naturally self-deprecating and I like to exaggerate, but I think that's really what we need to be careful about here. I'm not talking about mistakes or bugs, things that might need a few iterations to work out. Those things are important, but we have other tools for those problems.

And there are even other talks at this conference that give good starts on how to address them. This is the talk about avoiding that sinking feeling in your stomach when you're in a big meeting and you are presenting exactly what the stakeholders asked for, only to find that they can't use it. Or they explain to you partway through that

this is actually only the tip of the iceberg, and the problem is actually much bigger. Or they just flat out say, no, this is not what we asked for, we can't use this. Or they tell you that there's some giant assumption you've made that is completely wrong in the context of what we're talking about here.

Those are the things that really keep me up at night, that keep a lot of us up at night, and that's what I'm really targeting here. So this is a talk about building the wrong thing. I think all of those things we just talked about are examples of building completely the [00:05:00] wrong thing: solving the wrong problem, or attempting to solve a problem that actually can't be solved due to a lack of data, a lack of alignment on definitions,

what have you. On one hand, though, this should be impossible, right? We're all smart people, and so are our stakeholders. Between us and them, we have a combination of technical skills and subject matter expertise. We have roadmaps, we have requirements, we have action items and deliverables. We follow best practices.

This just shouldn't be possible. And yet it's something in my career that I've gotten wrong numerous times, let alone the times I've seen it happen to peers I've spoken with. I don't think my experience is unique. I've had a lot of people give me feedback that, yeah, this talk is about something they've experienced too, and I think it merits an entire talk to dig into. And I think there are particular issues around this for data practitioners that we should take special care about, because the data role has evolved into this independent, high-impact role.

In fact, we've been driving this transition, right? I think that's evident everywhere in this conference. [00:06:00] We've asked for two key things, in my estimation. The first is to have an independent, executive-level function, or at least close to it. And the second is a suite of powerful cloud-based tools, such as dbt and others, that give us a level of autonomy and productivity we didn't have even five years ago. But on the downside, we've also gotten what we've asked for. We are now increasingly important and independent, right? And when you get something wrong, you're not just messing up a spreadsheet anymore.

You could have made the wrong choice on an architecture, on a contract to support that architecture, on a business partner, on a well-tested, documented, very clever pipeline that's actually wrong because you had the wrong requirement. Executive stakeholders are finally on the same call after months of wrangling schedules, and you show them something and they say that's not what they expected. Or, my favorite: after months of check-ins and calls, somehow no one seems to know exactly what to build, though everyone seems to be able to agree on everything.

This is the role and responsibility [00:07:00] you've asked for. So I think we really need to take these concerns seriously and take a leadership role in solving them. That's the specific context I'm thinking about for this problem, and I think it's obviously relevant to folks at this conference.

Okay. So as I see it, there's a pretty clear pattern that gets us into these kinds of problems, right? It's not the only way, but a failure mode, let's call it, that I see is that we build stuff to figure out if it was worth building. And for most of us, I think that is kind of our natural instinct. We're excited by our cloud tools, our technology; we like getting our hands dirty with the data; we want to explore.

Maybe there are some new programming techniques. There's probably all kinds of stuff at Coalesce that you're just champing at the bit to go and build. And we're all smart people; I'm sure we can work our way through it or iterate our way through it, so let's just take a crack at it. But let's stop and think: how do you know you're building the right thing?

I think we need to have a certain paranoia. If someone were to stop us and say: why are you building this? No, really, why are you building this? Can you prove that there's going to be value at the end of [00:08:00] this, or at least give some evidence of it? To me, it's a way of valuing the investment of our time and money and labor, and showing that, yeah, it is important at all levels.

So this kind of brings us full circle, right? How do we do this? We want to make sure that we spend our time and energy wisely. So I think we should look at building prototypes. And to be clear: if we're talking about a project on your end that's a few clicks to experiment, to drop in a plot, maybe up to an hour of stuff, you don't need a prototype.

But for bigger things, for things that are new, and even for things that are actually only medium-sized, maybe only a few hours of work, even those could benefit from a little bit of upfront prototyping to make sure that you're heading in the right direction. But that's enough theory. Let's look at an example.

[00:08:47] Story Time #

Alex Viana: So I'm going to tell you a story, and I'm going to be upfront that this is a kind of made-up story. It's not complete fiction; its basis is real, but I've pulled in elements from a number of different projects to give it some depth and some spin. And you'll [00:09:00] also notice, of course, that I've faked the numbers I'm going to use in this story.

But the point of the story is to try and unpack, in some more concrete terms, how it is I think we can get led astray. And there's a bit of detail in here because I think it's important to view this as social and psychological, right? This isn't just some technological thing. It's really the interactions with people and stakeholders that make it hard for us to see the forest for the trees, even though everyone seems to be trying.

So the story I'm going to tell is this: some of my colleagues came to me wanting to know if we could understand how many people were using our chat feature at HealthJoy. And I told them I could, right? I didn't have access to any of the data yet, but I played up this new analytics engineer role I was working in.

This is where I could really shine, because I could get everything together in just two weeks. We didn't need any sprints, and we didn't need to put Post-it notes on the windows or anything. I got this. And they were impressed, as I hoped they would be; two weeks sounded great to them. And I went to work: talking to our infrastructure team, getting [00:10:00] holes in the firewall, getting the database connectors, spinning up some dbt DAGs, then modeling it in my BI tool and finally creating a dashboard. So the night before presenting, I'm working on this and putting together slides, and I'm getting this sinking feeling in my stomach.

I'd put together this great pipeline, and I had these slides explaining all the great stuff I'd done, but when I got to the analytical part of it, I was a little less sure. I'd come up with this average of five users per day. Obviously, again, I'm faking these numbers. But it didn't quite seem right. Still, I felt like I had done everything right, so I showed it to folks. There was some shifting in the chairs, and everyone thanked me for my hard work. But what they told me they really needed was a daily time series: the number of users per day. They wanted to see trends in the daily usage.
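
As a sketch of what the analytical heart of that whole two-week pipeline might boil down to, here's a minimal pandas version with made-up events and illustrative column names (not HealthJoy's actual schema):

```python
import pandas as pd

# Made-up chat events standing in for the real warehouse data.
events = pd.DataFrame({
    "chat_started_at": pd.to_datetime([
        "2021-11-01 09:15", "2021-11-01 10:02", "2021-11-02 14:30",
        "2021-11-02 14:45", "2021-11-03 08:20", "2021-11-03 16:10",
    ]),
    "user_id": [101, 102, 101, 103, 104, 105],
})

# Distinct chat users per day, collapsed into the single headline number.
daily_users = events.groupby(events["chat_started_at"].dt.date)["user_id"].nunique()
print(daily_users.mean())  # the "average of five users per day" style answer
```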

And I said, you know what, that totally makes sense, but I actually need new data for that. I'm sorry, I just didn't model this in a flexible way. And I've got some other projects; can we meet back in a week? And they were like, yeah, totally, that's fine. I was a little embarrassed, but people were like, look, this [00:11:00] data didn't even exist two weeks ago. You're doing great. This is no big deal. We have other work too; a week from now is great. Cool.

We come back, now we're on week three, and I say: all right, look, now I understand what you want, and I feel much more confident because you explained it to me. Here's your time series. And you know what? I felt bad,

so I did a little button-click machine learning and threw a regression on there, so there's a line of best fit. And I felt much more confident now. I had done my modeling and documentation, I'd written some tests in there, and I'd pulled in some extra data. So now if they wanted to see things in terms of weeks or months or quarters, I'd gotten ahead of all that, right?
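
Under the hood, that "button-click machine learning" is nothing more exotic than a first-degree fit. A minimal sketch, assuming numpy and invented daily counts:

```python
import numpy as np

# Invented daily user counts standing in for the real time series.
days = np.arange(14)
users = np.array([4, 5, 6, 5, 4, 7, 5, 6, 4, 5, 6, 7, 5, 6])

# The "line of best fit": an ordinary least-squares fit of degree one.
slope, intercept = np.polyfit(days, users, deg=1)
print(f"trend: {slope:+.3f} users per day, starting near {intercept:.1f}")
```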

So with a couple of button clicks we could show them different graphs, and my stakeholders were happy now. And they started talking about the missed chats, and the anecdotal evidence they'd had of missed chats. And I was confused, because I'd never been asked about missed chats before. And they said, oh, what they really cared about is these missed chats: they want to see if there's any correlation between the number of users and missed chats. Maybe we just didn't have enough agents during those times, the concierges [00:12:00] on the other end working the chats. And I told them, okay, I actually have no idea about that. I need to go talk to engineering; maybe we can get some more data on this.

And it's going to take some new pipeline connectors, and I actually have some time off coming up. Can we have two more weeks? And they were like, yeah, that's fine. Actually, half of them had already gone because they were triple-booked, and they were just scattered to the wind. So, two more weeks; I'll be back.

So then we're on week five of the project. I talked to engineering, I pulled in more data, more tests. We come back together, and I'm starting to feel really good. I'm proud of this pipeline; I keep refining it. And now I have a better and better sense of what they want. I've got these two data sets now: one is the number of users, the other is the missed chats. And there's a secondary plot where I plot these two variables against each other. And I was like, wow, this is it. Now stakeholders are super engaged, but they're disappointed. They didn't see the trend between the number of users and dropped chats.
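
The week-five deliverable reduces to two aligned daily series and the question of whether they move together. A hedged sketch with fabricated numbers:

```python
import numpy as np

# Fabricated daily counts for the two data sets in the story.
users        = np.array([4, 6, 5, 8, 7, 5, 9, 6, 5, 7])
missed_chats = np.array([1, 2, 1, 3, 2, 1, 4, 2, 1, 2])

# The secondary plot is just users vs. missed chats; the real question
# stakeholders were asking is whether the two series move together.
r = np.corrcoef(users, missed_chats)[0, 1]
print(f"correlation between users and missed chats: r = {r:.2f}")
```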

And I was like, no, you mean missed chats? And they're like, oh no, we mean dropped chats. We've recently become concerned [00:13:00] that we're actually dropping these at the engineering layer. We were hoping to see some trend with the number of users or something like that, something where you could use that correlation to confirm our doubts.

Do you know if we can actually answer this question? And I was like, yes, I do know the answer. The answer is no. I've talked to our engineering team extensively, and we don't capture that telemetry anywhere. Like, we don't track that. If you want to answer that question, you have to go talk to our engineering team and put something in our engineering telemetry.

This actually isn't something I can help you with. And they were like, you know what, that's exactly right. This is an engineering issue. Thank you so much for your hard work. This makes total sense. You've really helped us explore this. We're going to follow up with engineering, and this is great. Thank you so much.

Now, this was an interesting story, but I think it's a disaster. At no point did I calculate anything incorrectly; in a technical sense, I made no mistakes. And even more than that, I delivered exactly what was asked for at every step. As we love to say in our industry, we were in sync.

And yet this was a disaster. This was a red herring. The ultimate result is that we [00:14:00] didn't have the data we needed. This was five weeks of elapsed time, and maybe a week of analytics engineering work, to end up telling the decision makers to go look somewhere else. And the tools that analytics engineering and the modern data stack give us can be wasted if we're running in the wrong direction.

But if you think about it, we should be able to avoid this result. Really think about it: we didn't actually have to calculate anything to figure out that this was a dead end. We had all the experience we needed on the first day to arrive at the same conclusion we ended up at anyway, but with much less cost.

Like I said, I fabricated this story a little bit, I stretched it out, but I was in this meeting, and we were heading in that direction. I've gone down that road many times, and so have people I know: the direction of building something that doesn't have any immediate business value. So here's what happened instead. When we were at this first stage, we were already getting deep into technical requirements.

People were talking about data sources. Will this be real time? Can it be real time? Are you sure it can't be real time? Defining who's responsible for what, what the engineering [00:15:00] timelines were. Everyone was doing a great job coordinating and being proactive, but something just felt wrong to me. So I interrupted the conversation and said, okay: what if I tell you the average number of users is five? I'll get into more of this later, but as far as I'm concerned, this is a prototype. I didn't say, what if I give you an average? I picked a specific number and I asked them to respond to it.

I said, tell me how you would use that value of five. Interact with that data right here in front of me, in real time. Pretend this is a tiny user test, right? I could have gone to the whiteboard and drawn this. I could have done this in Excel if I wanted to. In this situation, I just wanted to softball this out there to get the conversation going.

And of course they were like, no, we need a time series. But I wasn't satisfied. I was like, okay, let's say I give you a time series. It goes up and it goes down, but the average is still five. How are you going to use that? Tell me exactly what you're looking for. If you're looking for numbers that are high or low, tell me how you would know if they're high or low, or if you're looking for a max or an average or a trend.
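
In code, that softballed prototype is almost embarrassingly small, which is the point. A sketch, assuming pandas and numbers invented on the spot:

```python
import pandas as pd

# Numbers made up in the meeting: it goes up, it goes down,
# and the average is still five. That's the entire prototype.
fake_users = pd.Series(
    [3, 7, 5, 2, 8, 4, 6],
    index=pd.date_range("2021-11-01", periods=7, name="day"),
)
print(fake_users)
print("average:", fake_users.mean())  # 5.0 -- now, how would you use this?
```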

Tell me what you would compare that to, to figure out if it's a trend. Where would that data come from? I was [00:16:00] really specifically trying to walk them through this. And so when we got to this next discussion, I was really tempted to back down, but I had to stop myself. When we started talking about comparing users and missed chats, I had to be the one raising my hand and saying, I don't understand.

I don't understand how you get from A to B to C, and I'm sorry we keep going backwards, but I need you to walk me through it. And this is when we finally got to this potential underlying engineering issue about dropped chats. That was actually the root cause of the concern: chats not even leaving our network.

There was no data on that, and this was an unsolvable problem. There's literally nothing to be done. And when I realized this, I felt a breeze, like a bus had just missed me on the street, right? We would have built the fictional pipeline described earlier in this story, I'm almost certain of it. We were doing a great job of planning it. And we probably still have to build that pipeline, to be honest, because this is important data; it's valuable to our business. But now we can build that pipeline when it makes the most business sense,

not because of a misunderstanding of the problem. And we [00:17:00] now have a much richer understanding of how we would use this data. And I think this somewhat fictionalized example is important because, and I can't highlight it enough, I work with smart and motivated people. We were ready to just plan the life out of this and build it.

[00:17:14] Some observations #

Alex Viana: How did we all miss this? It's clear that it's not about intelligence or experience; some of these people have been working on this system for years. So what's going on? All right. I've been working on this idea of leaning more into prototypes, and how we can make better use of them, for over a year now.

And I have some ideas from the things I've seen. The first, and maybe the most important one, is that I think data is tactile. It's something you really need to be hands-on with to understand. It's very hard to think in theoretical terms about data and data products. And I feel like this has really become a fundamental point for me, more and more.

And that's why I have such a strong preference for making these seemingly silly little spreadsheets and thought examples and whiteboard situations, right? It's such a light way of putting [00:18:00] something in front of someone and saying, am I on the right track? And by the right track, I don't mean something very vague like, oh, I want an average, or I want to see a trend.

I'm talking literally: I'm going to take these two columns and multiply them together, and I'm going to get this third column. Can you use this third column? Show me how you're going to use this third column. So when I'm talking about prototyping, it's really about how I can get something in front of people that they can respond to as quickly as possible.
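
To make "literally" literal: the prototype can be as small as one derived column. The column names below are hypothetical, purely to make the operation concrete:

```python
import pandas as pd

# Two columns multiplied together to get a third. The only question
# for the stakeholder: can you use this third column? Show me how.
df = pd.DataFrame({
    "daily_sessions":    [120, 95, 140],
    "chats_per_session": [0.05, 0.08, 0.04],
})
df["expected_chats"] = df["daily_sessions"] * df["chats_per_session"]
print(df)
```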

It's about getting that response, and getting a conversation going about something concrete. And I almost think about this as being like UI/UX. You can describe a design to me as many times as you want, but when you actually put even a very loose mock-up of the design in front of me, we're having a different conversation. There's just no other way to put it: it's a different conversation.

It’s a different conversation. And so it’s also worth saying, why do we need to do this? I think some really good pushback on this would be to say that, don’t we just need better requirements. Can we just get better at requiring requirements gathering? And I think [00:19:00] the answer to that is just, it’s just really hard.

It's really hard to get people to talk about and articulate their problem. It's much easier to give them something and ask, yes or no, does this solve your problem? Does this solve your problem? Does this solve your problem? The act of articulating the problem is really hard, which is something an experienced product manager can tell you about at length. It's really hard to chase all these things down and get to the bottom of some of this.

And again, I think the hands-on aspect makes this a tactile conversation. It gets people on the same page much, much more easily. And there's a specific thing I think trips up data people here: as data professionals, we are so invested in having our data be as close to perfect as possible.

Is it up to date? What's the data provenance? Is it tested? Is it documented? How fresh is it? All of this stuff. And that is super important, but it's not what we need here, right? When I talk to people about this, a lot of them think, oh, maybe I just need to [00:20:00] build a prototype by building the real pipeline really fast,

or maybe I need some very elaborate, statistically faked data. You don't. You just need something to drive the conversation. The data can be real: I've used real data from other product features that I think are representative, or pulled data from parallel data sets to look for data quality issues. But it really can be just some stuff you made up to drive the conversation. And stakeholders like this.

They like that you're giving them something real that they can almost grab and say: yes, build me that, I can use that. They don't care if the data's faked. And part of the point is that you can't know if your idea will work; that's not the bar here for a useful prototype. You want to validate that it's possible for your idea to work.

You want to say, I've at least cleared the road of the most obvious things that could go wrong. And so the things that I'm worried about are these failures that seem obvious in hindsight. Your data is too messy to solve the problem, and no one thought to look at it. Your output doesn't [00:21:00] completely solve your real user's problem.

There's a misunderstanding around requirements. You can't connect the datasets. No one knows the answers. You can ask all these questions, but again, I think making something small, almost tiny, physical, and putting it in front of people is worth a thousand words. So let's skip to another story here.

[00:21:17] Another Story #

Alex Viana: I want to give another example of how you can work up to building one of these tater-type prototypes. Lauren's really getting a brand going here. So this is something that my team is actually still working on, and it's not going to be perfect or anything like that. This is really how we approached some of this.

And I like it because it shows how you can start with very little understanding of a problem space and very little work, and get to something that, I have found, generates much more valuable discussions with my stakeholders, saving a lot of time and work not done. So the problem we're going to talk about is annual recurring revenue, right?

So ARR is a really tricky beast, because this is not a finance function. This is not getting your accounting in order for taxes or for an audit or [00:22:00] anything like that. This is about how your business wants to recognize what it considers revenue, what it considers recurring revenue, and how it wants to track that on an annual basis.

This is entirely particular to how your company wants to measure how it runs itself. Put another way, this is all corner cases. I think it's an excellent use for prototypes. So the way I started working on this project, to be honest, after dead-ending a couple of times, was to just start from first principles. I drew up some fake companies. I drew up some fake contract values. And it was like, okay, what do I need to calculate ARR here? I need a date: I need a contract start date. Okay, if we have a contract start date, we can start to make a time series. This is probably getting closer, but I know contracts end, okay, so I'll add an end date to this and put a filter on it.

Things drop off when the contract ends. And I also know from talking to salespeople that contracts don't just start: they're signed, and the company cares about when they're signed. So we have not just the start date for the service, but the date when [00:23:00] things are contracted. And now, when you have dates when revenue is contracted but not yet live, you have this thing called CARR: contracted ARR, right? And so even from this simple point, and this is first principles, nothing fancy, I already have a graph that I can start to show some stakeholders and talk about.
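
A first-principles sketch of that prototype might look like the following, with fake companies and deliberately simplified rules (live contracts count toward ARR; contracts signed but not yet started add the CARR gap; these rules are illustrative, not a complete revenue model):

```python
import pandas as pd

# Fake companies and fake contract values, drawn up from first principles.
contracts = pd.DataFrame({
    "company":      ["Acme", "Globex", "Initech"],
    "annual_value": [120_000, 80_000, 50_000],
    "signed":       pd.to_datetime(["2021-01-15", "2021-03-01", "2021-05-20"]),
    "start":        pd.to_datetime(["2021-03-01", "2021-04-01", "2021-08-01"]),
    "end":          pd.to_datetime(["2022-03-01", "2022-04-01", "2021-12-01"]),
})

rows = []
for month in pd.date_range("2021-01-01", "2021-12-01", freq="MS"):
    live   = (contracts["start"] <= month) & (contracts["end"] > month)
    booked = (contracts["signed"] <= month) & (contracts["start"] > month)
    rows.append({
        "month": month.strftime("%Y-%m"),
        # ARR: contracts currently live; things drop off when they end.
        "arr":  contracts.loc[live, "annual_value"].sum(),
        # CARR: add revenue that is contracted (signed) but not yet live.
        "carr": contracts.loc[live | booked, "annual_value"].sum(),
    })
print(pd.DataFrame(rows).set_index("month"))
```

Even this toy version surfaces the corner cases stakeholders will ask about: the CARR-to-ARR gap in the early months, and the drop-off when Initech's contract ends in December.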

And what they would do is start asking questions. What's the gap between CARR and ARR? Will you be able to measure that? Can you show me a graph that tells how long it takes for contracted revenue to be realized? And what about this one company that dropped out? They signed, but then they backed out for some reason. Will we be able to capture that?

And what's that big drop-off there? Is that contracts that have actually been canceled and not renewed? Is that churn, because I need to be able to measure that? Or is that just a contract that hasn't been renewed yet? And does your model allow you to track things that have been renewed?

All of these are distinct points that they care about, and it would be really hard to get them to list them out cold. When they see something like this, they can [00:24:00] tie it into their subject matter expertise and start asking you specific questions: no, this isn't right, I need to see A and B and C and D. And in my experience, they would have left at least one of those things out had you not put this in front of them.

If you were to do an ARR model, you're probably going to end up with something that looks like this, with all these different categories. Claire Carroll has a great article about how to approach this. But I think you really need to get these details, this prototype, these little corner cases, because it has wrecked my team when we made assumptions

and did not do a good job of coordinating. When we started going back to prototyping, things went much more smoothly. So, bringing this back home: I've been shopping this idea around for about the last year, and one of the things I like about it, from the feedback I've gotten from people, is that it resonates with different groups in different ways.

And I think that's some evidence that there is something useful going on here. So my background is in science, and I've found that scientists really like this because it's compatible with the scientific method. What you're really doing here is experimental design, and you're trying to figure out: have I designed an [00:25:00] experiment,

can I run an experiment, that is going to address the hypothesis I'm trying to investigate? And it begets these questions: do I understand my hypothesis enough? Do I understand my experiment enough? Not everyone is comfortable using this language, but I think these elements underpin the idea, even if informally. I know there are a lot of us former academics in the data world, so the people thinking about it this way may be familiar with it. Product managers like this because it's very compatible with lean product methods. So if you've ever read Marty Cagan from Silicon Valley Product Group, who wrote Inspired, or Dan Olsen, who wrote The Lean Product Playbook, or The Lean Startup by Eric Ries:

that school of thought is all about how quickly I can churn through ideas. Ninety-eight out of a hundred ideas are going to be bad, and my task is not to build a hundred things; it's to figure out which of those things are bad as fast as possible. So being able to throw out quick prototypes and ask "will this work?" instead of building five complete pipelines really accelerates the value you can deliver to the company. Product managers speak that language. They [00:26:00] get it.

Analytics engineers like this because, underneath all of it, what I'm doing, what I'm thinking about, is building a data model, right? If you subscribe to the idea of an analytics layer, this thing that I'm building, the spreadsheet that underpins the visualizations I'm showing stakeholders, is what I'm going to build to.

That's what I'm going to write my tests to. That's even what I might seed my tests with. This is my internal set of requirements: an actual, physical thing I'm going to build to, and my analytics layer is part of my analytics engineering process.
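
One way to carry that over in a dbt workflow, sketched here with hypothetical paths and names: write the agreed-on prototype table to a CSV that a dbt project can load as a seed, so the numbers stakeholders approved become fixture data to build and test against:

```python
import os
import pandas as pd

# The little table the stakeholders pointed at and said "build me that."
prototype = pd.DataFrame({
    "month": ["2021-01", "2021-02", "2021-03"],
    "arr":   [0, 0, 120_000],
    "carr":  [120_000, 120_000, 120_000],
})

# Saved as a CSV, it can serve as a dbt seed: a physical, version-controlled
# statement of the internal requirements the real model has to reproduce.
os.makedirs("seeds", exist_ok=True)
prototype.to_csv("seeds/arr_prototype.csv", index=False)
```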

And executives like this because they get to be precise and explicit. They don't feel like they have to parse and belabor their language; they can just point to something and say, that, I want that, even if it's just a silly thing we made in spreadsheets. In the world of magic, the last part of a magic trick is called the prestige.

And so we have now arrived at the prestige. You may have come to Coalesce to get away from spreadsheets; I've heard people say they want to be the spreadsheet killer. But much like a data Lorax, I'm here to speak for the spreadsheets. I think there's tremendous value in using them as these little whiteboarding exercises.

That's all I had planned to cover, but I can't go without thanking a bunch of people. I've been working on this idea for about a year now in various forms, and a lot of folks have helped me. Eric, Lou, and Dave Connors sat through this whole talk and gave me tremendous feedback and encouragement.

Thanks to Rosie Cardozo and the local Chicago meetup, who sat through an abbreviated version of this. The entire HealthJoy data team has had to put up with me reenacting the crazy Charlie Always Sunny meme about this stuff all year: that's Brendan, Fred, Whitney, and Joe.

Gleb over at DataFold allowed me to give the first version of this talk at his data quality meetup back in April. It was very rough and very different, so thanks for taking a chance on me. And my friends Claire, Ryan, and Jackie have all given me feedback and encouragement on this.
