Optimizing data pipelines for quality and governance with dbt and Datafold from Coalesce 2023

Gleb Mezhanskiy, Ravi Ramadoss, and Ryan Kelly discuss how Datafold is used to automate data testing.

"We use Datafold to make sure that we are also looking at the impact of a code change, not only the actual code change."

- Ravi Ramadoss, Director of Data Engineering at Moody’s Analytics CRE

Gleb Mezhanskiy, CEO of Datafold, and team members from Moody’s Analytics CRE, Ravi Ramadoss, Director of Data Engineering, and Ryan Kelly, Data Engineer, discuss how Datafold is used to automate data testing and improve data quality. They share their experiences implementing Datafold in their data engineering workflows and how it’s helped them identify and resolve data issues more effectively.

Moody's Analytics utilizes data testing to ensure the accuracy and reliability of its data

The team at Moody's Analytics shares their experience with data testing, emphasizing the importance of data accuracy within their organization. They utilized Datafold to automate their data testing process, which significantly impacted their workflow and business output.

Gleb, the founder of Datafold, shares a personal anecdote from his data engineering days, where a two-line SQL filter he added caused significant issues with company dashboards. "The most interesting fact about this story is not that I was able to corrupt hundreds of tables with a two-line SQL code change, but it took us four hours in the war room, full of senior data engineers, myself included, to actually correlate that change I made to the anomaly we're seeing."

The incident taught Gleb that it is easy to break data systems but hard to trace an issue back to its cause once bad data is in production. Ravi and Ryan of Moody's Analytics share similar experiences and describe how they use Datafold for data testing. Ravi highlights the platform's ability to enrich and transform property-level data, while Ryan emphasizes the increased transparency and confidence in their data changes thanks to Datafold.

dbt tests and Datafold play complementary roles in data testing and quality control

The speakers also highlight the difference between dbt tests and using Datafold. While dbt tests are useful for codifying assumptions about the business and testing for the most important aspects of the data, Datafold provides transparency on the whole range of changes and impacts, including those that dbt tests may miss.

"We have a big slate of automated tests with dbt," explained Ravi. "The main thing we are talking about today is mainly about those checks that go unchecked… that's where Datafold is helping." Gleb adds, "Datafold captures this enormous long tail of things that can actually go wrong and helps you cover the entirety of your changes."

Gleb elaborates further, stating that "the code coverage [of dbt tests] is aspirational," and it's impossible to cover every possible scenario. He explains that Datafold doesn't replace dbt tests, but supplements them by providing visibility into all changes, regardless of whether dbt tests have been written or who has changed the code.
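
The contrast between an assertion-style test and a full data diff can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how dbt or Datafold are implemented: the function names (not_null_test, diff_tables), the property_id key, and the sample rows are all hypothetical. The assertion passes because no rent value is null, while the diff still surfaces a changed value, a dropped row, and an added row.

```python
from typing import Any

Row = dict[str, Any]


def not_null_test(rows: list[Row], column: str) -> bool:
    """Assertion-style check, similar in spirit to a dbt `not_null` test."""
    return all(row[column] is not None for row in rows)


def diff_tables(before: list[Row], after: list[Row], key: str) -> dict[str, list]:
    """Compare two versions of a table row by row, matched on a primary key."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    # Keys present in only one version are added or removed rows.
    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())

    # For keys present in both versions, record every column whose value differs.
    changed = []
    for k in sorted(before_by_key.keys() & after_by_key.keys()):
        old, new = before_by_key[k], after_by_key[k]
        for col in sorted(old.keys() | new.keys()):
            if old.get(col) != new.get(col):
                changed.append((k, col, old.get(col), new.get(col)))
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical property-level data: production vs. the build from a pull request.
production = [
    {"property_id": 1, "city": "Boston", "rent": 2500},
    {"property_id": 2, "city": "Austin", "rent": 1800},
    {"property_id": 3, "city": "Denver", "rent": 2100},
]
pull_request = [
    {"property_id": 1, "city": "Boston", "rent": 2500},
    {"property_id": 2, "city": "Austin", "rent": 18000},  # value silently changed
    {"property_id": 4, "city": "Miami", "rent": 3000},    # row 3 dropped, row 4 added
]

tests_pass = not_null_test(pull_request, "rent")
report = diff_tables(production, pull_request, key="property_id")
print(tests_pass)  # True: the codified assumption still holds
print(report)      # the diff lists the added, removed, and changed rows
```

A codified test only answers the question someone thought to ask. The diff reports every difference between the two versions, which is the "long tail" the speakers describe, and leaves the reviewer to decide whether each change was intended.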

Implementing data testing has improved developer productivity and overall business processes

The team at Moody's Analytics experienced improvements in developer productivity and overall business processes after implementing data testing using dbt and Datafold. They emphasize that having a robust data testing strategy not only caught potential issues before they became problems but also significantly improved their confidence in the data they were delivering.

"From an engineering perspective, with 99% certainty, I know I'm not going to break anything which is really great for me," Ryan says. He adds, "Datafold brings more transparency into this. We're able to test things way more... I'm more confident in the changes, so the business is also more confident because they know what's the downstream impact."

Ryan also emphasizes that data testing has democratized who can review PRs. With the added transparency, the team no longer depends on a small number of people with domain knowledge, and everyone can contribute to various parts of the codebase.

Insights surfaced

  • Datafold automates data testing, ensuring the correctness of data shipped to stakeholders
  • It allows for a detailed review of data changes, helping to identify potential issues before they become larger problems
  • Implementing Datafold has increased the quality of PRs, improved confidence in code changes, and reduced data issues
  • Datafold has improved transparency in data changes, making it easier for those without domain knowledge to review PRs
  • The tool has also helped democratize data, making it easier for anyone to contribute to various parts of the codebase