Balancing speed and quality in analytics engineering through automated CI checks: Insights from FINN's data team from Coalesce 2023
Chiel Fernhout from Datafold joins Jorrit Posor and Felix Kreitschmann from FINN to discuss balancing speed and quality in data engineering.
"Trust is often built by how much pain your end consumer is feeling, and if they feel less pain, they trust you more easily, and that's something that you will build over time."
- Chiel Fernhout, software engineer at Datafold
Jorrit and Felix explain how their company, FINN, used automated checks to supercharge its analytics and data engineering work. The group also discusses the journey of implementing those checks and the impact they had on FINN's team and organization.
The importance of automated data testing tools for maintaining data quality and speed
The team members from FINN discuss the challenges they faced in maintaining data quality and speed at a fast-growing company, and how automated data testing tools like Datafold helped. "We're in what you call a high-growth scaleup...," says Felix, FINN's Senior PM of Data.
Datafold, an automated data testing tool, is highlighted for its ability to compare two versions of a dataset and identify the differences, thereby ensuring data quality as the company grows. Chiel, a software engineer, notes, "Datafold is an automated data testing tool which integrates nicely throughout your workflow. What it does is it compares two versions of a dataset to identify the differences…It's like a Git diff, but for your data."
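The "Git diff, but for your data" idea can be sketched in a few lines. This is purely an illustrative toy, not how Datafold works internally (production tools diff keyed tables at warehouse scale), but it shows the core comparison:

```python
# Toy sketch of a "data diff": compare two versions of a dataset by
# primary key, reporting added, removed, and changed rows.
# (Hypothetical illustration only -- not Datafold's implementation.)

def data_diff(old_rows, new_rows, key="id"):
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    return {
        "added":   sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in new if k in old and new[k] != old[k]),
    }

before = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
after  = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
print(data_diff(before, after))
# {'added': [3], 'removed': [1], 'changed': [2]}
```

Surfacing exactly this kind of summary on a pull request is what lets a reviewer see, before merging, whether a model change touched rows it shouldn't have.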
The group emphasizes the importance of having embedded data teams that work specifically on one domain, such as sales or marketing. These teams were able to accelerate insight generation within their departments, build up knowledge for future projects, and contribute to the speed and quality of the company's data journey.
The implementation of automated checks to ensure code quality and validity
The participants underscore the value of implementing automated checks to ensure code quality and validity. These checks run in GitHub, using SQLFluff for code linting and Datafold for identifying data differences.
They also elaborate on how Datafold checks for the impact of changes on downstream models during pull requests, with Tech Lead of Data Engineering Jorrit stating, "Datafold gives you some kind of interface during pull requests that says, 'Hey, this is the impact of your change, and this is the impact on every downstream model. Please look at it.' And what you get is basically confidence."
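A minimal GitHub Actions workflow along these lines might look like the sketch below. This is a hypothetical configuration, not FINN's actual setup; the workflow name, Python version, and SQL dialect are assumptions:

```yaml
# Hypothetical CI workflow: lint dbt models on every pull request.
name: dbt-ci
on: pull_request

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install sqlfluff
      # Fail the PR on SQL style violations.
      - run: sqlfluff lint models/ --dialect ansi
      # Datafold's data diff typically runs through its own CI
      # integration, commenting the downstream impact on the PR.
```

The point of wiring these checks into the pull request itself, rather than running them after merge, is that the reviewer sees lint failures and data impact in one place before anything ships.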
The need for cultural changes when introducing new rules and policies
Jorrit, Chiel, and Felix emphasize the need for cultural changes when introducing new rules and policies for data management, underscoring the importance of supporting channels and quick response times to reduce friction and maintain the speed of operations.
"Culture-wise, I think it's super important to have some kind of support channel, so whenever someone is blocked by a policy or a CI check that the person does not understand, the person should not be blocked longer than 15 minutes or an hour," says Jorrit.
Jorrit also stresses the need to ensure that the rules and policies being implemented are understood and supported by the team members. He points out that, "The people part is really, really huge here. It's not only a technical solution that you can just throw over the fence and say ‘Here. Deal with it.’ It doesn't work."
Chiel, Jorrit, and Felix's key insights
- Implementing automated checks can help balance speed and quality in data engineering
- These checks can reduce friction in the workflow and increase trust in the team's work
- It's important to have support channels in place when introducing new policies or changes in the workflow
- Tools like SQLFluff and Datafold can help in maintaining code quality and checking downstream dependencies, respectively
- Slowly introducing quality checks and growing them over time can be an effective strategy