Maintaining and scaling your dbt repository: Best practices from EQT's data team at Coalesce 2023

"When you have a very large repository and you have thousands of tests, just getting an alert when something fails is not going to be enough."

Erik Lehto, senior analytics engineer at EQT, discusses his experience managing a growing data warehouse and the challenges his team faced as it grew rapidly. He also shares how they refactored their dbt repository and improved their data management practices.

Large repositories require strict coding processes and performance optimization

Erik explains that as a repository grows, maintaining code and keeping performance high becomes increasingly complex. EQT had to implement stringent coding and review processes to keep bad code from being pushed into production.

"Back when I started using dbt four years ago, I was so excited that it was so easy to go from raw data in your data warehouse to a data product… but now when you have a repository with over 2,000 models, it's not so easy anymore," Erik explains. He stresses the importance of improving the developer experience, ensuring a good review process, and finding and eliminating performance bottlenecks.

According to Erik, “focusing on the code ownership and making sure that each domain has their review practices" will prevent bottlenecks and ensure a quality developer experience. Additionally, using model timing capabilities in dbt Cloud to find outliers and improve parallelism can greatly enhance performance.
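One common way to formalize per-domain ownership and review is a CODEOWNERS file (on GitHub, or a platform with an equivalent feature), which automatically requests a review from the owning team when files under their paths change. The sketch below is illustrative only; the directory layout and team handles are assumptions, not EQT's setup.

```
# .github/CODEOWNERS -- illustrative; paths and team handles are assumptions
models/staging/finance/     @example-org/finance-analytics
models/marts/finance/       @example-org/finance-analytics
models/marts/marketing/     @example-org/marketing-analytics
```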

Improve observability and use exposures

Erik emphasizes the importance of improving observability and using exposures, especially in large repositories where thousands of tests run. The ability to track past failures and the overall health of the data warehouse is crucial, and he suggests using the open-source Elementary package to persist run and test results.

"We use the Elementary package. So, this is an open-source package that you can use to persist run and test results to your data warehouse," Erik explains. "When you have tests that are failing often... review if the tests are really necessary or if they just contribute noise."

He also suggests using exposures to quickly find downstream dependencies when deprecating models. However, he cautions that exposures are only as good as the information in them, so they need to be reviewed and updated periodically to stay accurate.
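As a rough sketch of what this looks like in practice, an exposure is declared in a YAML properties file next to the models it reads from; every name, URL, and owner below is hypothetical.

```yaml
# models/marts/exposures.yml -- all names, URLs, and owners are hypothetical
version: 2

exposures:
  - name: revenue_dashboard
    type: dashboard
    maturity: high
    url: https://bi.example.com/dashboards/revenue
    description: Weekly revenue reporting for the finance team
    owner:
      name: Finance Analytics
      email: finance-analytics@example.com
    depends_on:
      - ref('fct_revenue')
      - ref('dim_customers')
```

Once declared, the exposure shows up in dbt's lineage graph, so checking what a dashboard depends on (or spotting models with no downstream consumers) becomes a quick lookup, provided the exposure definitions themselves are kept current.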

Simplifying CI/CD and limiting incremental materialization

When dealing with a large repository, Erik advises keeping continuous integration and continuous deployment (CI/CD) as simple as possible. He also recommends avoiding incremental materialization unless it is genuinely needed, as it can add unnecessary complexity.

"Try to make your branching strategy and your CI/CD only as complex as it absolutely needs to be... it's hard to get everyone on board…following the same principles, following the same guidelines, building their models in a consistent way," Erik explains.

Regarding incremental materialization, he advises, "Don't use incremental materialization unless it's necessary. ... Incremental materialization is great when you need it, but you should follow the best practices from dbt Labs. Start with a view…if that doesn't cut it performance-wise, move to a table, and only then should you go to an incremental materialization, because it's very easy to underestimate the complexity that an incremental strategy can add."
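To make that escalation concrete (view, then table, then incremental only when needed), here is a minimal sketch of what an incremental model typically looks like; the model and column names are made up.

```sql
-- models/marts/fct_events.sql -- hypothetical model; all names are illustrative
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    user_id,
    event_type,
    occurred_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
-- Only applied once the table exists: load rows newer than what is
-- already in the target. Late-arriving data, backfills, and schema
-- changes all need extra handling, which is the complexity Erik warns about.
where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
```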

Erik’s key insights

  • It's important to make it easy for developers to contribute to the data repository, but there should be strict guidelines and a review process to maintain quality
  • Performance bottlenecks need to be identified and eliminated to improve the efficiency of the data warehouse
  • The branching strategy and CI/CD should be kept as simple as possible
  • Exposures in dbt are crucial for identifying downstream dependencies and models that are not producing any business value
  • Improved observability is necessary to track failed tests and the overall health of the data warehouse
  • Incremental materialization should be used only when necessary due to its complexity