In the fast-paced world of data engineering, efficiency and coherence are paramount. Databricks Asset Bundles (DABs) have emerged as a powerful tool to streamline the deployment and management of Databricks projects. However, to truly harness their potential, teams must adopt best practices that allow developers to build and deploy in parallel without stepping on each other's work. In this blog, we'll explore the top strategies for leveraging Databricks Asset Bundles effectively, ensuring your team stays on track and your projects remain scalable and maintainable.
Databricks Asset Bundles (DABs) are a structured way to package and deploy Databricks resources, including notebooks, workflows, and configurations. They enable teams to version-control their projects, automate deployments, and maintain consistency across environments. Think of them as a blueprint for your Databricks projects, ensuring that every piece of the puzzle fits together seamlessly.
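To make that blueprint concrete, here's a rough sketch of a bundle's entry point, the databricks.yml file at the root of the repository (the project name and folder layout are assumptions, not requirements):

```yaml
# databricks.yml — the blueprint for the project
bundle:
  name: my_project

include:
  # Jobs, pipelines, and model definitions live in separate YAML files
  - resources/*.yml
```

Every resource the bundle references is version-controlled alongside this file, which is what keeps deployments consistent across environments.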
Forget long, complex CI/CD pipelines: deploy project code by running one command. Simply run `databricks bundle deploy` to deploy notebooks, pipelines, workflows, and ML models to Databricks.
A key design pattern in data engineering projects is to use at least two environments, development and production (with perhaps an extra staging/test/UAT). By default, when you create a DAB using the CLI's `databricks bundle init` command, a target called ‘dev’ is created that looks like so:
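A sketch of the generated target (the workspace host is a placeholder for your own):

```yaml
targets:
  dev:
    # Created by `databricks bundle init` by default
    mode: development
    default: true
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net
```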
The issue here is that we may expect the target named ‘dev’ to correspond to the development branch and host the code for our development release. By default, however, this is not the case.
The "mode:development" in DABs prepends resource names with a dev prefix including the developers name, allowing deployments of the same code for each team member. This is not what we want to happen for the code corresponding to the development branch.
In contrast, a “mode:production” enforces stricter validations and typically uses service principals for deployments, so there is one code deployment per target. Developers cannot directly publish to these targets.
A way to overcome this is to define a “user” target, which acts as a pre-development environment. This target should be marked as `mode: development` in the YAML, with all other targets set to `mode: production`. This means we can deploy our development code to mirror the development branch in source control through CI/CD. I've seen this confuse many people; just think of `mode: production` as deploying through a service principal. So instead of the above, you should have something that looks more like:
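A minimal sketch, assuming an Azure Databricks workspace and service principals for the shared targets (hosts, paths, and service-principal IDs are placeholders):

```yaml
targets:
  user:
    # Personal sandbox: resources are prefixed with the developer's name
    mode: development
    default: true
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net

  dev:
    # Mirrors the development branch; deployed by CI/CD as a service principal
    mode: production
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net
      root_path: /Shared/.bundle/dev/${bundle.name}
    run_as:
      service_principal_name: 00000000-0000-0000-0000-000000000000

  prod:
    # Mirrors the main branch; deployed by CI/CD as a service principal
    mode: production
    workspace:
      host: https://adb-2222222222222222.2.azuredatabricks.net
      root_path: /Shared/.bundle/prod/${bundle.name}
    run_as:
      service_principal_name: 11111111-1111-1111-1111-111111111111
```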
As shown above, you can define environment-specific variables and configuration inside the databricks.yml file, managing the settings for every deployment environment in one place. This ensures that your code behaves consistently across development, staging, and production environments whilst minimising code changes; for example, enable automated pipeline triggers in production whilst keeping development environments ad hoc.
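As a sketch of per-target variables (the variable name and catalog values here are assumptions for illustration):

```yaml
variables:
  catalog:
    description: Unity Catalog to read from and write to
    default: dev_catalog

targets:
  dev:
    variables:
      catalog: dev_catalog
  prod:
    variables:
      catalog: prod_catalog
```

Resources can then reference the value as `${var.catalog}`, so the same job definition points at the right catalog in each environment.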
Key Configurations to Include:

- `mode` (`development` or `production`) for each target
- Workspace host and root path per environment
- Variables for environment-specific values such as catalog or schema names
- Job schedules and triggers: automated in production, paused or ad hoc in development
Remember that any YAML defined inside the databricks.yml file overrides whatever is written in the underlying resource files, which enables us to override with environment-specific configuration.
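As an illustrative sketch, assuming a job called `daily_etl` is defined in the bundle's resource files, production could switch its schedule on like so:

```yaml
targets:
  prod:
    mode: production
    resources:
      jobs:
        daily_etl:
          # Production-only override: run the job every morning at 06:00 UTC
          schedule:
            quartz_cron_expression: "0 0 6 * * ?"
            timezone_id: "UTC"
```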
Find out more about parameterising asset bundles on my previous blog here.
Version control is crucial for Databricks Asset Bundles. Use Git (and GitFlow for efficient branching) to track changes, enable rollbacks, and facilitate collaboration. Two main strategies exist: keeping all bundles in a single repository, or giving each bundle its own repository.
The best choice depends on your project's structure: a single repo for tightly coupled workflows, separate repos for independent ones.
Arguably the biggest benefit of using DABs is the ability for developers to deploy asset bundles with one line. Use this feature to transform code reviews by enabling easy, isolated deployments: simply run `databricks bundle deploy` on the feature branch under review to deploy a complete, named instance of your changes. This allows reviewers to run the code, inspect the generated workflows, and validate behaviour in isolation.
Make this a standard part of your pull request process for better code quality and team alignment.
A common request I get working with clients new to DABs is:
“Can I make edits directly in deployed Databricks notebooks or workflows? This feels faster than going through a DAB deployment."
Often coupled with... "It's only a small change..."
It's tempting to quickly fix issues directly in deployed Databricks notebooks. However, this seemingly minor shortcut introduces significant risks: the next `databricks bundle deploy` will overwrite your edits, the workspace drifts away from what is in source control, and untested changes reach shared environments without review.
Always develop and test in your local environment or a version-controlled system before deploying via Databricks Asset Bundles. This ensures traceability, consistency, and maintainability.
Integrate Databricks Asset Bundles into your CI/CD pipeline to automate deployments. Tools like GitHub Actions, Jenkins, or Azure DevOps can be used to trigger deployments whenever changes are pushed to specific branches.
Sample CI/CD Workflow:
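A minimal sketch using GitHub Actions (branch names, secret names, and targets are assumptions; the same pattern applies in Jenkins or Azure DevOps):

```yaml
# .github/workflows/deploy-bundle.yml — a sketch, not a drop-in pipeline
name: deploy-bundle

on:
  push:
    branches: [develop, main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Service principal credentials stored as repository secrets
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_SP_CLIENT_ID }}
      DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_SP_CLIENT_SECRET }}
    steps:
      - uses: actions/checkout@v4
      # Installs the Databricks CLI used by the bundle commands
      - uses: databricks/setup-cli@main
      - name: Validate bundle
        run: databricks bundle validate
      - name: Deploy to dev
        if: github.ref == 'refs/heads/develop'
        run: databricks bundle deploy -t dev
      - name: Deploy to prod
        if: github.ref == 'refs/heads/main'
        run: databricks bundle deploy -t prod
```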
Run the `databricks bundle deploy` command as part of the CI/CD. By integrating the DAB release into your existing CI/CD, we shorten the development cycle for each developer, while also ensuring at the project level that code is tested and that code-quality requirements are upheld throughout the Databricks workspace.
When multiple developers are working on the same Databricks project, conflicts and inconsistencies can arise. Avoid them by giving each developer their own isolated `mode: development` target, keeping every change in version control, and releasing to shared environments only through CI/CD.
Remember, the key to success lies in consistency, automation, and clear communication. Start implementing these best practices today and watch your Databricks projects thrive!
If you want to understand how to speed up your team's Databricks deployments, reach out to us!