Introduction
Databricks Asset Bundles (DAB) are a way to package and deploy data engineering and analytics artifacts in a structured, repeatable, and version-controlled manner. They address common challenges in deploying and managing Databricks resources in large, collaborative environments. DABs essentially streamline the deployment of Databricks artifacts, improving both efficiency and reliability in modern data engineering projects.
They allow you to bundle notebooks, workflows, configurations, and dependencies into a single deployable unit. Asset bundles support parameterisation and configuration management, making it easy to tailor deployments for specific environments or use cases.
Specifically, with Delta Live Tables (DLT), using asset bundles can reduce the complexity of deploying pipelines to higher environments and cut development overhead, giving developers more time for the tasks that matter.
The Problem
One feature of DABs is that they enable developers to take a copy of the code and push it to their development workspace without any code changes. Pipelines get prefixed with the developer’s name and are stored in their user workspace, meaning that when working on features and bugs, code can be pushed to this space without going through CI/CD processes (saving time!). From the command line I can simply run `databricks bundle deploy` and a copy of the code will be deployed into my user area of Databricks, with copies of any pipelines prefixed with my name, as illustrated in the sketch below.
As these pipelines are defined in yaml, pipelines that have been pushed through CI/CD processes using the same yaml will exist either with a prefix or under the plain pipeline name, depending on how the Databricks.yml file has been configured. This means there are now multiple pipelines, all of which write to the same target tables.
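As a rough sketch of how this plays out (the pipeline and schema names are reused from this post; the usernames and catalog are hypothetical), a single yaml definition produces multiple deployed copies:

```yaml
# resources/dlt_bronze_ingest.yml - one pipeline definition (sketch)
resources:
  pipelines:
    dlt_bronze_ingest:
      name: dlt_bronze_ingest
      catalog: main      # assumed Unity Catalog name
      target: bronze     # every deployed copy writes to the same schema

# After `databricks bundle deploy`, the deployed copies might be:
#   developer A (mode: development): "[dev alice] dlt_bronze_ingest"
#   developer B (mode: development): "[dev bob] dlt_bronze_ingest"
#   CI/CD       (mode: production):  "dlt_bronze_ingest"
# All of them point at the same bronze tables.
```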
DLT limitation
When a DLT pipeline creates a table, this table can only be written to by the pipeline that created it. If another pipeline tries to write data to this table, the pipeline will error.
If the pipeline from above, dlt_bronze_ingest, is writing data to my_first_table, then when a developer deploys this to their workspace and runs the pipeline (if it has already been deployed in the Dev workspace, or a second developer is testing changes), they will hit an error stating that the table is already managed by another pipeline:
Therefore, if developers are working in the same development workspace (which they will be 99% of the time), this will result in errors when running the pipelines and cause development delays, as tables will need to be deleted or fully refreshed on every run, which is not sustainable.
CI/CD with DABs
In production systems we will want to deploy our asset bundles through CI/CD. However, in the shared development workspace (as developers are going to be working here), similar issues arise: when a developer deploys their pipeline and CI/CD also deploys the DAB, multiple pipelines end up trying to write to the same table. As we know, this results in a `failed to update` error.
DAB: Example DLT Pipeline Code
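A minimal sketch of such a pipeline definition, assuming a hypothetical notebook path, node type and catalog name:

```yaml
resources:
  pipelines:
    dlt_bronze_ingest:
      name: dlt_bronze_ingest
      catalog: main                         # assumed Unity Catalog name
      target: bronze                        # tables land in the bronze schema
      libraries:
        - notebook:
            path: ../src/bronze_ingest.py   # hypothetical notebook
      clusters:
        - label: default
          node_type_id: Standard_DS4_v2     # fairly large worker (assumed SKU)
          autoscale:
            min_workers: 4
            max_workers: 8
      photon: true                          # photon enabled for heavy processing
      continuous: false
```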
The above is an example of a simple pipeline definition for a DLT pipeline, written in yaml. The key areas are that we are saving to the bronze schema in Unity Catalog, and that we have between 4 and 8 fairly large workers with photon enabled, which is great for processing large amounts of data (volumes more often found in production than in development workspaces).
This yaml ensures that developers can repeatably deploy and test their work in their workspace by running `databricks bundle deploy`. The code can also be deployed by CI/CD processes. As explained previously, developers and CI/CD will both be deploying this code to the development workspace, so multiple copies of the pipeline will exist, all writing to the same tables. The `failed to update` error will therefore be hit when a second developer tries to run their deployed code.
The Solution
We can override DAB assets’ configuration within the Databricks.yml. This is a very powerful feature that can be used to change any setting based on the target environment. A common deployment approach uses three environments: Dev, UAT/Test and Prod. This means we can change any configuration of assets for each environment if needed.
As these environments are generally standard practice, I suggest adding an additional target that lives outside of CI/CD and will only be used for these local developer deployments.
Substitutions
Within the asset bundles there are substitutions that we can make use of, like:
- `${bundle.name}`
- `${bundle.target}` (use this substitution instead of `${bundle.environment}`)
- `${workspace.host}`
- `${workspace.current_user.short_name}`
More info on these can be found in the Databricks documentation.
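As a quick illustration (the resource and schema names are assumed), these substitutions can be dropped into any string value in the bundle configuration:

```yaml
resources:
  pipelines:
    dlt_bronze_ingest:
      name: ${bundle.target}_dlt_bronze_ingest             # e.g. user_dlt_bronze_ingest
      target: ${workspace.current_user.short_name}_bronze  # per-developer schema
```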
Development & Production Mode
When defining a target in `Databricks.yml` there is an attribute named `mode` which can be set to either `development` or `production`. So what are the differences between these two?
Development Mode
- Designed for individual contributors testing changes in their personal workspaces.
- Enables deployment directly from the CLI (`databricks bundle deploy`).
- By default, deploys to a user-specific folder, avoiding conflicts with shared or production workspaces.
- Ideal for rapid iteration and debugging without impacting other users.
- Deployment paths are flexible (e.g., `/Users/<username>/workspace`), allowing developers to test without affecting shared resources.
Production Mode
- Intended for stable deployments managed by a service principal or a designated individual.
- Ensures consistent deployments to shared production environments, typically through CI/CD pipelines.
- Follows stricter configurations to align with enterprise standards, like deploying to defined production paths and ensuring robust security controls.
- Deployment is centralised (e.g., `/Workspace/production/.bundle/`) and often automated, ensuring reliability and compliance with production standards.
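A sketch of how the two modes might sit side by side in `Databricks.yml` (the paths are illustrative):

```yaml
targets:
  dev:
    mode: development    # per-user paths, "[dev <username>]" name prefixes
    default: true
  prod:
    mode: production     # fixed shared paths, deployed via CI/CD
    workspace:
      root_path: /Workspace/production/.bundle/${bundle.name}
```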
If you’ve got a keen eye, you may already be piecing together what the solution looks like by combining these two concepts.
Deployment Example
We are using three Databricks workspaces, one for each environment: Dev, Test and Prod. These are linked to branches, and releases happen based on the branch the code lives in. Developers have access to the Dev workspace, and this is where they can test their work.
We need to define a target in the Databricks.yml file that is set to development mode. By default, when initialising an asset bundle this will be the `dev` target. However, in my use case I wish to have the three Dev/UAT/Prod environments managed by service principals and automatically deployed through CI/CD. Therefore, I will register a separate target that developers will use, which I will call `user`. By using substitution to add the username into the target schema for DLT pipelines, we ensure that the pipeline and tables are isolated per developer. As this substitution only happens on the `user` target, pushing changes through CI/CD will still only release to the Dev/UAT/Prod targets. So what does this look like in DAB?
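A minimal sketch of that `user` target, assuming two pipelines named `dlt_bronze_ingest` and `dlt_silver_transform` (the second is hypothetical) and an illustrative node type:

```yaml
targets:
  user:
    mode: development
    resources:
      pipelines:
        dlt_bronze_ingest:
          target: ${workspace.current_user.short_name}_bronze  # isolated schema per developer
          clusters:
            - label: default
              node_type_id: Standard_DS3_v2   # smaller, cheaper worker for dev
              autoscale:
                min_workers: 1
                max_workers: 2
          photon: false                       # photon off to cut dev costs
        dlt_silver_transform:                 # hypothetical second pipeline
          target: ${workspace.current_user.short_name}_silver
```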
Within Databricks.yml we’ve defined a new target called `user`. This is set to `mode: development` (meaning the deployment paths are flexible) and contains two pipelines in which we override the target. In our new target schema name we include `${workspace.current_user.short_name}`, which results in a separate schema being created for every developer deploying the asset bundle.
This is how we overcome the main challenge imposed by DLT mentioned at the start of this blog: developers can work simultaneously, and the "source of truth" development tables are protected.
Further refinements can be made, such as reducing the cluster sizes for these deployments and removing photon (as shown in the sketch above). This reduces costs, which matters for development deployments in large teams, especially given that development workspaces often have access to smaller volumes of data. We could even switch to serverless for the user deployments if your use case allows it.
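If serverless is an option, the override is small. A sketch, reusing the same `user` target (the base definition's `clusters` block would need to be dropped, as serverless pipelines don't take cluster specs):

```yaml
targets:
  user:
    resources:
      pipelines:
        dlt_bronze_ingest:
          serverless: true   # compute managed by Databricks, no clusters block
```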
An example of other changes that can be made, where we introduce custom configuration and again adapt the cluster size to the `dev` environment, is shown below:
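For instance (the configuration key and worker count are assumed), the `dev` target might override:

```yaml
targets:
  dev:
    mode: development
    resources:
      pipelines:
        dlt_bronze_ingest:
          configuration:
            my_pipeline.env: dev    # hypothetical key read inside the notebook
          clusters:
            - label: default
              num_workers: 1        # fixed small cluster for dev volumes
```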
Conclusion
By following the principles described above, you can be assured that your deployments follow CI/CD best practices and that code deployed into any environment has been tested (by deploying through CI/CD using SPNs). At the same time, you have given your dev team the space to work efficiently, with less deployment overhead when testing their code.
If you’ve been struggling with DLT deployments, implementing Databricks Asset Bundles, or want to chat to us about anything Databricks, reach out and one of our experts will help get you on the right track.
Author
Jordan Witcombe