
Databricks Serverless: Simplifying Compute in the Lakehouse

Overview

When working with data engineering, analytics, or machine learning workloads, infrastructure management often takes up more time than it should. Enter Databricks Serverless, a fully managed, on-demand compute environment designed to reduce overhead and allow you to focus on what really matters—working with data.

Whether you're running Serverless SQL Warehouses or the newly introduced Serverless Jobs (in preview), these compute options automatically scale with your workloads, minimising idle costs and making scaling infrastructure effortless.

Key Features of Databricks Serverless:

  • Automatic Scaling: Dynamically adjusts resources to meet the demands of your workloads.
  • Fully Managed: Say goodbye to configuring, scaling, or maintaining clusters.
  • Cost-Efficient: Only pay for what you use—perfect for intermittent or unpredictable workloads.

Why Use Serverless? Identifying the Best Fit

Serverless compute is ideal for workloads that are intermittent, require rapid scaling, or benefit from low operational overhead. Here are a few scenarios where Serverless shines:

  1. Ad-Hoc Analytics
    Running sporadic SQL queries? Serverless SQL Warehouses are ideal. They spin up quickly (typically under 10 seconds) and terminate when idle, ensuring cost-efficiency. In contrast, provisioned clusters can take several minutes to start and require manual management, which can lead to higher costs from idle resources.
  2. Scheduled ETL Pipelines
    Automate your ETL processes with Serverless Jobs, which allocate resources just for the duration of the task and shut down automatically when the job completes.
  3. Interactive Exploration
    Need a quick session to analyse a dataset? Serverless Compute for Notebooks provides resources on-demand, so you can dive into data exploration without worrying about cluster setup.
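As a concrete sketch of the ad-hoc pattern, the snippet below builds a request for the Databricks SQL Statement Execution API (`/api/2.0/sql/statements`) against a Serverless SQL Warehouse, using only the Python standard library. The hostname, warehouse ID, and token are placeholders for your own workspace; the request is built but not sent here:

```python
import json
import urllib.request

# Sketch: running an ad-hoc query on a Serverless SQL Warehouse via the
# SQL Statement Execution API. Host, warehouse ID, and token below are
# placeholders, not real values.

def build_statement_request(host, warehouse_id, token, statement):
    """Build (but do not send) the HTTP request for a SQL statement."""
    payload = json.dumps({
        "warehouse_id": warehouse_id,
        "statement": statement,
        "wait_timeout": "30s",  # block up to 30s waiting for the result
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/sql/statements",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_statement_request(
    "adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace
    "abc123def456",                                 # placeholder warehouse ID
    "<personal-access-token>",
    "SELECT current_date() AS today",
)
# Sending it would be: urllib.request.urlopen(req)  (requires a live workspace)
```

Because the warehouse is serverless, the same request works whether or not compute is currently running; the first statement after idle simply absorbs the short spin-up.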

What About Networking? Meet NCC

One of the standout features of Serverless compute is its support for Network Connectivity Configurations (NCC), which allow you to connect securely to your Azure resources. NCC ensures private, managed connectivity between Databricks Serverless compute and your data sources, such as Azure Data Lake Storage (ADLS).

How NCC Works

  • NCC leverages Azure Private Link to create managed private endpoints.
  • This ensures that all communication between Serverless compute and your Azure resources happens securely, without exposing traffic to the public internet.
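In practice, Private Link means the storage endpoint's FQDN resolves to a private IP from serverless compute. A quick, hedged sanity check you could run from a notebook (the storage account name is a hypothetical placeholder) is to resolve the hostname and test whether the address is private:

```python
import ipaddress
import socket

# Sketch: checking that a storage endpoint resolves to a private address,
# which is what you'd expect once NCC/Private Link is in place.

def resolves_privately(hostname: str) -> bool:
    """True if the hostname resolves to a private or loopback address."""
    ip = socket.gethostbyname(hostname)
    # .is_private covers the RFC 1918 ranges as well as loopback
    return ipaddress.ip_address(ip).is_private

# From serverless compute you would check something like:
# resolves_privately("mystorageacct.dfs.core.windows.net")  # expect True with NCC
print(resolves_privately("127.0.0.1"))  # loopback counts as private -> True
```

If the FQDN still resolves to a public IP, traffic is not flowing over the managed private endpoint and the NCC configuration needs revisiting.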

Limitations to Consider

  • No On-Prem Connectivity: NCC cannot directly connect to on-premises systems. If this is a requirement, consider using tools like Azure Data Factory for data replication.
  • Azure-Native Only: NCC is tailored for Azure services (ADLS, Blob Storage, Cosmos DB). Connecting to non-Azure services will require alternative solutions.

Pro Tip: If your workflows depend on on-premises data sources, explore hybrid solutions like Azure Arc to bridge the gap.

Managing Dependencies Without Init-Scripts

A common question we hear is: What happens to init-scripts in a Serverless world? In traditional clusters, init-scripts allow you to pre-configure environments. Serverless removes this level of access, but there are some great alternatives:

  1. Notebook-Scoped Libraries
    Use %pip install to dynamically install Python libraries directly within your notebooks. 

    For example: %pip install /dbfs/FileStore/wheels/my-library-0.1-py3-none-any.whl
  2. Jobs API
    For tasks outside of notebooks, define library dependencies dynamically in your job configuration.

```json
{
  "libraries": [
    { "pypi": { "package": "pandas" } },
    { "whl": "dbfs:/path/to/my-library.whl" }
  ]
}
```

  3. Centralized Management with Unity Catalog
    For larger teams, manage shared resources and dependencies centrally using Unity Catalog, ensuring consistency and governance.
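For option 2, the sketch below shows how a `libraries` block like the one above might sit inside a full Jobs API 2.1 job definition. The job name, notebook path, and wheel path are hypothetical placeholders, and newer serverless jobs may manage Python dependencies via environment specs instead, so treat this as illustrative rather than definitive:

```python
import json

# Sketch: a Jobs API 2.1 job-create payload that attaches libraries at
# submission time instead of via init-scripts. All names and paths are
# placeholders for illustration.
job_spec = {
    "name": "serverless-etl-example",          # hypothetical job name
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {
                "notebook_path": "/Repos/team/my_etl"  # hypothetical path
            },
            "libraries": [
                {"pypi": {"package": "pandas"}},
                {"whl": "dbfs:/path/to/my-library.whl"},
            ],
        }
    ],
}
print(json.dumps(job_spec, indent=2))
```

The payload would then be POSTed to `/api/2.1/jobs/create` with a workspace token, exactly as with any other Jobs API call.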

Hydr8 and Serverless

At Advancing Analytics, we’ve been exploring how Serverless fits into our Hydr8 framework. While Serverless simplifies infrastructure management, transitioning from traditional clusters requires some adjustments:

  • Replacing Init-Scripts: We now use notebook-scoped libraries and Jobs API for managing dependencies.
  • Adapting to Shared Clusters: Serverless shifts from single-use clusters to shared environments. Using Unity Catalog has been key to ensuring secure, multi-user access.
  • On-Prem Data: For on-prem data sources, we've explored solutions like Azure Data Factory and hybrid architectures to bring data into Azure for seamless processing.

Are There Drawbacks?

While Serverless compute is fantastic for many use cases, it’s not a one-size-fits-all solution. Here are a few limitations to keep in mind:

  • Logging and Monitoring: Spark logs and the Spark UI are not available.
  • External Data Ingestion: Because serverless compute does not support JAR file installation, you cannot use a JDBC or ODBC driver to ingest data from an external data source.
  • Limited Customization: No direct access to cluster-level configurations like Spark settings or init-scripts.
  • Higher Costs for Continuous Workloads: For always-on tasks, provisioned clusters might be more cost-effective.
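The cost trade-off in the last bullet can be made concrete with a toy model. The rates below are illustrative placeholders, not real Databricks pricing: serverless bills only for active time, while a provisioned cluster bills for the whole time it stays up:

```python
# Illustrative break-even sketch (rates are made-up placeholders, NOT
# real Databricks pricing). Serverless pays a higher rate but only for
# active time; provisioned pays a lower rate for the full hour.
serverless_rate = 0.70   # $/hour while active, placeholder
provisioned_rate = 0.40  # $/hour always-on, placeholder

def hourly_cost(active_fraction):
    """Cost per wall-clock hour for each model at a given utilisation."""
    serverless = serverless_rate * active_fraction  # pay only for use
    provisioned = provisioned_rate * 1.0            # pay for the whole hour
    return serverless, provisioned

for frac in (0.1, 0.5, 0.9):
    s, p = hourly_cost(frac)
    winner = "serverless" if s < p else "provisioned"
    print(f"active {frac:.0%}: serverless ${s:.2f} vs provisioned ${p:.2f} -> {winner}")
```

The crossover point depends entirely on your real rates and utilisation, but the shape of the comparison holds: the more continuously a workload runs, the more an always-on cluster closes the gap.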

Wrapping Up

Databricks Serverless is an exciting step toward simplifying data engineering and analytics workflows. By reducing the need for manual infrastructure management, it allows teams to focus on building value, not managing compute.

At Advancing Analytics, we see Serverless as a game-changer for dynamic, on-demand workloads. Whether you’re optimising ETL pipelines, enabling ad-hoc analytics, or scaling exploratory analysis, Serverless offers flexibility, efficiency, and cost savings.

Ready to streamline your data engineering and analytics? Visit our Analytics page for more info or contact us today to learn how we can help you leverage data analytics for your business.


Author

Mo Uddin