
Databricks SDK for Python, a REST API Replacement

If you're reading this, then it's likely that at some point you've needed to programmatically interact with a Databricks instance to make a modification of some kind. It's also likely that you've encountered one or more of the potential pitfalls that come with such a task. Thankfully, the Databricks SDK for Python is here to save the day.

The Databricks REST API

To programmatically interact with a Databricks instance, such as to modify a catalog configuration, developers sometimes choose to use the Databricks REST API. This process has some drawbacks, however, as breaking changes are made to the API on a regular enough basis to create considerable work when maintaining a code base.

Anecdotally, I know several developers at Advancing Analytics who have had to support clients to fix problems caused by these changes.

Maintainability of a REST client can present its own set of difficulties, even without breaking changes.

  1. Authentication is manual, requiring that a bearer token be provided at runtime
  2. Authentication may differ between developer environments and deployed instances, requiring manual management
  3. URLs have to be generated at runtime, presenting a point of failure
  4. Failures have to be mapped manually from response status codes
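The drawbacks above can be sketched with the standard library. The endpoint path mirrors the Unity Catalog REST API, but the host, token, and catalog name are placeholders, and the error mapping is illustrative only:

```python
# A sketch of the manual work a hand-rolled REST client entails.
import json
import urllib.error
import urllib.request

def catalog_url(host: str, name: str) -> str:
    # 3. URLs are assembled by hand -- a typo only fails at runtime
    return f"{host}/api/2.1/unity-catalog/catalogs/{name}"

def update_catalog_isolation(host: str, token: str, name: str) -> dict:
    body = json.dumps({"isolation_mode": "ISOLATED"}).encode()
    request = urllib.request.Request(
        catalog_url(host, name),
        data=body,
        method="PATCH",
        # 1. A bearer token must be supplied (and rotated) manually
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        # 4. Failures arrive as bare status codes to be mapped by hand
        if err.code == 404:
            raise ValueError(f"catalog {name!r} not found") from err
        if err.code in (401, 403):
            raise PermissionError("token rejected or lacks access") from err
        raise
```

Every one of these steps is boilerplate that must be kept in sync with the API as it changes.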

The Databricks SDK for Python, a Better Alternative?

Thankfully, Databricks now provides the [Databricks SDK for Python](https://github.com/databricks/databricks-sdk-py) as an alternative. This library provides a set of interfaces that allow developers to programmatically interact with a Databricks instance in a reliable and maintainable way.

Databricks does warn that the SDK is currently in Beta, which should be taken into consideration before migrating. However, they do state that the SDK "is supported for production use cases", and it provides feature parity with the REST API.

The SDK also provides some extra utilities, such as allowing for easy authentication on developer machines and automated authentication on notebooks run inside a Databricks instance. Below is an example showing how the SDK might be used for updating a catalog's isolation mode.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import CatalogIsolationMode

def set_catalog_isolated():
    # Credentials are resolved automatically from the environment
    client = WorkspaceClient()
    catalog = client.catalogs.update(
        name="<catalog_name>",
        isolation_mode=CatalogIsolationMode.ISOLATED,
    )
    return catalog

By utilizing the Python SDK instead of the REST API, we have not only simplified the operation itself but also improved error handling. A REST client must process a response and surface any error messages it contains, which a developer must then parse from the message body and handle manually.

With the SDK, errors defined in the library are raised automatically, allowing for better exception handling and control flow.

For more information on error handling, see the official GitHub repository's 'Error Handling' section.

Authentication and Authorization

In my opinion, the single biggest motivator for moving from the REST API to the SDK is how authentication and authorization are handled. Authentication through the Databricks REST API requires manual management of a bearer token, which presents a natural security risk should a token be made public.

The Python SDK provides a much easier way to manage authentication and authorization. When running locally, a developer need only ensure they have a .databrickscfg file in their home directory, containing the target Databricks instance's host and a generated Personal Access Token. No code changes are needed to authenticate; the SDK picks up the configuration automatically.

For more information on authenticating with a Databricks config file, see this documentation.
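A minimal .databrickscfg might look like the following; the host and token values are placeholders:

```ini
[DEFAULT]
host  = https://<your-workspace>.azuredatabricks.net
token = <personal-access-token>
```

With this file in place, `WorkspaceClient()` can be constructed with no arguments on a developer machine.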

If a developer runs a notebook from inside the Databricks instance, then authentication will be handled automatically, with an access token generated as required and expired automatically afterwards. This process also guarantees developers the same level of access through the SDK as they would have from inside the Databricks instance.

If a notebook is triggered through a service principal, the same process occurs. Databricks will generate an appropriate token at runtime, which is then expired when the notebook terminates. This ensures that access is scoped to the resources the service principal would normally have, rather than whatever access a previously generated token stored in a Key Vault might grant.
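When authenticating as a service principal outside a notebook (for example, in a CI pipeline), the SDK's unified authentication can also pick up credentials from environment variables. The values below are placeholders:

```shell
# OAuth machine-to-machine authentication via environment variables
export DATABRICKS_HOST="https://<your-workspace>.azuredatabricks.net"
export DATABRICKS_CLIENT_ID="<service-principal-application-id>"
export DATABRICKS_CLIENT_SECRET="<oauth-secret>"
```

As with the config file, no authentication code needs to be written; `WorkspaceClient()` resolves these variables automatically.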

Caveats and Considerations

The biggest caveat when considering the Python SDK is stability. The Databricks Python SDK is currently in Beta, and given that stability is the primary motivation for moving away from the REST API, this may seem to undermine the entire argument.

However, I believe this problem is less severe than it may first appear. Although the SDK is in Beta, Databricks advise that it is safe for use in production deployments.

It is also worth bearing in mind that, for a library of this nature, Beta typically indicates a mature project being polished for a full release, rather than a newer, less feature-complete interface.

The future of both options is also a serious consideration. The REST API has shown itself to be inconsistent and unreliable, and there's no plan for this to change any time soon, whereas the SDK promises improved stability in the future.


Author

Jake Ratcliffe