Liquid Clustering (LC) was designed to replace table partitioning and the ZORDER command, simplifying data design decisions whilst also optimising query performance. LC provides further flexibility by allowing the cluster keys to be changed without rewriting the underlying data.
Table partitioning is a technique for optimising large-scale processing and queries. Partitioning, as the name suggests, divides the underlying data files of a table into several distinct segments based on specific columns, allowing queries to scan only the relevant files. Z-Ordering further enhances this by co-locating related data in close physical proximity on disk based on multiple columns, so that queries benefit from data skipping.
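To make the traditional approach concrete, the sketch below assembles the two statements involved: the DDL for a partitioned Delta table and the OPTIMIZE ... ZORDER BY command. The helper functions and the table/column names are illustrative only; on Databricks you would run the resulting strings with spark.sql(), or write the SQL directly in a SQL cell.

```python
# Illustrative helpers that assemble the traditional layout-tuning statements.
# Table and column names are hypothetical; on Databricks you would pass the
# resulting strings to spark.sql().

def partitioned_table_ddl(table, columns, partition_cols):
    """DDL for a Delta table whose files are split into one folder per partition value."""
    return (
        f"CREATE TABLE {table} ({columns}) "
        f"USING DELTA PARTITIONED BY ({', '.join(partition_cols)})"
    )

def zorder_statement(table, zorder_cols):
    """OPTIMIZE ... ZORDER BY co-locates related rows within files to enable data skipping."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})"

print(partitioned_table_ddl("sales", "id BIGINT, region STRING, sale_date DATE", ["sale_date"]))
# CREATE TABLE sales (id BIGINT, region STRING, sale_date DATE) USING DELTA PARTITIONED BY (sale_date)
print(zorder_statement("sales", ["region", "id"]))
# OPTIMIZE sales ZORDER BY (region, id)
```

Note that the partition columns are baked into the physical folder layout at write time, while Z-Ordering has to be re-run via OPTIMIZE as data changes; both points matter for the pitfalls discussed next.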
Sounds great, so why the need to change? Misusing table partitioning and Z-Ordering can lead to serious performance pitfalls. Over-partitioning, where too many small partitions are created, can overwhelm the system with excessive metadata and small files, causing slower query performance and higher storage costs due to low compression rates. On the other hand, poor partitioning choices might result in uneven data distribution, leading to data skew and inefficient resource utilisation. Similarly, misapplying Z-Ordering, such as using it on low-cardinality columns or columns irrelevant to query patterns, can increase processing overhead without delivering any performance benefit.
Finally, if the way your data is accessed changes and queries no longer benefit from the current partition columns, or a more efficient partitioning scheme is identified, the data layout itself becomes the problem. Since the data is physically stored in subfolders based on these partitions, altering the partition columns requires a full data rewrite, making the process time-consuming and resource-intensive.
Liquid Clustering in Databricks is designed to address the limitations above by dynamically optimising the data layout as new data is ingested. Unlike static partitioning, which can cause data skew and degrade query performance over time, Databricks describes it as an "out-of-the-box, self-tuning data layout that automatically scales with your data". Liquid Clustering continuously reorganises data based on query patterns and data changes, intelligently grouping data with similar access patterns so that queries remain fast even as the data evolves. This adaptive approach reduces the need for manual tuning, prevents performance degradation due to data skew, and optimises storage, making data management more efficient and responsive.
This lean towards a more managed approach is one we are familiar with in the Databricks world: it lets you, the data engineer, spend more time on the important tasks whilst the platform works out the most optimal data layout (without the need for data rewrites). When deploying large data tables, it is easy to spend a lot of time fine-tuning, which can get quite expensive in both time and compute costs. And given time constraints, you may never find the optimal set-up for your data, leaving queries taking longer than necessary.
In Databricks' internal testing, Liquid Clustering has been shown to improve write speed by up to 7x, with query-time improvements anywhere from 2x to 12x. These improvements are really quite something and not to be taken lightly. Imagine stakeholders being shown a dashboard that is 12x faster; that's good news for everyone!
Below are some of the key requirements to enable and cluster your tables efficiently with Liquid Clustering:
- It applies to Delta tables only, and creating or writing to a clustered table requires Databricks Runtime 13.3 LTS or above.
- Clustering keys must be columns with statistics collected (by default, the first 32 columns of the table).
- Liquid Clustering cannot be combined with partitioning or ZORDER on the same table.
- Data is clustered incrementally as it is written and when OPTIMIZE runs, so schedule regular OPTIMIZE jobs on large tables to keep them well clustered.
Enabling LC on either a new or an existing unpartitioned Delta table could not be easier. See how to set this up below in SQL and Python:
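A minimal sketch of both forms follows. The helper functions simply assemble the SQL, and the table and column names are hypothetical; on Databricks you would execute the resulting strings with spark.sql(), or paste the SQL straight into a SQL cell.

```python
# Illustrative: build the Liquid Clustering statements for a new table and
# for changing keys on an existing one. Names are hypothetical; on Databricks
# run these strings with spark.sql() (or write the SQL directly in a SQL cell).

def create_clustered_ddl(table, columns, keys):
    """New Delta table with Liquid Clustering enabled from day one."""
    return f"CREATE TABLE {table} ({columns}) CLUSTER BY ({', '.join(keys)})"

def alter_cluster_keys(table, keys):
    """Change (or remove) the clustering keys -- no data rewrite is required."""
    target = f"({', '.join(keys)})" if keys else "NONE"
    return f"ALTER TABLE {table} CLUSTER BY {target}"

print(create_clustered_ddl("events", "event_id BIGINT, event_type STRING, ts TIMESTAMP",
                           ["event_type", "ts"]))
print(alter_cluster_keys("events", ["event_id"]))  # refine the keys later
print(alter_cluster_keys("events", []))            # CLUSTER BY NONE switches clustering off
```

Newly written data is clustered on a best-effort basis; running OPTIMIZE on the table triggers incremental clustering of the rest, so existing data picks up a key change over time rather than through a full rewrite.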
Alternatively, you can have these optimisations run automatically by enabling 'predictive optimization'. This feature must be switched on at the account level, which requires an account admin, so it may not be suitable for all use cases. Once enabled for the account, it can be applied to a catalog or schema by the relevant owners. Serverless compute must also be enabled on your account (with the relevant networking rules applied if you are in a private network).
Once these have been enabled, Databricks will identify the tables that would benefit from these maintenance operations (OPTIMIZE and VACUUM) and run them only when deemed necessary, which removes both unnecessary runs and the manual burden of taking time to optimise tables. More information on predictive optimisation can be found in the Databricks documentation here.
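For reference, the sketch below generates the enablement statements a catalog or schema owner would run, following the Databricks SQL syntax at the time of writing; the catalog and schema names (main, main.sales) are hypothetical.

```python
# Illustrative: the statements a catalog or schema owner would run to turn
# predictive optimization on. Syntax per the Databricks docs at the time of
# writing; the catalog/schema names are hypothetical. On Databricks, run each
# string with spark.sql() or in a SQL cell.

statements = [
    f"ALTER {target} ENABLE PREDICTIVE OPTIMIZATION;"
    for target in ("CATALOG main", "SCHEMA main.sales")
]
for stmt in statements:
    print(stmt)
# ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;
# ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION;
```

A matching DISABLE form exists to switch the feature back off at either level.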
Dynamic vs. Static Structure: Partitioning and Z-Ordering fix the layout when the data is written, and changing partition columns means rewriting the table. Liquid Clustering keys can be changed with a simple ALTER TABLE, and the new layout is applied incrementally as data is clustered.
Granularity: Partitioning works at the directory level, creating one folder per partition value, which is where the small-file and skew problems originate. Liquid Clustering works at the file level, grouping related rows together without rigid folder boundaries.
Efficiency: Z-Ordering rewrites all the data it touches every time OPTIMIZE runs, whereas Liquid Clustering is incremental, clustering only new or previously unclustered data and saving both time and compute.
Do we think you should use Liquid Clustering? The short answer is: yes. Databricks recommends LC for all new Delta tables. As described in this blog, doing so adds no complexity for data engineers (in fact it removes it!), it improves the performance of tables for downstream users, and it offers the flexibility to change keys if the needs of the data change over time.
Have tables you feel would benefit from this improvement in performance? Reach out to us here at AA to discuss how we may be able to help you and your organisation optimise your platform performance.