Liquid Clustering (LC) was designed to replace table partitioning and the ZORDER command, simplifying data design decisions whilst also optimising query performance. LC provides further flexibility by allowing the cluster keys to be changed without rewriting the underlying data.
Table partitioning is a technique for optimising large-scale processing and queries. Partitioning, as the name suggests, divides the underlying data files of a table into several distinct segments based on specific columns, allowing queries to scan only the relevant files. Z-Ordering further enhances this by co-locating related data in close physical proximity on disk based on multiple columns, so that queries benefit from data skipping.
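To make the traditional approach concrete, the sketch below assembles the two statements involved: the DDL for a partitioned Delta table and the OPTIMIZE ... ZORDER BY command. The helper functions and the table/column names are illustrative only; on Databricks you would run the resulting strings with spark.sql(), or write the SQL directly in a SQL cell.

```python
# Illustrative helpers that assemble the traditional layout-tuning statements.
# Table and column names are hypothetical; on Databricks you would pass the
# resulting strings to spark.sql().

def partitioned_table_ddl(table, columns, partition_cols):
    """DDL for a Delta table whose files are split into one folder per partition value."""
    return (
        f"CREATE TABLE {table} ({columns}) "
        f"USING DELTA PARTITIONED BY ({', '.join(partition_cols)})"
    )

def zorder_statement(table, zorder_cols):
    """OPTIMIZE ... ZORDER BY co-locates related rows within files to enable data skipping."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})"

print(partitioned_table_ddl("sales", "id BIGINT, region STRING, sale_date DATE", ["sale_date"]))
# CREATE TABLE sales (id BIGINT, region STRING, sale_date DATE) USING DELTA PARTITIONED BY (sale_date)
print(zorder_statement("sales", ["region", "id"]))
# OPTIMIZE sales ZORDER BY (region, id)
```

Note that the partition columns are baked into the physical folder layout at write time, while Z-Ordering has to be re-run via OPTIMIZE as data changes; both points matter for the pitfalls discussed next.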
Sounds great, so why the need to change? Misusing table partitioning and Z-Ordering can lead to serious performance pitfalls. Over-partitioning, where too many small partitions are created, can overwhelm the system with excessive metadata and small files, causing slower query performance and higher storage costs due to low compression rates. On the other hand, poor partitioning choices might result in uneven data distribution, leading to data skew and inefficient resource utilisation. Similarly, misapplying Z-Ordering, such as using it on low-cardinality columns or columns irrelevant to query patterns, can increase processing overhead without delivering any performance benefit.
Finally, if the way your data is accessed changes and queries no longer benefit from the current partition columns, or a more efficient partitioning scheme is identified, the data layout itself becomes the problem. Since the data is physically stored in subfolders based on these partitions, altering the partition columns requires a full data rewrite, making the process time-consuming and resource-intensive.
Liquid Clustering in Databricks is designed to address the limitations above by dynamically optimising the data layout as new data is ingested. Unlike static partitioning, which can cause data skew and degrade query performance over time, Databricks describes it as an "out-of-the-box, self-tuning data layout that automatically scales with your data". Liquid Clustering continuously reorganises data based on query patterns and data changes, intelligently grouping data with similar access patterns so that queries remain fast even as the data evolves. This adaptive approach reduces the need for manual tuning, prevents performance degradation due to data skew, and optimises storage, making data management more efficient and responsive.
This lean towards a more managed approach is one we are familiar with in the Databricks world: it lets you, the data engineer, spend more time on the important tasks whilst the platform works out the most optimal data layout (without the need for data rewrites). When deploying large data tables, it is easy to spend a lot of time fine-tuning, which can get quite expensive in both time and compute costs. And given time constraints, you may never find the optimal set-up for your data, leaving queries taking longer than necessary.
In Databricks' internal testing, Liquid Clustering has been shown to improve write speed by up to 7x, with query-time improvements anywhere from 2x to 12x. These improvements are really quite something and not to be taken lightly. Imagine stakeholders being shown a dashboard that is 12x faster; that's good news for everyone!
Below are some of the key requirements to enable and cluster your tables efficiently with Liquid Clustering:
- It applies to Delta tables only, and creating or writing to a clustered table requires Databricks Runtime 13.3 LTS or above.
- Clustering keys must be columns with statistics collected (by default, the first 32 columns of the table).
- Liquid Clustering cannot be combined with partitioning or ZORDER on the same table.
- Data is clustered incrementally as it is written and when OPTIMIZE runs, so schedule regular OPTIMIZE jobs on large tables to keep them well clustered.
Enabling LC on either a new or an existing unpartitioned Delta table could not be easier. See how to set this up below in SQL and Python:
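A minimal sketch of both forms follows. The helper functions simply assemble the SQL, and the table and column names are hypothetical; on Databricks you would execute the resulting strings with spark.sql(), or paste the SQL straight into a SQL cell.

```python
# Illustrative: build the Liquid Clustering statements for a new table and
# for changing keys on an existing one. Names are hypothetical; on Databricks
# run these strings with spark.sql() (or write the SQL directly in a SQL cell).

def create_clustered_ddl(table, columns, keys):
    """New Delta table with Liquid Clustering enabled from day one."""
    return f"CREATE TABLE {table} ({columns}) CLUSTER BY ({', '.join(keys)})"

def alter_cluster_keys(table, keys):
    """Change (or remove) the clustering keys -- no data rewrite is required."""
    target = f"({', '.join(keys)})" if keys else "NONE"
    return f"ALTER TABLE {table} CLUSTER BY {target}"

print(create_clustered_ddl("events", "event_id BIGINT, event_type STRING, ts TIMESTAMP",
                           ["event_type", "ts"]))
print(alter_cluster_keys("events", ["event_id"]))  # refine the keys later
print(alter_cluster_keys("events", []))            # CLUSTER BY NONE switches clustering off
```

Newly written data is clustered on a best-effort basis; running OPTIMIZE on the table triggers incremental clustering of the rest, so existing data picks up a key change over time rather than through a full rewrite.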
Alternatively, you can have these optimisations run automatically by enabling 'predictive optimization'. This feature must be switched on at the account level, which requires an account admin, so it may not be suitable for all use cases. Once enabled for the account, it can be applied to a catalog or schema by the relevant owners. Serverless compute must also be enabled on your account (with the relevant networking rules applied if you are in a private network).
Once these have been enabled, Databricks will identify the tables that would benefit from these maintenance operations (OPTIMIZE and VACUUM) and run them only when deemed necessary, which removes both unnecessary runs and the manual burden of taking time to optimise tables. More information on predictive optimisation can be found in the Databricks documentation here.
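For reference, the sketch below generates the enablement statements a catalog or schema owner would run, following the Databricks SQL syntax at the time of writing; the catalog and schema names (main, main.sales) are hypothetical.

```python
# Illustrative: the statements a catalog or schema owner would run to turn
# predictive optimization on. Syntax per the Databricks docs at the time of
# writing; the catalog/schema names are hypothetical. On Databricks, run each
# string with spark.sql() or in a SQL cell.

statements = [
    f"ALTER {target} ENABLE PREDICTIVE OPTIMIZATION;"
    for target in ("CATALOG main", "SCHEMA main.sales")
]
for stmt in statements:
    print(stmt)
# ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;
# ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION;
```

A matching DISABLE form exists to switch the feature back off at either level.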
Dynamic vs. Static Structure: Partitioning and Z-Ordering fix the layout when the data is written, and changing partition columns means rewriting the table. Liquid Clustering keys can be changed with a simple ALTER TABLE, and the new layout is applied incrementally as data is clustered.
Granularity: Partitioning works at the directory level, creating one folder per partition value, which is where the small-file and skew problems originate. Liquid Clustering works at the file level, grouping related rows together without rigid folder boundaries.
Efficiency: Z-Ordering rewrites all the data it touches every time OPTIMIZE runs, whereas Liquid Clustering is incremental, clustering only new or previously unclustered data and saving both time and compute.
Do we think you should use Liquid Clustering? The short answer is: yes. Databricks recommends LC for all new Delta tables. As described in this blog, doing so adds no complexity for data engineers (in fact it removes it!), it improves the performance of tables for downstream users, and it offers the flexibility to change keys if the needs of the data change over time.
Have tables you feel would benefit from this improvement in performance? Reach out to us here at AA to discuss how we may be able to help you and your organisation optimise your platform performance.