
Data Engineer's Takeaways from Spark & AI Summit 2019

We spent the best part of October speaking at various conferences in Europe, but we took special time out to attend the Spark & AI Summit in Amsterdam – two days of keynotes, Spark deep-dives, machine learning use cases and some fantastic conversations.

Unlike many of the Microsoft conferences, which focus on the cloud platform itself and getting people up to speed, much of the summit was spent looking forward to new Spark features, elements of Spark 3.0 and what we can expect in the future.

I thought I’d collect my thoughts, learnings and general takeaways into a Data Engineer’s Perspective on the Spark & AI Summit.

Keynotes

The first day’s keynotes included some good information about the general Databricks approach, as well as some industry spotlights to hammer home the impact these tools are having. By far the biggest news for us, however, was the Spark 3.0 session led by Michael Armbrust. We’ll dig a little into these announcements later.

The second day saw some fantastic demonstrations of the model registry features coming to MLflow, but I’ll let Terry expand on that front. We also saw some great examples of how Microsoft is turning AI towards social & environmental projects, again shining a light on how important AI is becoming in the technology landscape. The Microsoft session also quietly announced MMLSpark, a selection of Spark libraries curated by Microsoft for integrating with Cognitive Services and performing HTTP activities distributed across Spark clusters.

Finally, we had a double-whammy of brain-blowing mid-day keynotes - Gaël Varoquaux, one of the brains behind scikit-learn, talked about the latest features and improvements, before Katie Bouman took the stage to provide an expansive look into the challenges faced by the team who took the first picture of a black hole. Reducing a room of Spark developers & data scientists to gently-steaming brain puddles is no mean feat, so kudos to both.

Spark 3.0

So first, we need to talk about Spark 3.0, a huge raft of changes to the Apache Spark project that's been rumbling for a while now. We had a hunch that we'd be seeing a grand release announcement at the summit, but sadly it's still a while away. We did, however, get confirmation that we'll be seeing a preview of Spark 3.0 during this quarter - so it's getting close at least.

Data Catalog

This is fairly huge news - Databricks has included a baked-in Hive metastore for a long while, but this is evolving. We don't have a lot of details yet, but this feature covers extending and expanding Spark's ability to connect to other data sources, to issue push-down queries to more than just HDFS sources and to start moving Spark towards data virtualisation. This is going to be interesting, as there are a slew of other products heading for that prime one-stop-data-shop positioning.

Optimiser

The Catalyst engine inside Spark is also getting a face-lift - there are two main areas we've heard about. Firstly, it'll do some intelligent query optimisation at runtime. This is fairly big - we normally have some base statistics about the data beneath our dataframes, but once the query plan is set, Spark will see it through to the end. With Spark 3.0, if it sees that a dataframe is different from the assumptions the plan was built on, it can change the query plan during execution.
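To make the idea concrete, here's a minimal sketch of what switching on adaptive execution looks like in PySpark - the config key is Spark 3.0's adaptive execution flag, while the path and the aggregation itself are just illustrative:

```python
from pyspark.sql import SparkSession

# Minimal sketch: enable adaptive query execution so Spark can revise the
# physical plan at runtime (eg: switching join strategies or coalescing
# shuffle partitions based on observed statistics).
spark = (SparkSession.builder
         .appName("aqe-sketch")
         .config("spark.sql.adaptive.enabled", "true")
         .getOrCreate())

sales = spark.read.parquet("/mnt/lake/sales")   # hypothetical dataset

# The plan for this aggregation can now be adjusted mid-query rather than
# being fixed once at submission time.
summary = sales.groupBy("region").count()
summary.show()
```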

Our second optimisation is around cross-filtering - normally we have to be fairly explicit with predicate pushdown, including it as a filter in our dataframe specifications. The new engine will see filters placed on tables we're joining to, push those filters across to other dataframes and perform predicate pushdowns. This makes Kimball-style star schemas much more attractive: we can join our fact table to a dimension, filter on a dimension attribute and have that filter trickle back to the fact table's partitions if relevant.
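This cross-filtering behaviour lines up with what Spark 3.0 refers to as dynamic partition pruning. A rough sketch of the star-schema pattern it targets - the table paths and column names here are made up, and I'm assuming an active SparkSession as you'd have in a Databricks notebook:

```python
# Sketch of the star-schema join pattern that benefits from dynamic partition
# pruning in Spark 3.0. Paths and columns are hypothetical; `spark` is the
# active session (provided automatically in a Databricks notebook).
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

fact_sales = spark.read.parquet("/mnt/lake/fact_sales")   # partitioned by date_key
dim_date = spark.read.parquet("/mnt/lake/dim_date")

# Filter on a dimension attribute; at runtime the optimiser can push the
# matching date_key values back to the fact table so only the relevant
# partitions are scanned.
result = (fact_sales
          .join(dim_date, "date_key")
          .where(dim_date["calendar_year"] == 2019)
          .groupBy("product_key")
          .sum("sales_amount"))
result.show()
```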

Processing Pipelines

Pipelines are a concept that will be familiar to any Spark-based data scientists out there. You can build up a series of transforms and pass them into a pipeline as a sequence to apply on an input dataframe. One of the concepts in the works with Spark 3.0 is the idea of data processing pipelines, building up a series of data cleaning, validation and transformation activities that can then be built into a dependency graph.
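For comparison, this is what the existing ML Pipeline concept looks like in PySpark - a minimal sketch with made-up column names and assumed input dataframes, just to illustrate the "sequence of transforms" idea that data processing pipelines would extend:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

# The familiar ML Pipeline: a sequence of stages applied to an input dataframe.
# Column names and the training_df / new_events_df dataframes are assumed.
indexer = StringIndexer(inputCol="country", outputCol="country_idx")
assembler = VectorAssembler(inputCols=["country_idx", "amount"], outputCol="features")
classifier = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, classifier])
model = pipeline.fit(training_df)        # fit the whole sequence in one go
scored = model.transform(new_events_df)  # apply the same sequence to new data
```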

This is certainly an interesting concept, building flexible and robust dependencies into our spark-based data processing sounds like a good idea all around. What worries me is that this is a very inwards-looking idea. Any data processing pipelines in the cloud need to speak to a wide range of different tools and data stores, so using an abstracted orchestration system like Azure Data Factory, that can communicate across a huge range of tools, seems a better idea than building orchestration into the nuts and bolts of the compute engine. But that reflects the Databricks approach - the Unified Data Analytics platform wants to have everything encapsulated in a single system.

Regardless, it's still a very interesting idea and could certainly be built into smaller processing services. Definitely one to look out for as work progresses and we see more previews of the functionality.

Spark .NET

We were hoping that last week would see the general availability of the .NET for Spark libraries, bringing C# to the dataframe API along with support for an entire ecosystem of .NET libraries. Sadly this isn't the case: the libraries are still in preview, perhaps waiting on the release of Spark 3.0 in the months to come. They are, however, compatible with current Spark versions (2.3 onwards), although they require some cluster initialisation scripts and manual setup in order to function.

Engineering Nuggets

I leaned heavily towards the Data Engineering sessions - anything involving ETL, data preparation, lake management or optimisation, I was there. There were a few common themes throughout that I want to highlight here.

DeltaLake

You love parquet, but you miss the transactional consistency of relational databases? That's the problem Delta is trying to solve. Databricks created Delta as a proprietary set of libraries for managing parquet storage, but later open sourced it - this means we're left with two flavours of the same libraries: the open-sourced DeltaLake OSS and the baked-in, proprietary Databricks Delta.
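On the surface, very little changes when you switch from parquet to delta - a minimal sketch (the paths are hypothetical, and the delta libraries need to be available on the cluster, which they are by default on Databricks):

```python
# Writing the same dataframe as plain parquet and as a Delta table.
# Paths are hypothetical; `spark` is the active session and the delta-core
# library is assumed to be on the cluster (built into Databricks runtimes).
events = spark.read.json("/mnt/raw/events")

events.write.format("parquet").mode("overwrite").save("/mnt/lake/events_parquet")
events.write.format("delta").mode("overwrite").save("/mnt/lake/events_delta")

# The delta version is still parquet underneath, plus a _delta_log/ directory
# of JSON transaction files describing which files make up each table version.
```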

The clear message across all sessions is that Delta is seen as the future if you're using Spark as a data transformation engine - it was nearly impossible to step into a session about ETL without Delta being the core storage choice. At Advancing Analytics, we've been a little on the fence about Delta for a couple of reasons, all geared around how you can access your data once it's in delta format.

Delta works by storing your data as parquet files, readable by anything that can read parquet natively. However, it also creates a json-based transaction log, allowing you to perform logical deletes, rollbacks, historical queries and so on. Much of this boils down to keeping multiple copies of your parquet files and using the transaction log to determine which files to access. This unlocks a wealth of functionality, which is fantastic - but if an external system, such as Azure Data Factory or Azure SQL Data Warehouse, tries to read the data, it'll read all of the underlying parquet files and end up with duplicated data. Over in the Amazon world, they're starting to roll out delta format compatibility using manifest files, as discussed here; however, there are several caveats to this preview functionality, so even in AWS it's fairly immature.
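To make that concrete, a quick sketch of the transaction-log functionality using the Delta Lake APIs - the table path and predicate are made up, and I'm assuming a Delta version that supports DML and time travel:

```python
from delta.tables import DeltaTable

# Logical delete: the rows disappear from the current version of the table,
# but earlier parquet files are retained and the transaction log records the
# change. Path and predicate are hypothetical.
table = DeltaTable.forPath(spark, "/mnt/lake/events_delta")
table.delete("event_type = 'test'")

# Historical query (time travel): read the table as it stood at an earlier version.
previous = (spark.read.format("delta")
            .option("versionAsOf", 0)
            .load("/mnt/lake/events_delta"))
```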

That said, more and more of our clients are jumping on Databricks as the one-stop-shop for all lake-based data transformation and analysis, and if you're writing to Azure SQL Data Warehouse directly from the Databricks cluster, you can use delta without fear.

Streaming ETL

Working under this "all things delta" assumption, another trend was pretty clear - all ETL approaches were based on streaming. There's an element of eye-rolling here; streaming has been the "next big thing" in data for an age now, and we've used various architectures to deal with it. The main thrust of the argument is as follows:

Parquet is an ideal file format for lake-based analytics and so is heavily adopted as the standard. However, parquet performs best over large batches of data and is horrendous when many small files are created, which is what streaming forces. One of the functions that delta brings is the ability to compact files - ie: to collect many small files and recompress into larger, more efficient ones. You can therefore stream data directly into a delta table and have it auto-optimise the underlying files on a regular basis.
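A rough sketch of that pattern - the source, schema, paths and trigger interval are all illustrative, and the compaction step itself is either the Databricks OPTIMIZE command or a periodic rewrite of the table if you're on DeltaLake OSS:

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Streaming micro-batches straight into a Delta table. Schema, paths and
# trigger interval are illustrative; `spark` is the active session.
event_schema = StructType([
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw_stream = (spark.readStream
              .format("json")
              .schema(event_schema)
              .load("/mnt/landing/events"))

query = (raw_stream.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/lake/_checkpoints/events")
         .trigger(processingTime="1 minute")
         .start("/mnt/lake/events_delta"))

# The many small files this produces can then be compacted on a schedule -
# OPTIMIZE on Databricks, or a periodic repartition-and-rewrite in DeltaLake OSS.
```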

Given this, we can now stream data directly into analytics-optimised storage, which means we can then build transformations that use micro-batched compute directly on real-time data. One of the criticisms of the Lambda architecture was the need to maintain two separate data streams, the formal batch process and the "top-up" stream process - by doing everything in delta tables we're essentially combining the Lambda and Kappa architectures into something new.

Now… if I had a data source pumping out data on a regular basis, would I then spin through each batch and fire off events for each stream…? No, probably not, but it's worth thinking about. If we can move away from batch extracts from various systems and send regular micro-batches of change data, that certainly changes how we think about ETL. Whether the sessions at the Spark & AI Summit indicate a wider industry trend, or whether batch data processing simply doesn't make for as exciting a conference session, remains to be seen.

Scaling Clusters

Another interesting topic was that of cluster scaling. We've seen the Databricks autoscaling feature, which will dynamically add/remove nodes to a running spark cluster based on usage, but there are some additional considerations that came up:


  1. Optimised or Standard - one gotcha I was unaware of was the difference between Premium and Standard Databricks editions when it comes to autoscaling. Premium workspaces employ "optimised" autoscaling: this resizes the cluster dynamically, checking utilisation every 150 seconds (for interactive clusters), and can step down by several workers at once (ie: from 8 workers to 4). Standard mode, however, will only scale every 10 minutes and reduces by a single node at a time - if you have a large cluster, this can be painfully slow to react and cost a lot of money.


  2. Pools - If starting/stopping nodes in a cluster is still taking too long and causing unpredictable processing times, there's another option in Databricks that got a bit of the spotlight during the summit. It actually snuck in earlier this year, but I think it's worth restating how it works. We now have the idea of "instance pools" - essentially a bank of virtual machines that are kept warm (ie: turned on and costing money). These pools can then be used to start up a cluster quickly, usually in less than a minute. This makes a huge difference if you are constantly starting/stopping/scaling clusters, although it obviously comes at a cost. The idle VMs are cheaper than simply leaving a cluster turned on, but for me, they take away some of the point of using elastically scaling systems. There's a sketch of how pools and autoscaling show up in a cluster definition below.


In one mindset, they're fairly similar to elastic pools in Azure SQL DB. If you're committed to an overall spend across a couple of clusters, but want the flexibility to scale things differently within a fixed amount of shared resource, then instance pools will be perfect for you. Personally, I'd rather take the 4-5 minute hit on cluster startup and spend my budget on further functionality elsewhere.
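For reference, a minimal sketch of how autoscaling and instance pools show up in a cluster definition, using the Databricks Clusters REST API from Python - the workspace URL, access token, pool id and runtime version are all placeholders:

```python
import requests

# Sketch: create an autoscaling cluster attached to an instance pool via the
# Databricks Clusters API 2.0. Workspace URL, token, pool id and runtime
# version are placeholders.
cluster_spec = {
    "cluster_name": "etl-autoscaling",
    "spark_version": "6.1.x-scala2.11",
    "instance_pool_id": "<pool-id>",                    # worker nodes come from the warm pool
    "autoscale": {"min_workers": 2, "max_workers": 8},  # autoscaling range
    "autotermination_minutes": 30,
}

response = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)
print(response.json())
```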

Lake Optimisation

This was a big topic for me at the Spark & AI summit - I had a lot of good conversations, saw a lot of different approaches and learned a couple of tricks that I hadn't seen before. I'll be digging more into Delta and Parquet optimisation in a later blog, so keep an eye out for that.

Future of Data Engineering & Spark

Data Engineering is a funny topic in the world of Spark - for so long it was a subservient function to data science, tasked with preparing data for the data scientists to consume. As the world of Business Intelligence and traditional analytics catches up, data engineering has evolved into an important function in its own right, supporting a range of data consumers across organisations. What's great to see is that the Apache Spark story is evolving alongside it - we're seeing more and more focus on tools and functions within the engine that are about preparing and transforming data at scale.

It's clear that DeltaLake is central to much of this thinking and, as painful as integrations with the wider data landscape are at the moment, some of the functionality unlocked by Delta is worth it. We're expecting more and more out of the Delta community, and Databricks are pushing hard for it to be the de facto format of choice for Data Lakes.

With further engineering integrations in the pipeline, such as data processing pipelines themselves, it's an interesting time to be a Spark Data Engineer. We'll be keeping a close eye on Spark 3.0 as it nears public release and we'll be posting soon about DeltaLake and where we see it fitting into the wider Azure Analytics Platform.



Author

Simon Whiteley