At the Data + AI Summit, Simon delivered a session on “Achieving Lakehouse Models with Spark 3.0”. During the session there were a load of great questions. Here are the questions and answers from Simon. Drop a comment if you were in the session and have any follow-up questions!
It's a direction that most providers are heading in, albeit under the "unified analytics" or "modern warehouse" name rather than the "lakehouse". But most big relational engines are moving to bring in Spark/big data capabilities, and other lake providers are looking to expand their SQL coverage. It's a bit of a race to see who gets to the "can do both sides as well as a specialist tool" point first. Will we see other tools championing it as a "lakehouse", or is that term now tied too closely to Databricks as a vendor-specific term? We'll see...
Absolutely - there are a load of different tools and approaches we can use that are more native to lakes, but they're not going to be the first to mind for people coming from a SQL/warehousing background, and those people aren't going to be able to lift & shift their processes over. And it's all about accessibility for those additional personas.
When the lakehouse is fully matured, that's the plan - have a single source of data, whether it's for engineering, data science, ad-hoc analytics or traditional BI. We're getting there, but there are still some edge cases where you would need a mature relational engine, be it for some of the security features, tooling integration etc. - but as it matures, fewer and fewer of these edge cases remain. Fundamentally, though, they're serving the same purpose, so we're heading to the point where you have just one.
Yep, using the Hive metastore in most cases, especially when it's a Databricks-based architecture. We augment this with our own config db - just a lightweight metadata store that we use to hold different processing paths, transformation logic etc. An element of this includes comments/descriptions that are used to augment the Hive tables with additional info.
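As a rough illustration of that last point, here's a minimal PySpark sketch of pushing descriptions from a config store into table comments in the Hive metastore. The `meta.table_descriptions` table and its columns are hypothetical stand-ins, not the actual schema of the config db described above.

```python
# Minimal sketch: copy table descriptions from a (hypothetical) config table
# into Hive table properties so they show up in SHOW TBLPROPERTIES / catalog views.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("augment-hive-metadata")
    .enableHiveSupport()  # talk to the Hive metastore
    .getOrCreate()
)

# Assumed layout: one row per table, with `table_name` and `description` columns.
descriptions = spark.table("meta.table_descriptions").collect()

for row in descriptions:
    # Escape single quotes so the description is safe to embed in the SQL literal.
    comment = row["description"].replace("'", "''")
    spark.sql(
        f"ALTER TABLE {row['table_name']} "
        f"SET TBLPROPERTIES ('comment' = '{comment}')"
    )
```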
We stick with fully Azure-native tooling, so we use super basic Azure Data Factory pipelines - which in themselves aren't as dynamic as Airflow, Dagster etc. But we keep it super simple: ADF checks our metadata database, gets a list of things to kick off, and fires those tasks. We deliberately stay away from any more hardcoded pipelines. We had to strike a balance between choosing something our clients already have some knowledge of vs picking a tool with more functionality but a harder learning curve.
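For flavour, the pattern ADF follows here can be sketched as a metadata-driven loop. In ADF itself this would typically be a Lookup activity feeding a ForEach; the Python below only illustrates the shape of that loop, and the `meta.processing_paths` table, its columns and the `run_task` helper are all hypothetical.

```python
# Illustrative sketch of the metadata-driven orchestration pattern (not the
# actual ADF pipeline). ADF would achieve the equivalent with Lookup + ForEach.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata-driven-loop").getOrCreate()

def run_task(task_name: str, source_path: str, target_table: str) -> None:
    """Hypothetical dispatcher: in ADF each iteration would fire an activity
    (e.g. a notebook run) for the metadata row instead of this print."""
    print(f"Running {task_name}: {source_path} -> {target_table}")

# The config db holds one row per load to run; nothing is hardcoded in the pipeline.
tasks = spark.table("meta.processing_paths").where("enabled = true").collect()

for task in tasks:
    run_task(task["task_name"], task["source_path"], task["target_table"])
```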
Click the link to view the session