Analytics Engineering is a subject close to my heart. It has emerged as a new discipline over the last 18 months and perfectly encapsulates the value I feel I can bring to a data stack. My colleague Ust has previously blogged about Analytics Engineering, introducing the concept.
As a brief reminder, the way we think about Analytics Engineering is as the discipline that covers the final mile of a data platform.
An Analytics Engineer provides clean, transformed, analysis-ready data. They apply software engineering best practices to the curation of data, such as version control, testing, CI/CD, modularisation and reusability of code. They maintain data documentation and definitions. They enable users by provisioning analysis-ready data.
As a Fabric enthusiast, I’ve been exploring ways to apply Analytics Engineering principles in Fabric. And the answer isn’t necessarily straightforward. Setting aside Power BI Semantic Models for the time being (perhaps a blog for another day), there are multiple ways to create transformed, analysis-ready data.
So the idea in this blog is to explore a few of those options and try to steer you towards when and why you might use them, and which personas they may suit. To be clear about the problem I’m looking to solve here, I’m not talking about the onboarding and cleansing of data, but about the final layer of a Data Lakehouse, often referred to as the “Gold” layer (though we tend to prefer the term “Curated” at AA). Here data is modelled (our preference is to use dimensional modelling patterns to create star schemas) and made ready for analytics use cases and workloads.
The first method I’m going to cover is using Notebooks. If responsibility for building and maintaining your Curated Layer sits firmly with your Data Engineering team, then this code-first approach to Analytics Engineering could well be a good choice. For those card-carrying Spark fans who are comfortable working in Python, I’d argue that Notebooks may well be the most efficient method. Notebooks are highly parameterisable, lending themselves to metadata-driven approaches, and with support for custom Python libraries also available, they give you the opportunity to write custom Python wheels with functions that further support Don’t Repeat Yourself (DRY) programming principles. If you’re an experienced Lakehouse nut, perhaps coming from a Databricks background, this will give you a set of development patterns and a development experience similar to a world you’re already familiar with.
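To make that a little more concrete, here’s a minimal sketch of what a parameterised, metadata-driven notebook cell might look like. Everything in it is an assumption for illustration - the table names, the parameters and the `build_dimension` helper - and `spark` is the session a Fabric notebook provides for you.

```python
# Hypothetical Fabric notebook cell: a parameterised, metadata-driven dimension build.
# Table names, parameters and the helper are illustrative assumptions, not a real project.
from pyspark.sql import DataFrame, functions as F

# In Fabric, these values could be supplied by a pipeline via a parameter cell.
source_table = "cleansed.customers"      # assumed cleansed-layer table
target_table = "curated.dim_customer"    # assumed curated-layer dimension
business_keys = ["customer_id"]          # assumed natural key column(s)

def build_dimension(df: DataFrame, keys: list) -> DataFrame:
    """Reusable helper (the sort of function you might package in a custom wheel):
    de-duplicates on the business key and stamps an audit column."""
    return (
        df.dropDuplicates(keys)
          .withColumn("curated_at", F.current_timestamp())
    )

# `spark` is the session a Fabric notebook gives you out of the box.
dim = build_dimension(spark.read.table(source_table), business_keys)
dim.write.mode("overwrite").format("delta").saveAsTable(target_table)
```

In a genuinely metadata-driven framework those hard-coded parameters would come from a control table or be passed in by a pipeline, which is exactly where the DRY helpers packaged in a wheel start to pay off.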
If method one is perhaps the “pro-code” option then method two, using Dataflows, fits the bill as the “low-code” option. The Citizen Engineering movement of the last few years has already seen analysts start to flex their ETL muscles, either with Power BI Desktop or by using the first generation of dataflows made available in the Power BI Service. I’ve used Dataflows (Gen 1) in the past with smaller clients who didn’t want to invest in a full data platform, using them as a pseudo data warehouse. You can now have the best of both worlds, with the low-code Power Query Online experience able to save data in delta format in OneLake. I’ve used Dataflows Gen 2 a fair bit in Fabric - I do have some question marks over their scalability, and find the lack of support for functions and parameterisation a little frustrating, but they do allow you to get up and running pretty quickly and, at the time of writing, are the only option for connecting to on-premises or privately networked data. They definitely have their place.
Long before I’d heard of a Data Lakehouse, I’d have described myself as a full stack Business Intelligence (BI) developer. I was a SQL nerd (well, DAX too, to be fair) and spent a lot of my time with my head in SQL Server and/or Azure SQL Database, the de facto Data Warehouse platforms of Microsoft fanboy data professionals back in the day. Option three, using SQL Projects with the Fabric Data Warehouse experience, gives you the opportunity to relive that experience. That wasn’t intended to sound derogatory - there must be millions and millions of hours of combined experience across the globe that have contributed to well-established good practices in this set-up. You can use Visual Studio or Azure Data Studio to manage your project, and even things like CI/CD paths are well trodden too. This approach allows you to take the plunge into the Lakehouse without the immediate need (or with minimal need, at least) to learn new skills and techniques.
I like to drink craft beer, I have a big beard and a vast collection of plaid shirts… so I have a soft spot for this option. Option four uses dbt (Data Build Tool), perhaps the pioneer of Analytics Engineering. This is another option that uses the Data Warehouse experience in Fabric. The dbt adapter for use with Fabric was announced back in August and I’ve been playing with dbt Core, the free version of dbt, ever since. I like it. It reminds me of working with a SQL Project but gives me added features such as lineage, documentation, automated testing and macros. In November, support for dbt Cloud was also announced. The unanswered questions I have with dbt so far are around scalability and orchestration - the dbt paradigm is to build tables in full on every pipeline run, and although incremental loading is possible, it’s not the default. And unlike Databricks, Fabric has no easy native way to run dbt Core jobs; dbt Cloud does have scheduling built in, but it comes at an extra cost. I’m enjoying continuing to explore this option though, and it is perhaps a good choice if either you’re already familiar with dbt or you’re looking to evolve your BI Dev skills without turning into a big ol’ sparky boy.
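As an illustration of the orchestration point only: dbt Core 1.5+ ships a programmatic `dbtRunner` API, so one possible (distinctly non-native) workaround is to call it from a scheduled Fabric notebook. This is a sketch under assumptions - dbt and the Fabric adapter installed in the environment, and a dbt project plus profiles.yml already sitting at the hypothetical paths shown.

```python
# Sketch: invoking dbt Core programmatically from a notebook cell.
# Assumes dbt-core 1.5+ and the Fabric adapter are installed, and that a dbt
# project and profiles.yml exist at the (hypothetical) paths below.
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Equivalent to running `dbt build` from the command line.
result: dbtRunnerResult = dbt.invoke([
    "build",
    "--project-dir", "/lakehouse/default/Files/dbt_project",   # assumed project location
    "--profiles-dir", "/lakehouse/default/Files/dbt_project",  # assumed profiles.yml location
])

if not result.success:
    raise RuntimeError(f"dbt build failed: {result.exception}")
```

Whether that counts as “easy” is debatable, which is rather my point.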
Is there such a thing as too much choice? I haven’t even covered some other options, such as using shortcuts to alternative data lakes or using products outside of Fabric to land data directly. But surely being able to choose a path that suits your needs best is a good thing? I’ve tried to steer away from being prescriptive here and saying whether I think a particular method is “the best” - they each have their own pros and cons, and ultimately the “right” choice will differ across varying organisations. I think that’s better than being confined to only one happy path. And who’s to say you even have to choose one option? Especially in a world of decentralised data teams, Fabric gives you the option to mix and match depending on the preferences of the data professionals working on a given data product. I think this plays nicely into the “discipline at the core, flexibility at the edges” message that Microsoft has previously championed with regard to citizen data engineers, with room for highly complex frameworks and smaller tactical approaches all available within one platform.
I’d be fascinated to hear others’ opinions on this, and about your experiences working with Fabric. How do you envisage data curation happening in your implementations?
Want to hear me harp on about Analytics Engineering and Fabric some more? I’ll be speaking on the topic at Fabric February in Oslo in February (surprisingly!). And if you want help deciding your best curation strategy for Fabric, of course don’t hesitate to get in touch with Advancing Analytics by emailing us at hello@advancinganalytics.co.uk.