Reference architectures are great! You've got all of the key components in there, nice and clear, with colourful lines showing how data moves through each stage, product, or service. They're great for a slide deck or a proposal to retire that old creaking data warehouse and move to a shiny new Data Lakehouse.
Not so great, however, for the finer details demanded by security operations teams.
Diagrams like the one above are perfect for demonstrating a technology stack and summarising a platform, but what about network security? What about all the public endpoints and standard ports sitting there in the PUBLIC cloud?
I'm not aiming for clickbait here. In reality, the vast majority of Azure-based resources and services are protected through Azure Active Directory (AAD) Role-based Access Control (RBAC), on top of the ability to set up IP whitelisting. That is just fine for many organisations and use cases, BUT it won't put every security team at ease.
For the rest, there are many more security layers and features we can apply. I will walk through how Data Lake Storage and Azure SQL differ from Databricks and how Data Factory should be secured, and pick out the features that are just going to cause you pain. Let's start with the foundations for all of those services.
Thinking about the architecture above, one of the most common ways to secure these resources is by wrapping them all up in a virtual network (VNet). This gives us the ability to control what traffic comes into our network and what traffic (if any) goes out of it. That VNet can then be peered to other VNets and your internal network to facilitate connectivity, with subnets inside for specific resource types.
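To make the "subnets inside for specific resource types" idea a little more concrete, here is a minimal sketch of carving a VNet address space into one subnet per resource type using Python's standard ipaddress module. The address space, the subnet names, and the plan_subnets helper are all hypothetical and chosen purely for illustration; your network team will have its own ranges and naming rules.

```python
import ipaddress

# Hypothetical address space and subnet names, not taken from any real deployment.
VNET_CIDR = "10.10.0.0/16"
SUBNET_NAMES = ["databricks", "private-endpoints", "integration-runtime", "build-agents"]


def plan_subnets(vnet_cidr, names, new_prefix=24):
    """Assign one non-overlapping block of size /new_prefix to each name."""
    vnet = ipaddress.ip_network(vnet_cidr)
    # zip() stops silently at the shorter input, so check capacity up front.
    capacity = 2 ** (new_prefix - vnet.prefixlen)
    if len(names) > capacity:
        raise ValueError(f"{vnet} only fits {capacity} subnets of size /{new_prefix}")
    # subnets() yields consecutive, non-overlapping blocks inside the VNet.
    return dict(zip(names, vnet.subnets(new_prefix=new_prefix)))


plan = plan_subnets(VNET_CIDR, SUBNET_NAMES)
for name, block in plan.items():
    print(f"{name:20} {block}")
```

This only plans address ranges on paper. Creating the VNet, the subnets, and any traffic rules would still happen through your usual deployment tooling.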
Now, I'm not a network engineer and exact configurations will always differ across organisations, so your implementation may vary. Microsoft's Azure Networking Architectures is a good place to start if this is all new to you.
That's going to look a little like this if we are building out a diagram for our secure Azure Data Platform:
With this baseline, we've now inadvertently started restricting the tools we can work with and how we can deploy some resources. This is the biggest risk to that initial reference architecture.
Once you start digging deeper and securing the platform to meet your organisation's security policies, some features start to go away and others need much more complex implementations.
The simplicity of using Azure DevOps for deployments is one of the reasons I've rarely ventured away from it. Sure, it's frustrating getting YAML files right, but just hitting run and letting Microsoft worry about the build agent makes up for that.
That happy path isn't possible when working with deployments inside VNets, as described in the Networking section on this Microsoft Docs page. It's possible to stick with a Microsoft-hosted build agent, but you're left opening and updating IP ranges every week! A self-hosted build agent becomes a much simpler approach, which means we'd need to provision one or more virtual machines within our VNet to use as build agents. That's additional cost, resources, and administration. It's not an insurmountable task, but it's an oft-overlooked one.
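To give a feel for the weekly chore of keeping firewall rules in step with Microsoft-hosted agent IP ranges, here is a rough sketch that compares the ranges currently allowed against a newly published list and reports what would need adding and removing. The diff_ranges helper is hypothetical, and the CIDR blocks are documentation placeholder ranges, not real Azure addresses. How you obtain the published list and apply the changes is left out.

```python
import ipaddress


def diff_ranges(current_rules, published_ranges):
    """Return (to_add, to_remove) as sorted lists of networks."""
    current = {ipaddress.ip_network(c) for c in current_rules}
    published = {ipaddress.ip_network(c) for c in published_ranges}

    def order(net):
        # Sort IPv4 before IPv6; the two versions cannot be compared directly.
        return (net.version, net)

    to_add = sorted(published - current, key=order)
    to_remove = sorted(current - published, key=order)
    return to_add, to_remove


# Placeholder data: what the firewall allows today vs. this week's published list.
allowed_today = ["192.0.2.0/24", "198.51.100.0/24"]
published_this_week = ["198.51.100.0/24", "203.0.113.0/24"]

to_add, to_remove = diff_ranges(allowed_today, published_this_week)
print("add:", [str(n) for n in to_add])
print("remove:", [str(n) for n in to_remove])
```

Even with something like this automated, it remains one more moving part to own, which is part of why the self-hosted agent tends to win out.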
In the rest of this series, I'll look at all of the resources common to our shiny new Data Lakehouse platform architecture and what you need to think about to get it past your security team.