Azure Databricks
Azure Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale. The Databricks Data Intelligence Platform integrates with cloud storage and security in your cloud account and manages and deploys cloud infrastructure on your behalf.
Azure Databricks provides tools that help you connect your sources of data to one platform to process, store, share, analyze, model, and monetize datasets with solutions from BI to generative AI.
The Azure Databricks workspace provides a unified interface and tools for most data tasks, including:
Data processing scheduling and management, in particular ETL
Generating dashboards and visualizations
Managing security, governance, high availability, and disaster recovery
Data discovery, annotation, and exploration
Machine learning (ML) modeling, tracking, and model serving
Generative AI solutions
VNET Injection
Make sure Databricks is deployed within your own VNET (instead of the Azure-managed VNET). This is also known as "VNET Injection".
A Databricks workspace deployment in Azure can be logically divided into a data plane and a control plane.
The control plane is deployed in an Azure-managed subscription and consists of the Databricks web application, cluster management, etc.
The data plane is deployed within the customer subscription, and this is where the actual data is processed.
By default, Azure deploys the entire infrastructure, including a new VNET and the required subnets, in a locked resource group that cannot be altered once deployed.
Deploying ADB resources within our own VNET, which is called VNET Injection, provides additional security flexibility, such as connecting to other Azure resources via private endpoints, routing traffic via an NVA (network virtual appliance) to inspect network traffic, and defining controls over egress traffic.
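As a rough illustration of VNET Injection, the sketch below builds the parameters object that a workspace deployment template would carry. The property names `customVirtualNetworkId`, `customPublicSubnetName`, and `customPrivateSubnetName` are the ones used by the Microsoft.Databricks/workspaces resource; the helper name, the subnet names, and the placeholder resource ID are assumptions made for this example.

```python
def vnet_injection_parameters(vnet_resource_id: str,
                              public_subnet: str,
                              private_subnet: str) -> dict:
    """Build workspace parameters that point the data plane at our own VNET.

    Hypothetical helper: it only assembles a dictionary and does not call Azure.
    """
    if public_subnet == private_subnet:
        # VNET Injection needs two separate subnets dedicated to Databricks.
        raise ValueError("the two Databricks subnets must be distinct")
    return {
        "customVirtualNetworkId": {"value": vnet_resource_id},
        "customPublicSubnetName": {"value": public_subnet},
        "customPrivateSubnetName": {"value": private_subnet},
    }


# Placeholder values, not real resources.
params = vnet_injection_parameters(
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Network/virtualNetworks/<vnet>",
    "adb-public",
    "adb-private",
)
```

The resulting dictionary would sit under `properties.parameters` of the workspace resource in a deployment template.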
No Public IP (NPIP)
The ADB workspace should be limited to the private network only.
We can deploy the workspace with the "No Public IP" option enabled. This is called secure cluster connectivity, also known as the No Public IP (NPIP) implementation.
When an ADB cluster starts, it initiates a connection from the data plane to the control plane over a secure relay network.
We combine VNET injection with NPIP, and as a result both Databricks subnets will be private.
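A minimal sketch of how the combination of VNET injection and NPIP could be checked before deployment. It assumes the workspace parameters are available as a dictionary in the ARM `{"name": {"value": ...}}` shape; `enableNoPublicIp` is the workspace parameter behind the No Public IP option, while the function name is hypothetical.

```python
VNET_KEYS = (
    "customVirtualNetworkId",
    "customPublicSubnetName",
    "customPrivateSubnetName",
)


def is_fully_private(parameters: dict) -> bool:
    """Return True only when VNET injection and NPIP are both configured."""

    def value(name):
        # Missing parameters are treated as "not set".
        return parameters.get(name, {}).get("value")

    has_vnet_injection = all(value(key) for key in VNET_KEYS)
    has_npip = value("enableNoPublicIp") is True
    return has_vnet_injection and has_npip


# Illustrative parameter sets with placeholder values.
injected_only = {key: {"value": "placeholder"} for key in VNET_KEYS}
injected_and_npip = dict(injected_only, enableNoPublicIp={"value": True})
```

A check like this could run in a deployment pipeline to reject workspaces whose cluster nodes would still receive public IP addresses.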
Routing requests via Network Virtual Appliance (NVA)
Azure Firewall with custom user-defined routes (UDRs) forwarding requests from the subnets.
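To make the UDR idea concrete, the sketch below builds a single route entry that sends all outbound traffic from the Databricks subnets to the firewall's private IP address. The fields `addressPrefix`, `nextHopType`, and `nextHopIpAddress` follow the Azure route table schema; the helper name, the route name, and the IP address are assumptions for illustration.

```python
import ipaddress


def default_route_via_nva(nva_private_ip: str, name: str = "to-firewall") -> dict:
    """Build a route that forwards 0.0.0.0/0 to a network virtual appliance."""
    address = ipaddress.ip_address(nva_private_ip)  # raises ValueError if malformed
    if not address.is_private:
        raise ValueError("the NVA next hop is expected to be a private address")
    return {
        "name": name,
        "properties": {
            "addressPrefix": "0.0.0.0/0",
            "nextHopType": "VirtualAppliance",
            "nextHopIpAddress": nva_private_ip,
        },
    }


# 10.0.1.4 is an illustrative firewall address, not taken from these notes.
route = default_route_via_nva("10.0.1.4")
```

A route table containing this entry would be associated with both Databricks subnets. The firewall rules then still need to permit the destinations the clusters depend on, such as the control plane endpoints.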
Workspace - User Login Page
Notebooks - Very similar to Jupyter notebooks. We can create and develop code using multiple programming languages, and we can integrate with a GitLab repository (Git control).
Tables - Data Explorer or Tables - very similar to traditional Databases
Clusters - Provide access - accessing the data - scale up and scale down and handle
Libraries - We can install different libraries for Python, Java, etc. (how we can install the libraries)
Store -> Process -> Serve
Databricks Community Edition and Azure Databricks Premium Edition
1. Sign up for the free Databricks Community Edition -
A driver and its executors together are called a cluster.
Notes: Create an App Registration, create its secrets, and store the secrets in Azure Key Vault.
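One common use of the App Registration note above is reading the service principal credentials from a Key Vault-backed secret scope inside a notebook, rather than hard-coding them. The sketch below assumes the goal is OAuth access to Azure Data Lake Storage. The `fs.azure.*` keys are the standard ABFS driver settings and `dbutils.secrets.get(scope=..., key=...)` is the notebook secrets call; the scope name, key names, and storage account are hypothetical. A small stand-in for `dbutils` is included so the sketch also runs outside Databricks.

```python
from types import SimpleNamespace


def adls_oauth_conf(dbutils, storage_account: str, tenant_id: str,
                    scope: str, client_id_key: str, client_secret_key: str) -> dict:
    """Return Spark settings for service principal access to ADLS Gen2."""
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        # Secrets are read at call time and never written into the notebook.
        f"fs.azure.account.oauth2.client.id.{suffix}":
            dbutils.secrets.get(scope=scope, key=client_id_key),
        f"fs.azure.account.oauth2.client.secret.{suffix}":
            dbutils.secrets.get(scope=scope, key=client_secret_key),
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


class _FakeSecrets:
    """Local stand-in for dbutils.secrets; in a notebook, use the real dbutils."""

    def get(self, scope: str, key: str) -> str:
        return f"<{scope}/{key}>"


fake_dbutils = SimpleNamespace(secrets=_FakeSecrets())
conf = adls_oauth_conf(fake_dbutils, "mystorageacct", "<tenant-id>",
                       "kv-scope", "sp-client-id", "sp-client-secret")
```

In a notebook, each pair would then be applied with `spark.conf.set(key, value)`.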
How to secure your notebook and cluster?
Azure Data Services Integration
1. Azure Data Factory
2. Azure Data Lake Storage
3. Power BI
4. Azure Synapse Analytics
Azure AD integration
SOC2 Type2 Reports Required
Security Testing
Security Scanning
DPO
Azure Databricks control plane - runs in a Microsoft-managed subscription
Control plane and data plane encryption is required. - TLS 1.2
No data transfer is necessary
Network Security
Identity and Access
Compliance
Data Protection
With the secure cluster connectivity feature, Azure Databricks cluster nodes do not have any public IP addresses, and no inbound rules are required from the control plane to the data plane.
All connections from the data plane are outbound-only to the control plane, using a scalable relay that is hosted in the control plane. (This feature is available in the Standard and Premium tiers for both VNET-injected and managed-VNET workspaces.)
It integrates with IAM and AAD for identity, KMS/Key Vault for encryption of data, STS for access tokens, and security groups/NSGs for instance firewalls. This gives enterprises control over their trust anchors and lets them centralize their access control policies in one place and extend them to Databricks seamlessly.
Bring your own network - Deploy the workspace in your own enterprise-managed virtual network so that you can make the customizations required by your network security team.
Enable secure cluster connectivity - Deploy your Azure Databricks workspace in private subnets without any inbound access to your network. Clusters will utilize a secure connectivity mechanism to communicate with the Azure Databricks infrastructure, without requiring public IP addresses for the nodes.
Control which networks are allowed to access a workspace - Configure allow-lists and block-lists to control the networks that are allowed to access your Azure Databricks workspace.
Trust but verify with Azure Databricks - Get visibility into relevant platform activity in terms of who's doing what and when, by configuring Azure Databricks Diagnostic Logs and other related audit logs in the Azure Cloud.
Securely accessing Azure Data sources from Azure Databricks - Understand the different ways of connecting Azure Databricks clusters in your private virtual network to your Azure Data Sources in a cloud-native secure manner.
Data exfiltration protection with Azure Databricks -
Enable customer-managed key for managed services - Azure Databricks notebooks are stored in the scalable management layer powered by Microsoft, and are by default encrypted with a Microsoft-managed key. You can also bring your own per-workspace key to encrypt the notebooks.
Enable customer-managed key for DBFS - Azure Databricks creates a root storage account (DBFS) per workspace in the customer's subscription. By default, the storage account is encrypted with a Microsoft-managed key. You can also bring your own key to encrypt the DBFS storage account.
Simplify data lake access with Azure AD Credential Passthrough -
Authenticate using Azure Active Directory tokens -
Token management for Personal Access Tokens -
Azure Databricks is HITRUST CSF Certified -
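For the item above on controlling which networks are allowed to access a workspace, the sketch below prepares a request body for the workspace IP access list REST endpoint (`POST /api/2.0/ip-access-lists`). The body fields `label`, `list_type`, and `ip_addresses` follow that API; the helper name, the label, and the address ranges (documentation-reserved ranges) are assumptions for illustration. The HTTP call itself is left out because it needs a workspace URL and a token.

```python
import ipaddress
import json


def ip_access_list_payload(label: str, list_type: str, cidrs: list) -> str:
    """Validate the addresses and return the JSON body for an IP access list."""
    if list_type not in ("ALLOW", "BLOCK"):
        raise ValueError("list_type must be 'ALLOW' or 'BLOCK'")
    # ip_network raises ValueError on malformed input, so bad entries fail early.
    normalized = [str(ipaddress.ip_network(cidr, strict=False)) for cidr in cidrs]
    return json.dumps({
        "label": label,
        "list_type": list_type,
        "ip_addresses": normalized,
    })


body = ip_access_list_payload(
    "corporate-egress", "ALLOW", ["203.0.113.0/24", "198.51.100.7"]
)
```

Validating the ranges before sending the request keeps a typo from producing a rejected or unintended list.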