
✈️Azure Databricks

Azure Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale. The Databricks Data Intelligence Platform integrates with cloud storage and security in your cloud account and manages and deploys cloud infrastructure on your behalf.

What is Azure Databricks used for?

Azure Databricks provides tools that help you connect your sources of data to one platform to process, store, share, analyze, model, and monetize datasets with solutions from BI to generative AI.

The Azure Databricks workspace provides a unified interface and tools for most data tasks, including:

  • Data processing scheduling and management, in particular ETL

  • Generating dashboards and visualizations

  • Managing security, governance, high availability, and disaster recovery

  • Data discovery, annotation, and exploration

  • Machine learning (ML) modeling, tracking, and model serving

  • Generative AI solutions

VNET Injection

Deploy Databricks within your own VNET rather than in the Azure-managed VNET. This is called “VNET Injection”.

A Databricks workspace deployment in Azure can be logically divided into a data plane and a control plane.

The control plane is deployed in an Azure-managed subscription and consists of the Databricks web app, cluster management, etc.

The data plane is deployed within the customer subscription, and this is where the actual data is processed.

By default, Azure deploys the entire infrastructure, including a new VNET and the required subnets, in a locked resource group that cannot be altered once deployed.

Deploying ADB resources within our own VNET (VNET Injection) provides additional security flexibility, such as connecting to other Azure resources via private endpoints, routing traffic through an NVA (network virtual appliance) to inspect network traffic, and defining controls over egress traffic.
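As a minimal sketch, VNET injection is typically requested through the workspace's deployment parameters. The parameter names below (`customVirtualNetworkId`, `customPublicSubnetName`, `customPrivateSubnetName`) are the ones commonly used in ARM templates for `Microsoft.Databricks/workspaces`; the helper function, the subnet names, and the placeholder resource ID are hypothetical and only for illustration.

```python
def vnet_injection_parameters(vnet_id, public_subnet, private_subnet):
    """Build the `parameters` section of a workspace definition for VNET injection.

    Hypothetical helper: it only assembles a dictionary, it does not call Azure.
    """
    if public_subnet == private_subnet:
        raise ValueError("the two Databricks subnets must be different")
    return {
        "customVirtualNetworkId": {"value": vnet_id},
        "customPublicSubnetName": {"value": public_subnet},
        "customPrivateSubnetName": {"value": private_subnet},
    }


# Placeholder values; replace them with your own VNET resource ID and subnet names.
params = vnet_injection_parameters(
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>",
    "adb-public-subnet",
    "adb-private-subnet",
)
```

The resulting dictionary could then be placed under `properties.parameters` of the workspace resource in a template; check the current template reference before relying on these names.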

No Public IP (NPIP)

The ADB workspace should be limited to the private network only.

We can deploy the workspace with the “No Public IP” option enabled. This is called secure cluster connectivity, also known as the No Public IP (NPIP) implementation.

When an ADB cluster starts, it initiates a connection from the data plane to the control plane over a secure relay network.

We combine VNET injection with NPIP, and as a result both Databricks subnets are private.
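The combination of VNET injection and NPIP can be checked on a workspace definition with a small validator. This is a sketch under assumptions: it expects a parameters dictionary shaped like an ARM template's `properties.parameters` section, with `enableNoPublicIp` and `customVirtualNetworkId` as the relevant keys; the function name is hypothetical.

```python
def is_private_workspace(parameters):
    """Return True when the parameters request both VNET injection and NPIP."""
    # VNET injection: a customer-owned VNET resource ID is supplied.
    vnet_injected = bool(parameters.get("customVirtualNetworkId", {}).get("value"))
    # NPIP / secure cluster connectivity: the flag is explicitly set to True.
    npip = parameters.get("enableNoPublicIp", {}).get("value") is True
    return vnet_injected and npip


example = {
    "customVirtualNetworkId": {"value": "/subscriptions/<sub-id>/.../virtualNetworks/<vnet>"},
    "enableNoPublicIp": {"value": True},
}
print(is_private_workspace(example))
```

A check like this could run in a deployment pipeline to reject workspace definitions that omit either setting.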

Routing requests via Network Virtual Appliance (NVA)

Use Azure Firewall as the NVA, with custom user-defined routes (UDRs) that forward requests from the Databricks subnets to it.
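A user-defined route of this kind can be sketched as a plain data structure. The property names (`addressPrefix`, `nextHopType`, `nextHopIpAddress`) follow the Azure route resource schema; the helper name, the route name, and the firewall IP address are assumptions made for illustration.

```python
import ipaddress


def default_route_via_nva(nva_private_ip, name="to-firewall"):
    """Build a UDR entry that sends all outbound traffic to the NVA or firewall."""
    ip = ipaddress.ip_address(nva_private_ip)  # raises ValueError on malformed input
    if not ip.is_private:
        raise ValueError("the NVA next hop is expected to be a private address")
    return {
        "name": name,
        "properties": {
            "addressPrefix": "0.0.0.0/0",       # match all destinations
            "nextHopType": "VirtualAppliance",  # hand traffic to the NVA
            "nextHopIpAddress": nva_private_ip,
        },
    }


# Hypothetical firewall address inside the hub network.
route = default_route_via_nva("10.0.1.4")
```

The route table holding this entry would then be associated with both Databricks subnets so that egress traffic passes through the firewall.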

Five Key Databricks UI components

  1. Workspace - the user login page

  2. Notebooks - very similar to Jupyter notebooks. We can develop code in multiple programming languages and integrate with a GitLab repository for version control.

  3. Tables - Data Explorer or Tables; very similar to traditional databases

  4. Clusters - provide access to the data and can scale up and down

  5. Libraries - where we install libraries for Python, Java, etc.

Dataflow for Databricks

Store -> Process -> Serve

Databricks Community Edition and Azure Databricks Premium Edition

1. Sign up for the free Databricks Community Edition

A driver and its executors together are called a cluster.
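The driver and executor roles can be illustrated with a toy example. This is not Spark code: it is a plain-Python sketch in which a thread pool stands in for the executors and the calling function plays the driver, splitting the data into partitions and combining the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(data, num_executors=3):
    """Toy driver: partition the data, let 'executors' sum each part, combine."""
    # Driver side: split the data into one partition per executor.
    partitions = [data[i::num_executors] for i in range(num_executors)]
    # Executor side: each worker computes a partial sum of its partition.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partials = list(pool.map(sum, partitions))
    # Driver side: merge the partial results into the final answer.
    return sum(partials)


total = run_job(list(range(1, 101)))
print(total)  # 5050
```

On a real cluster the executors run on separate worker nodes and Spark handles the partitioning and scheduling.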

Notes: Create an app registration, create its secrets, and store the secrets in Azure Key Vault.
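One common use of such an app registration is OAuth access to Azure Data Lake Storage. The sketch below assumes that use case, which the note does not state. It builds the Spark configuration keys used by the ABFS driver for client-credential authentication; the helper name and all placeholder values are hypothetical.

```python
def adls_oauth_conf(storage_account, tenant_id, client_id, client_secret):
    """Build Spark configuration entries for service-principal access to ADLS Gen2."""
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# In a notebook the secret would normally come from a Key Vault-backed secret scope,
# e.g. dbutils.secrets.get(scope="<scope>", key="<key>"), never from a literal.
conf = adls_oauth_conf("<storage-account>", "<tenant-id>", "<client-id>", "<secret-from-key-vault>")
# In a notebook: for key, value in conf.items(): spark.conf.set(key, value)
```

Keeping the secret lookup outside the helper means the secret value never needs to appear in notebook source.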

How to secure your notebook and cluster?

Azure Data Services Integration

1. Azure Data Factory

2. Azure Data Lake Storage

3. Power BI

4. Azure Synapse Analytics

Azure AD integration

SOC2 Type2 Reports Required

Security Testing

Security Scanning

DPO

Azure Databricks control plane - runs in a Microsoft subscription

Encryption of control plane and data plane traffic is required (TLS 1.2).
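On the client side, a TLS 1.2 minimum can be enforced when calling workspace endpoints from a script. This is a sketch using Python's standard `ssl` module; the function name is hypothetical, and nothing here changes how the platform itself encrypts traffic.

```python
import ssl


def tls12_client_context():
    """Create a client SSL context that refuses anything older than TLS 1.2."""
    ctx = ssl.create_default_context()            # certificate and hostname checks on
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # reject TLS 1.0 and 1.1
    return ctx


ctx = tls12_client_context()
```

The context can be passed to `urllib.request.urlopen(url, context=ctx)` or to any HTTP client that accepts an `ssl.SSLContext`.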

No data transfer is necessary

Network Security

Identity and Access

Compliance

Data Protection

With the secure cluster connectivity feature, Azure Databricks cluster nodes do not have public IPs, and no inbound rules are required from the control plane to the data plane.

All connections from the data plane are outbound only, to the control plane, using a scalable relay hosted in the control plane. (This feature is available in the Standard and Premium tiers for both VNET-injected and managed-VNET workspaces.)

Security that Unblocks the True Potential of your Data Lake

  • It integrates with IAM/AAD for identity, KMS/Key Vault for encryption of data, STS for access tokens, and security groups/NSGs for instance firewalls. This gives enterprises control over their trust anchors and lets them centralize their access control policies in one place and extend them to Databricks seamlessly.

  • Isolate the environment

  • Bring your own network - Deploy the workspace in your own enterprise-managed virtual network so that you can make the customizations required by your network security team.

  • Enable secure cluster connectivity - Deploy your Azure Databricks workspace in private subnets without any inbound access to your network. Clusters will utilize a secure connectivity mechanism to communicate with the Azure Databricks infrastructure, without requiring public IP addresses for the nodes.

  • Control which networks are allowed to access a workspace - Configure allow-lists and block-lists to control the networks that are allowed to access your Azure Databricks workspace.

  • Trust but verify with Azure Databricks - Get visibility into relevant platform activity in terms of who’s doing what and when, by configuring Azure Databricks Diagnostic Logs and other related audit logs in the Azure Cloud.

  • Securely accessing Azure Data sources from Azure Databricks - Understand the different ways of connecting Azure Databricks clusters in your private virtual network to your Azure Data Sources in a cloud-native secure manner.

  • Data exfiltration protection with Azure Databricks -

  • Enable customer-managed key for managed services - Azure Databricks notebooks are stored in the scalable management layer powered by Microsoft and are encrypted by default with a Microsoft-managed key. You can also bring your own per-workspace managed key to encrypt the notebooks.

  • Enable customer-managed key for DBFS - Azure Databricks creates a root storage account (DBFS) per workspace in the customer’s subscription. By default, the storage account is encrypted with a Microsoft-managed key. You can also bring your own managed key to encrypt the DBFS storage account.

  • Simplify data lake access with Azure AD Credential Passthrough -

  • Authenticate using Azure Active Directory tokens -

  • Token management for Personal Access Tokens -

  • Azure Databricks is HITRUST CSF Certified -
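The allow-list and block-list control mentioned in the list above can be sketched as a simple evaluation function. It assumes that a block-list match always wins and that an empty allow-list means no restriction; both are labeled assumptions about the evaluation order, and the function name and the example ranges are hypothetical.

```python
import ipaddress


def is_allowed(client_ip, allow_list, block_list):
    """Decide whether a client IP may reach the workspace."""
    ip = ipaddress.ip_address(client_ip)

    def matches(cidrs):
        # strict=False accepts entries such as "10.1.2.3/24" or a single address.
        return any(ip in ipaddress.ip_network(cidr, strict=False) for cidr in cidrs)

    if matches(block_list):
        return False  # assumption: block-list entries take precedence
    # assumption: with no allow-list configured, every non-blocked IP is allowed
    return not allow_list or matches(allow_list)


print(is_allowed("10.1.2.3", ["10.1.0.0/16"], []))               # True
print(is_allowed("10.1.2.3", ["10.1.0.0/16"], ["10.1.2.0/24"]))  # False
```

In practice these lists are configured on the workspace itself; a function like this is only useful for reasoning about or testing a planned configuration.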
