diff --git a/e2e_samples/fabric_dataops_sample/libraries/test/ddo_transform/data/README.md b/e2e_samples/fabric_dataops_sample/libraries/test/ddo_transform/data/README.md new file mode 100644 index 000000000..fc7f1ea9c --- /dev/null +++ b/e2e_samples/fabric_dataops_sample/libraries/test/ddo_transform/data/README.md @@ -0,0 +1,13 @@ +# Data Generation + +The data in the files below is generated using Python and the Faker library. + +- parking_bay_data.json +- parking_sensor_data.json + +The data includes dummy/fake records for testing and development purposes. The latitude and longitude coordinates are confined to the approximate area of the Microsoft Redmond campus. + +This data will be used to demonstrate: + +- Ingestion, standardization, and transformation steps of data engineering pipelines. +- Writing unit test cases for Python and PySpark transformation code. diff --git a/e2e_samples/parking_sensors/README.md b/e2e_samples/parking_sensors/README.md index 56e0c7c5d..6bea5e7ce 100644 --- a/e2e_samples/parking_sensors/README.md +++ b/e2e_samples/parking_sensors/README.md @@ -224,7 +224,7 @@ Follow the setup prerequisites, permissions, and deployment environment options. 2. [Azure Account](https://azure.microsoft.com/en-us/free/) If you do not have one already, create an Azure Account. - *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. 3. [Azure DevOps Project](https://azure.microsoft.com/en-us/products/devops/) : Follow the documentation to create a new project, or use an existing project you wish to deploy these resources to. - - *Permissions needed*: ability to create [service connections](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml), [pipelines](https://docs.microsoft.com/en-us/azure/devops/pipelines/get-started/pipelines-get-started?view=azure-devops&tabs=yaml) and [variable groups](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=yaml). + - *Permissions needed*: ability to create [Service Connections](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml), [pipelines](https://docs.microsoft.com/en-us/azure/devops/pipelines/get-started/pipelines-get-started?view=azure-devops&tabs=yaml) and [variable groups](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=yaml), as well as the *Manage Project Properties* permission as an [Endpoint Administrator](https://learn.microsoft.com/en-us/azure/devops/pipelines/policies/permissions?view=azure-devops#set-service-connection-security-in-azure-pipelines). #### Deployment Options @@ -294,16 +294,6 @@ Set up the environment variables as specified, fork the GitHub repository, and l **Login and Cluster Configuration** - Ensure that you have completed the configuration for the variables described in the previous section, titled **Configuration: Variables and Login**. - - - This configuration will be used during the environment deployment process to facilitate login. 
- - Create a `cluster.config.json` Spark configuration from the [`cluster.config.template.json`](./databricks/config/cluster.config.template.json) file. For the "node_type_id" field, select a SKU that is available from the following command in your subscription: - - ```bash - az vm list-usage --location "" -o table - ``` - - - In the repository we provide an example, but you need to make sure that the SKU exists on your region and that is available for your subscription. 2. **Deploy Azure resources** - `cd` into the `e2e_samples/parking_sensors` folder of the repo. @@ -442,6 +432,11 @@ The following lists some limitations of the solution and associated deployment s - Azure DevOps Variable Groups linked to KeyVault can only be created via the UI, cannot be created programmatically and was not incorporated in the automated deployment of the solution. - **Workaround**: Deployment add sensitive configuration as "secrets" in Variable Groups with the downside of duplicated information. If you wish, you may manually link a second Variable Group to KeyVault to pull out the secrets. KeyVault secret names should line up with required variables in the Azure DevOps pipelines. See [here](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=yaml#link-secrets-from-an-azure-key-vault) for more information. +- Azure DevOps Service Connection Removal: If you encounter an error like: *"Cannot delete this service connection while federated credentials for app exist in Entra tenant . Please make sure federated credentials have been removed prior to deleting the service connection."* This issue occurs when you try to delete a Service Connection in the Azure DevOps (AzDo) portal, but the Service Connection has federated credentials that need to be manually removed from the Azure Portal. + - **Workaround - Manually Deleting Federated Credentials:** + Navigate to the Azure portal and locate your app registration under App Registrations. In the left navigation pane, select Certificates & Secrets and then the Federated Credentials + tab. Delete the federated credential from this section. Once the credential is deleted, you can proceed to delete the app registration in the Azure Portal and the Azure Service + Connection in the AzDo portal. - Azure DevOps Environment and Approval Gates can only be managed via the UI, cannot be managed programmatically and was not incorporated in the automated deployment of the solution. - **Workaround**: Approval Gates can be easily configured manually. See [here](https://docs.microsoft.com/en-us/azure/devops/pipelines/process/environments?view=azure-devops#approvals) for more information. - ADF publishing through the CI/CD pipeline using the npm task still throws and error in the logs due to the missing publish_config.json file but the pipeline completes successfully. 
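For readers who prefer the CLI over the portal for the federated-credential workaround above, here is a minimal sketch using the same Azure CLI commands that this PR's `common.sh` helper relies on (`az ad app federated-credential list/delete`). The `APP_OBJECT_ID` value is a hypothetical placeholder for the object id of your app registration.

```bash
# Placeholder: object id of the app registration backing the service connection.
APP_OBJECT_ID="<app-object-id>"

# List the federated credentials attached to the app registration and delete each one
# before removing the app registration and the Azure DevOps service connection.
az ad app federated-credential list --id "$APP_OBJECT_ID" --query "[].id" -o tsv |
while read -r cred_id; do
  az ad app federated-credential delete --id "$APP_OBJECT_ID" --federated-credential-id "$cred_id"
done
```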
diff --git a/e2e_samples/parking_sensors/databricks/config/cluster.config.json b/e2e_samples/parking_sensors/databricks/config/cluster.config.json index c91e6376b..88221d078 100644 --- a/e2e_samples/parking_sensors/databricks/config/cluster.config.json +++ b/e2e_samples/parking_sensors/databricks/config/cluster.config.json @@ -1,8 +1,8 @@ { "cluster_name": "ddo_cluster", "autoscale": { "min_workers": 1, "max_workers": 2 }, - "spark_version": "15.4.x-scala2.12", - "autotermination_minutes": 10, + "spark_version": "14.3.x-scala2.12", + "autotermination_minutes": 30, "node_type_id": "Standard_D4as_v5", "data_security_mode": "SINGLE_USER", "runtime_engine": "PHOTON", diff --git a/e2e_samples/parking_sensors/infrastructure/main.bicep b/e2e_samples/parking_sensors/infrastructure/main.bicep index 2a9bc1561..812d7acf5 100644 --- a/e2e_samples/parking_sensors/infrastructure/main.bicep +++ b/e2e_samples/parking_sensors/infrastructure/main.bicep @@ -70,10 +70,6 @@ module keyvault './modules/keyvault.bicep' = { keyvault_owner_object_id: keyvault_owner_object_id datafactory_principal_id: datafactory.outputs.datafactory_principal_id } - - dependsOn: [ - datafactory - ] } @@ -107,10 +103,6 @@ module diagnostic './modules/diagnostic_settings.bicep' = if (enable_monitoring) loganalytics_workspace_name: loganalytics.outputs.loganalyticswsname datafactory_name: datafactory.outputs.datafactory_name } - dependsOn: [ - loganalytics - datafactory - ] } @@ -149,8 +141,6 @@ module alerts './modules/alerts.bicep' = if (enable_monitoring) { } dependsOn: [ loganalytics - datafactory - actiongroup ] } @@ -162,7 +152,6 @@ module data_quality_workbook './modules/data_quality_workbook.bicep' = if (enabl } dependsOn: [ loganalytics - appinsights ] } diff --git a/e2e_samples/parking_sensors/infrastructure/modules/dashboard.bicep b/e2e_samples/parking_sensors/infrastructure/modules/dashboard.bicep index 5a1fb0514..16d1a5007 100644 --- a/e2e_samples/parking_sensors/infrastructure/modules/dashboard.bicep +++ b/e2e_samples/parking_sensors/infrastructure/modules/dashboard.bicep @@ -46,17 +46,7 @@ resource dashboard 'Microsoft.Portal/dashboards@2022-12-01-preview' = { { name: 'options' isOptional: true - } - { - name: 'sharedTimeRange' - isOptional: true - } - ] - #disable-next-line BCP036 - type: 'Extension/HubsExtension/PartType/MonitorChartPart' - settings: { - content: { - options: { + value: { chart: { metrics: [ { @@ -96,8 +86,14 @@ resource dashboard 'Microsoft.Portal/dashboards@2022-12-01-preview' = { } } } - } - } + { + name: 'sharedTimeRange' + isOptional: true + } + ] + #disable-next-line BCP036 + type: 'Extension/HubsExtension/PartType/MonitorChartPart' + } } { position: { @@ -111,17 +107,7 @@ resource dashboard 'Microsoft.Portal/dashboards@2022-12-01-preview' = { { name: 'options' isOptional: true - } - { - name: 'sharedTimeRange' - isOptional: true - } - ] - #disable-next-line BCP036 - type: 'Extension/HubsExtension/PartType/MonitorChartPart' - settings: { - content: { - options: { + value: { chart: { metrics: [ { @@ -161,8 +147,14 @@ resource dashboard 'Microsoft.Portal/dashboards@2022-12-01-preview' = { } } } - } - } + { + name: 'sharedTimeRange' + isOptional: true + } + ] + #disable-next-line BCP036 + type: 'Extension/HubsExtension/PartType/MonitorChartPart' + } } { position: { @@ -176,17 +168,7 @@ resource dashboard 'Microsoft.Portal/dashboards@2022-12-01-preview' = { { name: 'options' isOptional: true - } - { - name: 'sharedTimeRange' - isOptional: true - } - ] - #disable-next-line BCP036 - type: 
'Extension/HubsExtension/PartType/MonitorChartPart' - settings: { - content: { - options: { + value: { chart: { metrics: [ { @@ -236,8 +218,14 @@ resource dashboard 'Microsoft.Portal/dashboards@2022-12-01-preview' = { } } } - } - } + { + name: 'sharedTimeRange' + isOptional: true + } + ] + #disable-next-line BCP036 + type: 'Extension/HubsExtension/PartType/MonitorChartPart' + } } ] } diff --git a/e2e_samples/parking_sensors/scripts/clean_up.sh b/e2e_samples/parking_sensors/scripts/clean_up.sh index 9314d3d91..939488603 100755 --- a/e2e_samples/parking_sensors/scripts/clean_up.sh +++ b/e2e_samples/parking_sensors/scripts/clean_up.sh @@ -54,6 +54,14 @@ delete_all(){ az ad sp list -o tsv --show-mine --query "[?contains(appDisplayName,'$prefix') && contains(appDisplayName,'$DEPLOYMENT_ID')].displayName" fi + log "\nENTRA APP REGISTRATIONS:\n" + if [[ -z $DEPLOYMENT_ID ]] + then + az ad app list -o tsv --show-mine --query "[?contains(displayName,'$prefix')].displayName" + else + az ad app list -o tsv --show-mine --query "[?contains(displayName,'$prefix') && contains(displayName,'$DEPLOYMENT_ID')].displayName" + fi + log "\nRESOURCE GROUPS:\n" if [[ -z $DEPLOYMENT_ID ]] then @@ -79,17 +87,25 @@ delete_all(){ log "Deleting service connections that start with '$prefix' in name..." [[ -n $prefix ]] && + + sc_ids=($(az devops service-endpoint list --project "$AZDO_PROJECT" --organization "$AZDO_ORGANIZATION_URL" --query "[?contains(name, '$prefix')].id" -o tsv)) + for sc_id in "${sc_ids[@]}"; do + log "Processing Service Connection ID: $sc_id" + cleanup_federated_credentials "$sc_id" + done + #Important:Giving time to the portal process the cleanup + wait_for_process az devops service-endpoint list -o tsv --query "[?contains(name, '$prefix')].id" | xargs -r -I % az devops service-endpoint delete --id % --yes - + log "Finished cleaning up Service Connections" if [[ -z $DEPLOYMENT_ID ]] then - log "Deleting service principal that contain '$prefix' in name, created by yourself..." + log "Deleting service principals that contain '$prefix' in name, created by yourself..." [[ -n $prefix ]] && az ad sp list --query "[?contains(appDisplayName,'$prefix')].appId" -o tsv --show-mine | xargs -r -I % az ad sp delete --id % else - log "Deleting service principal that contain '$prefix' and $DEPLOYMENT_ID in name, created by yourself..." + log "Deleting service principals that contain '$prefix' and $DEPLOYMENT_ID in name, created by yourself..." [[ -n $prefix ]] && az ad sp list --query "[?contains(appDisplayName,'$prefix') && contains(appDisplayName,'$DEPLOYMENT_ID')].appId" -o tsv --show-mine | xargs -r -I % az ad sp delete --id % @@ -97,7 +113,20 @@ delete_all(){ if [[ -z $DEPLOYMENT_ID ]] then - log "Deleting resource groups that comtain '$prefix' in name..." + log "Deleting app registrations that contain '$prefix' in name, created by yourself..." + [[ -n $prefix ]] && + az ad app list --query "[?contains(displayName,'$prefix')].appId" -o tsv --show-mine | + xargs -r -I % az ad app delete --id % + else + log "Deleting app registrations that contain '$prefix' and $DEPLOYMENT_ID in name, created by yourself..." + [[ -n $prefix ]] && + az ad app list --query "[?contains(displayName,'$prefix') && contains(displayName,'$DEPLOYMENT_ID')].appId" -o tsv --show-mine | + xargs -r -I % az ad app delete --id % + fi + + if [[ -z $DEPLOYMENT_ID ]] + then + log "Deleting resource groups that contain '$prefix' in name..." [[ -n $prefix ]] && az group list --query "[?contains(name,'$prefix') && ! 
contains(name,'dbw')].name" -o tsv | xargs -I % az group delete --verbose --name % -y diff --git a/e2e_samples/parking_sensors/scripts/common.sh b/e2e_samples/parking_sensors/scripts/common.sh index 3a5bc9fe4..82662f5e0 100755 --- a/e2e_samples/parking_sensors/scripts/common.sh +++ b/e2e_samples/parking_sensors/scripts/common.sh @@ -127,3 +127,52 @@ create_adf_trigger () { adfTUrl="${adfFactoryBaseUrl}/triggers/${name}?api-version=${apiVersion}" az rest --method put --uri "$adfTUrl" --body @"${ADF_DIR}"/trigger/"${name}".json -o none } + +# Function to give time for the portal to process the cleanup +wait_for_process() { + local seconds=${1:-15} + log "Giving the portal $seconds seconds to process the information..." + sleep "$seconds" +} + +cleanup_federated_credentials() { + ##Function used in the Clean_up.sh and deploy_azdo_service_connections_azure.sh scripts + local sc_id=$1 + local spnAppObjId=$(az devops service-endpoint show --id "$sc_id" --org "$AZDO_ORGANIZATION_URL" -p "$AZDO_PROJECT" --query "data.appObjectId" -o tsv) + # if the Service connection does not have an associated Service Principal, + # then it means it won't have associated federated credentials + if [ -z "$spnAppObjId" ]; then + log "Service Principal Object ID not found for Service Connection ID: $sc_id. Skipping federated credential cleanup." + return + fi + + local spnCredlist=$(az ad app federated-credential list --id "$spnAppObjId" --query "[].id" -o json) + log "Attempting to delete federated credentials." + + # Sometimes the Azure Portal needs a little bit more time to process the information. + if [ -z "$spnCredlist" ]; then + log "It was not possible to list Federated credentials for Service Principal. Retrying once more.." + wait_for_process + spnCredlist=$(az ad app federated-credential list --id "$spnAppObjId" --query "[].id" -o json) + if [ -z "$spnCredlist" ]; then + log "It was not possible to list Federated credentials for specified Service Principal." + return + fi + fi + + local credArray=($(echo "$spnCredlist" | jq -r '.[]')) + #(&& and ||) to log success or failure of each delete operation + for cred in "${credArray[@]}"; do + az ad app federated-credential delete --federated-credential-id "$cred" --id "$spnAppObjId" && + log "Deleted federated credential: $cred" || + log "Failed to delete federated credential: $cred" + done + # Refresh the list of federated credentials + spnCredlist=$(az ad app federated-credential list --id "$spnAppObjId" --query "[].id" -o json) + if [ "$(echo "$spnCredlist" | jq -e '. | length > 0')" = "true" ]; then + log "Failed to delete federated credentials" "danger" + exit 1 + fi + log "Completed federated credential cleanup for the Service Principal: $spnAppObjId" +} + diff --git a/e2e_samples/parking_sensors/scripts/configure_databricks.sh b/e2e_samples/parking_sensors/scripts/configure_databricks.sh index bc60385f0..eaccd54b0 100755 --- a/e2e_samples/parking_sensors/scripts/configure_databricks.sh +++ b/e2e_samples/parking_sensors/scripts/configure_databricks.sh @@ -27,6 +27,7 @@ set -o nounset # KEYVAULT_RESOURCE_ID # KEYVAULT_DNS_NAME # USER_NAME +# AZURE_LOCATION . 
./scripts/common.sh @@ -54,6 +55,43 @@ databricks workspace import "$databricks_folder_name/01_explore.py" --file "./da databricks workspace import "$databricks_folder_name/02_standardize.py" --file "./databricks/notebooks/02_standardize.py" --format SOURCE --language PYTHON --overwrite databricks workspace import "$databricks_folder_name/03_transform.py" --file "./databricks/notebooks/03_transform.py" --format SOURCE --language PYTHON --overwrite +# Define suitable VM for DB cluster +file_path="./databricks/config/cluster.config.json" + +# Get available VM sizes in the specified region +vm_sizes=$(az vm list-sizes --location "$AZURE_LOCATION" --output json) + +# Get available Databricks node types using the list-node-types API +node_types=$(databricks clusters list-node-types --output json) + +# Extract VM names and node type IDs into temporary files +echo "$vm_sizes" | jq -r '.[] | .name' > vm_names.txt +# Get available Databricks node types using the list-node-types API and filter node types to only include those that support Photon +photon_node_types=$(echo "$node_types" | jq -r '.node_types[] | select(.photon_driver_capable == true) | .node_type_id') + +# Find common VM sizes +common_vms=$(grep -Fwf <(echo "$photon_node_types") vm_names.txt) + +# Find the VM with the least resources +least_resource_vm=$(echo "$vm_sizes" | jq --arg common_vms "$common_vms" ' + map(select(.name == ($common_vms | split("\n")[]))) | + sort_by(.numberOfCores, .memoryInMB) | + .[0] +') +log "VM with the least resources:$least_resource_vm" "info" + +# Update the JSON file with the least resource VM +if [ -n "$least_resource_vm" ]; then + node_type_id=$(echo "$least_resource_vm" | jq -r '.name') + jq --arg node_type_id "$node_type_id" '.node_type_id = $node_type_id' "$file_path" > tmp.$$.json && mv tmp.$$.json "$file_path" + log "The JSON file at '$file_path' has been updated with the node_type_id: $node_type_id" +else + log "No common VM options found between Azure and Databricks." "error" +fi + +# Clean up temporary files +rm vm_names.txt + # Create initial cluster, if not yet exists # cluster.config.json file needs to refer to one of the available SKUs on yout Region # az vm list-skus --location --all --output table diff --git a/e2e_samples/parking_sensors/scripts/deploy_azdo_service_connections_azure.sh b/e2e_samples/parking_sensors/scripts/deploy_azdo_service_connections_azure.sh index 61c6ba5cc..dcf2fa918 100755 --- a/e2e_samples/parking_sensors/scripts/deploy_azdo_service_connections_azure.sh +++ b/e2e_samples/parking_sensors/scripts/deploy_azdo_service_connections_azure.sh @@ -30,6 +30,7 @@ set -o errexit set -o pipefail set -o nounset + ################### # REQUIRED ENV VARIABLES: # @@ -37,46 +38,92 @@ set -o nounset # ENV_NAME # RESOURCE_GROUP_NAME # DEPLOYMENT_ID +############### . 
./scripts/common.sh -############### -# Setup Azure service connection +########################################### +# Setup Azure service connection variables +########################################### az_service_connection_name="${PROJECT}-serviceconnection-$ENV_NAME" - az_sub=$(az account show --output json) az_sub_id=$(echo "$az_sub" | jq -r '.id') az_sub_name=$(echo "$az_sub" | jq -r '.name') -# Create Service Account -az_sp_name=${PROJECT}-${ENV_NAME}-${DEPLOYMENT_ID}-sp -log "Creating service principal: $az_sp_name for azure service connection" -az_sp=$(az ad sp create-for-rbac \ - --role contributor \ - --scopes "/subscriptions/${az_sub_id}/resourceGroups/${RESOURCE_GROUP_NAME}" \ - --name "$az_sp_name" \ - --output json) -service_principal_id=$(echo "$az_sp" | jq -r '.appId') -az_sp_tenant_id=$(echo "$az_sp" | jq -r '.tenant') - -# Create Azure Service connection in Azure DevOps -azure_devops_ext_azure_rm_service_principal_key=$(echo "$az_sp" | jq -r '.password') -export AZURE_DEVOPS_EXT_AZURE_RM_SERVICE_PRINCIPAL_KEY=$azure_devops_ext_azure_rm_service_principal_key - -if sc_id=$(az devops service-endpoint list -o json | jq -r -e --arg sc_name "$az_service_connection_name" '.[] | select(.name==$sc_name) | .id'); then + +#Project ID +project_id=$(az devops project show --project "$AZDO_PROJECT" --organization "$AZDO_ORGANIZATION_URL" --query id -o tsv) + +# Check if the service connection already exists and delete it if found +sc_id=$(az devops service-endpoint list --project "$AZDO_PROJECT" --organization "$AZDO_ORGANIZATION_URL" --query "[?name=='$az_service_connection_name'].id" -o tsv) +if [ -n "$sc_id" ]; then log "Service connection: $az_service_connection_name already exists. Deleting service connection id $sc_id ..." "info" - az devops service-endpoint delete --id "$sc_id" -y -o none + cleanup_federated_credentials "$sc_id" + wait_for_process + + #Delete azdo service connection + delete_response=$(az devops service-endpoint delete --id "$sc_id" --project "$AZDO_PROJECT" --organization "$AZDO_ORGANIZATION_URL" -y ) + if echo "$delete_response" | grep -q "TF400813"; then + log "Failed to delete service connection: $sc_id" "danger" + exit 1 + fi + log "Successfully deleted service connection: $sc_id" + fi -log "Creating Azure service connection Azure DevOps" -sc_id=$(az devops service-endpoint azurerm create \ - --name "$az_service_connection_name" \ - --azure-rm-service-principal-id "$service_principal_id" \ - --azure-rm-subscription-id "$az_sub_id" \ - --azure-rm-subscription-name "$az_sub_name" \ - --azure-rm-tenant-id "$az_sp_tenant_id" --output json | jq -r '.id') - -az devops service-endpoint update \ - --id "$sc_id" \ - --enable-for-all "true" \ - -o none \ No newline at end of file +# JSON config file +cat < ./devops.json +{ + "data": { + "subscriptionId": "$az_sub_id", + "subscriptionName": "$az_sub_name", + "creationMode": "Automatic", + "environment": "AzureCloud", + "scopeLevel": "Subscription" + }, + "name": "$az_service_connection_name", + "type": "azurerm", + "url": "https://management.azure.com/", + "authorization": { + "scheme": "WorkloadIdentityFederation", + "parameters": { + "tenantid": "$TENANT_ID", + "scope": "/subscriptions/$az_sub_id/resourcegroups/$RESOURCE_GROUP_NAME" + } + }, + "isShared": false, + "isReady": true, + "serviceEndpointProjectReferences": [ + { + "description": "", + "name": "$az_service_connection_name", + "projectReference": { + "id": "$project_id", + "name": "$AZDO_PROJECT" + } + } + ] +} +EOF + +log "Create a new service 
connection" + +# Create the service connection using the Azure DevOps CLI +response=$(az devops service-endpoint create --service-endpoint-configuration ./devops.json --org "$AZDO_ORGANIZATION_URL" -p "$AZDO_PROJECT") +sc_id=$(echo "$response" | jq -r '.id') +log "Created Connection: $sc_id" + +if [ -z "$sc_id" ]; then + log "Failed to create service connection" "danger" + exit 1 +fi + +az devops service-endpoint update --id "$sc_id" --enable-for-all "true" --project "$AZDO_PROJECT" --organization "$AZDO_ORGANIZATION_URL" -o none + +# Remove the JSON config file if exists +if [ -f ./devops.json ]; then + rm ./devops.json + log "Removed the JSON config file: ./devops.json" +else + log "JSON config file does not exist: ./devops.json" +fi \ No newline at end of file diff --git a/e2e_samples/parking_sensors/scripts/deploy_infrastructure.sh b/e2e_samples/parking_sensors/scripts/deploy_infrastructure.sh index 50f9f74de..d840c0d0f 100755 --- a/e2e_samples/parking_sensors/scripts/deploy_infrastructure.sh +++ b/e2e_samples/parking_sensors/scripts/deploy_infrastructure.sh @@ -270,6 +270,7 @@ DATABRICKS_TOKEN=$databricks_aad_token \ DATABRICKS_HOST=$databricks_host \ KEYVAULT_DNS_NAME=$kv_dns_name \ USER_NAME=$kv_owner_name \ +AZURE_LOCATION=$AZURE_LOCATION \ KEYVAULT_RESOURCE_ID=$(echo "$arm_output" | jq -r '.properties.outputs.keyvault_resource_id.value') \ bash -c "./scripts/configure_databricks.sh" diff --git a/single_tech_samples/databricks/README.md b/single_tech_samples/databricks/README.md new file mode 100644 index 000000000..88ba297e3 --- /dev/null +++ b/single_tech_samples/databricks/README.md @@ -0,0 +1,7 @@ +# Azure Databricks + +[Azure Databricks](https://docs.microsoft.com/en-us/azure/databricks/) is a data analytics platform optimized for the Microsoft Azure cloud services platform which lets you set up your Apache Sparkā„¢ environment in minutes, and enable you to autoscale, and collaborate on shared projects in an interactive workspace. + +## Samples + +- [IaC Deployment of Azure Databricks using Terraform](./databricks_terraform/README.md) - This sample demonstrates how to deploy an Azure Databricks environment using Terraform and promote the source code using Databricks Asset Bundles to different environments. 
\ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/.devcontainer/Dockerfile b/single_tech_samples/databricks/databricks_terraform/.devcontainer/Dockerfile new file mode 100644 index 000000000..ccd25bcd6 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/.devcontainer/Dockerfile @@ -0,0 +1,18 @@ +# Use Ubuntu Image +FROM mcr.microsoft.com/devcontainers/python:3.11-bullseye + +# Update and install required system dependencies +RUN apt update \ + && apt install -y sudo vim software-properties-common curl unzip\ + && apt clean + +# Copy and install dev dependencies +COPY requirements-dev.txt /tmp/requirements-dev.txt +RUN pip install -r /tmp/requirements-dev.txt && \ + rm /tmp/requirements-dev.txt + +# Set the working directory +WORKDIR /workspace + +# Default command +CMD ["/bin/bash"] diff --git a/single_tech_samples/databricks/databricks_terraform/.devcontainer/devcontainer.json b/single_tech_samples/databricks/databricks_terraform/.devcontainer/devcontainer.json new file mode 100644 index 000000000..213771b48 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/.devcontainer/devcontainer.json @@ -0,0 +1,25 @@ +{ + "name": "Python DevContainer", + "dockerFile": "Dockerfile", + "context": "..", + "features": { + "ghcr.io/devcontainers/features/terraform:1": { + "installTerrafromDocs": true + }, + "ghcr.io/devcontainers/features/azure-cli:1": { + "extensions": "" + }, + "ghcr.io/devcontainers/features/github-cli:1": {}, + "ghcr.io/audacioustux/devcontainers/taskfile:1": {} + }, + "customizations" :{ + "vscode": { + "extensions": [ + "yzhang.markdown-all-in-one", + "DavidAnson.vscode-markdownlint", + "-dbaeumer.vscode-eslint" + ] + } + } + } + \ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/.github/workflows/adb-asset-bundle-dev-deployment.yml b/single_tech_samples/databricks/databricks_terraform/.github/workflows/adb-asset-bundle-dev-deployment.yml new file mode 100644 index 000000000..a26ca28cb --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/.github/workflows/adb-asset-bundle-dev-deployment.yml @@ -0,0 +1,61 @@ +name: "Asset Bundle Dev Deployment" + +on: + workflow_run: + workflows: ["Asset Bundle Sandbox Deployment"] + types: + - completed + +env: + ENV: dev + WORKING_DIR: single_tech_samples/databricks/databricks_terraform/ + +jobs: + deploy: + name: "Deploy bundle" + runs-on: ubuntu-latest + environment: development + defaults: + run: + working-directory: ${{ env.WORKING_DIR }} + if: | + github.event.workflow_run.conclusion == 'success' && + github.event.workflow_run.head_branch == 'main' + + steps: + - name: Checkout Repository + uses: actions/checkout@v4 + + - name: Setup Databricks CLI + uses: databricks/setup-cli@main + + - name: Azure Login Using Service Principal + uses: azure/login@v2 + with: + creds: ${{ secrets.AZURE_DEV_CREDENTIALS }} + + - name: Deploy Databricks Bundle + run: | + databricks bundle validate -t ${{ env.ENV }} -o json + databricks bundle deploy -t ${{ env.ENV }} + working-directory: . 
+ env: + DATABRICKS_BUNDLE_ENV: ${{ env.ENV }} + + - name: Install Task + uses: arduino/setup-task@v2 + with: + version: 3.x + repo-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Set Test Flows + run: task collect-tests + + - name: Run test workflows + run: task run-tests + env: + # gets test_flows from Set Test Flows step + # and passes to the run-tests task + test_flows: ${{ env.test_flows }} + # bundle file required variables + DATABRICKS_BUNDLE_ENV: ${{ env.ENV }} diff --git a/single_tech_samples/databricks/databricks_terraform/.github/workflows/adb-asset-bundle-linting.yml b/single_tech_samples/databricks/databricks_terraform/.github/workflows/adb-asset-bundle-linting.yml new file mode 100644 index 000000000..f7001709a --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/.github/workflows/adb-asset-bundle-linting.yml @@ -0,0 +1,36 @@ +name: "ADB Asset Bundle CI Linting" + +on: + pull_request: + branches: + - main + paths: + - "single_tech_samples/databricks/databricks_terraform/**" + +env: + UV_VERSION: ">=0.4.26" + PYTHON_VERSION: "3.11" + +jobs: + linting: + runs-on: ubuntu-latest + + steps: + - name: Checkout the repository + uses: actions/checkout@v4 + + - name: Install uv + uses: astral-sh/setup-uv@v3 + with: + enable-cache: true + version: ${{ env.UV_VERSION }} + cache-dependency-glob: "**/requirements**.txt" + + - name: Install Python and Dependencies + run: | + uv python install ${{ env.PYTHON_VERSION }} + uv tool install ruff + + - name: Run Ruff Lint + run: | + uv run ruff check single_tech_samples/databricks/databricks_terraform diff --git a/single_tech_samples/databricks/databricks_terraform/.github/workflows/adb-asset-bundle-sandbox-deployment.yml b/single_tech_samples/databricks/databricks_terraform/.github/workflows/adb-asset-bundle-sandbox-deployment.yml new file mode 100644 index 000000000..cfecc33d5 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/.github/workflows/adb-asset-bundle-sandbox-deployment.yml @@ -0,0 +1,70 @@ +name: "Asset Bundle Sandbox Deployment" + +on: + push: + branches: + - main + paths: + - "single_tech_samples/databricks/databricks_terraform/**" + pull_request: + branches: + - main + paths: + - "single_tech_samples/databricks/databricks_terraform/**" + +env: + ENV: sandbox + WORKING_DIR: single_tech_samples/databricks/databricks_terraform/ + +jobs: + deploy: + name: "Deploy bundle" + runs-on: ubuntu-latest + environment: sandbox + + defaults: + run: + working-directory: ${{ env.WORKING_DIR }} + + steps: + - name: Checkout Repository + uses: actions/checkout@v4 + + - name: Setup Databricks CLI + uses: databricks/setup-cli@main + + - name: Azure Login Using Service Principal + uses: azure/login@v2 + with: + creds: ${{ secrets.AZURE_INT_CREDENTIALS }} + + - name: Deploy Databricks Bundle + run: | + if [ "${{ github.event_name }}" == "pull_request" ]; then + databricks bundle validate -t ${{ env.ENV }} -o json + elif [ "${{ github.event_name }}" == "push" ]; then + databricks bundle deploy -t ${{ env.ENV }} -o json + fi + env: + DATABRICKS_BUNDLE_ENV: ${{ env.ENV }} + + - name: Install Task + if: github.event_name == 'push' + uses: arduino/setup-task@v2 + with: + version: 3.x + repo-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Set Test Flows + if: github.event_name == 'push' + run: task collect-tests + + - name: Run test workflows + if: github.event_name == 'push' + run: task run-tests + env: + # gets test_flows from Set Test Flows step + # and passes to the run-tests task + test_flows: ${{ env.test_flows }} 
+ # bundle file required variables + DATABRICKS_BUNDLE_ENV: ${{ env.ENV }} diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/README.md b/single_tech_samples/databricks/databricks_terraform/Infra/README.md new file mode 100644 index 000000000..48ce4b6c8 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/README.md @@ -0,0 +1,70 @@ +# Terraform Code for Multi Environment Databricks Medallion Deployment + +![Multi Environment Image](../images/architecture.png) + +[Visio Drawing](https://microsoft.sharepoint.com/:u:/t/ExternalEcolabKitchenOS/EWM3kB69NGBBiy2s563pjJ0BeKWy1qgtgEznRvvufiseFg?e=RieWOu) + +## Overview + +The **`Infra/modules`** folder has three modules: +- **`adb-workspace`** - Deploys the Databricks workspace. +- **`metastore-and-users`** - Creates the Databricks access connector and the storage account, grants the connector access rights to the storage, creates the metastore and assigns the workspace to it, and finally retrieves all users, groups, and service principals from Azure AD. +- **`adb-unity-catalog`** - Grants Databricks access rights to the connector, creates containers in the storage account along with external locations for them, creates the Unity Catalog and grants permissions to the user groups, and finally creates the **`bronze`**, **`silver`**, and **`gold`** schemas under the catalog with the required permissions for the user groups. + +**NOTE** - *When the **`adb-workspace`** module runs, it creates the Databricks workspace and, by default, a metastore in the same region. Databricks allows only **ONE METASTORE** per region. The **`metastore-and-users`** module deploys a new metastore with the required configuration, so the existing metastore must be deleted before running the module.* + +**NOTE** - *During script execution you may receive the error `Error: cannot create metastore: This account with id has reached the limit for metastores in region `. This means the region has reached its metastore limit. To fix this, delete the existing metastore and re-run the script.* + +## How to Run + +### Pre-requisites +- `Infra/deployment/.env` - Update the values as per your requirements. +- Have Databricks admin-level access. Log in to [accounts.azuredatabricks.net](https://accounts.azuredatabricks.net/) to get the Databricks account id. + +### Steps + +1. Login to Azure +```bash +az login +``` + +2. Set the subscription +```bash +az account set --subscription +``` + +3. Change directory to `Infra/deployment` +```bash +cd Infra/deployment +``` + +4. Make the script executable +```bash +chmod +x dev.deploy.sh +``` + +5. Run the script to deploy the modules sequentially +```bash +./dev.deploy.sh +``` + +## Destroy + +### Steps + +1. Change directory to `Infra/deployment` +```bash +cd Infra/deployment +``` +2. Make the script executable +```bash +chmod +x dev.destroy.sh +``` +3. Run the script to destroy the modules by passing the `--destroy` flag +```bash +./dev.destroy.sh --destroy +``` + +## Error Handling + +If any script fails during resource creation, rerun it. It will reference the local state files and try again to create the missing resources. 
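As a companion to the metastore-limit note in the Infra README above, the following is a minimal sketch of how the conflicting regional metastore could be removed before re-running the deployment. It assumes the unified Databricks CLI is authenticated at the account level; the `databricks account metastores` command group, the environment variable names, and the placeholder ids are assumptions to verify against your CLI version.

```bash
# Account-level authentication for the unified Databricks CLI (placeholder values).
export DATABRICKS_HOST="https://accounts.azuredatabricks.net"
export DATABRICKS_ACCOUNT_ID="<databricks-account-id>"

# List the metastores in the account and note the id of the one occupying your region.
databricks account metastores list

# Delete the conflicting metastore by id, then re-run the deployment script.
databricks account metastores delete "<metastore-id>"
cd Infra/deployment && ./dev.deploy.sh
```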
diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/deployment/.env_example b/single_tech_samples/databricks/databricks_terraform/Infra/deployment/.env_example new file mode 100644 index 000000000..3c6d40455 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/deployment/.env_example @@ -0,0 +1,11 @@ +region="" +environment="dev" +subscription_id="" +resource_group_name="" +metastore_name="metastore_azure_" # example: metastore_azure_eastus2 +account_id="" # login https://accounts.azuredatabricks.net/ to get the account id. +prefix="dev" + +# Ensure these groups exist in Azure EntraId. +# Make sure you are a member of account_unity_admin group when running a script locally. +aad_groups='["account_unity_admin","data_engineer","data_analyst","data_scientist"]' \ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/deployment/dev.deploy.sh b/single_tech_samples/databricks/databricks_terraform/Infra/deployment/dev.deploy.sh new file mode 100755 index 000000000..5e83c2717 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/deployment/dev.deploy.sh @@ -0,0 +1,92 @@ +#!/bin/bash + +# Load environment variables from .env file +if [[ -f .env ]]; then + echo "Loading environment variables from .env file" + source .env +else + echo ".env file not found!" + exit 1 +fi + +# Function to deploy a Terraform module +deploy_module() { + local module_path="$1" + shift + echo "Deploying module: ${module_path}" + + pushd "${module_path}" || exit 1 + + terraform init || { echo "Terraform init failed"; exit 1; } + terraform apply -auto-approve -var="region=${region}" -var="environment=${environment}" -var="subscription_id=${subscription_id}" "$@" || { + echo "Terraform apply failed in ${module_path}"; exit 1; + } + + popd || exit 1 +} + +# Deploy Azure Databricks Workspace +workspace_module_path="../modules/adb-workspace" +deploy_module "${workspace_module_path}" \ + -var="resource_group_name=${resource_group_name}" + +# Capture Workspace Outputs +workspace_name=$(terraform -chdir="${workspace_module_path}" output -raw databricks_workspace_name) +workspace_resource_group=$(terraform -chdir="${workspace_module_path}" output -raw resource_group) +workspace_host_url=$(terraform -chdir="${workspace_module_path}" output -raw databricks_workspace_host_url) +workspace_id=$(terraform -chdir="${workspace_module_path}" output -raw databricks_workspace_id) + +# Export workspace outputs as environment variables +export TF_VAR_databricks_workspace_name="${workspace_name}" +export TF_VAR_resource_group="${workspace_resource_group}" +export TF_VAR_databricks_workspace_host_url="${workspace_host_url}" +export TF_VAR_databricks_workspace_id="${workspace_id}" + +# Deploy Metastore and Users +metastore_module_path="../modules/metastore-and-users" +deploy_module "${metastore_module_path}" \ + -var="resource_group=${workspace_resource_group}" \ + -var="databricks_workspace_name=${workspace_name}" \ + -var="databricks_workspace_host_url=${workspace_host_url}" \ + -var="databricks_workspace_id=${workspace_id}" \ + -var="metastore_name=${metastore_name}" \ + -var="aad_groups=${aad_groups}" \ + -var="account_id=${account_id}" \ + -var="prefix=${prefix}" + +# Capture Metastore Outputs +metastore_id=$(terraform -chdir="${metastore_module_path}" output -raw metastore_id) +azurerm_storage_account_unity_catalog_id=$(terraform -chdir="${metastore_module_path}" output -json | jq -r '.azurerm_storage_account_unity_catalog.value.id') 
+azure_storage_account_name="${azurerm_storage_account_unity_catalog_id##*/}" # get storage account name +azurerm_databricks_access_connector_id=$(terraform -chdir="${metastore_module_path}" output -json | jq -r '.azurerm_databricks_access_connector_id.value') +databricks_groups=$(terraform -chdir="${metastore_module_path}" output -json databricks_groups) +databricks_users=$(terraform -chdir="${metastore_module_path}" output -json databricks_users) +databricks_sps=$(terraform -chdir="${metastore_module_path}" output -json databricks_sps) + +# Export metastore outputs as environment variables +export TF_VAR_metastore_id="${metastore_id}" +export TF_VAR_azurerm_storage_account_unity_catalog_id="${azurerm_storage_account_unity_catalog_id}" +export TF_VAR_azure_storage_account_name="${azure_storage_account_name}" +export TF_VAR_azurerm_databricks_access_connector_id="${azurerm_databricks_access_connector_id}" +export TF_VAR_databricks_groups="${databricks_groups}" +export TF_VAR_databricks_users="${databricks_users}" +export TF_VAR_databricks_sps="${databricks_sps}" + +# Deploy Unity Catalog +unity_catalog_module_path="../modules/adb-unity-catalog" +deploy_module "${unity_catalog_module_path}" \ + -var="environment=${environment}" \ + -var="subscription_id=${subscription_id}" \ + -var="databricks_workspace_host_url=${workspace_host_url}" \ + -var="databricks_workspace_id=${workspace_id}" \ + -var="metastore_id=${metastore_id}" \ + -var="azurerm_storage_account_unity_catalog_id=${azurerm_storage_account_unity_catalog_id}" \ + -var="azure_storage_account_name=${azure_storage_account_name}" \ + -var="azurerm_databricks_access_connector_id=${azurerm_databricks_access_connector_id}" \ + -var="databricks_groups=${databricks_groups}" \ + -var="databricks_users=${databricks_users}" \ + -var="databricks_sps=${databricks_sps}" \ + -var="aad_groups=${aad_groups}" \ + -var="account_id=${account_id}" + +echo "All resources deployed successfully." \ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/deployment/dev.destroy.sh b/single_tech_samples/databricks/databricks_terraform/Infra/deployment/dev.destroy.sh new file mode 100755 index 000000000..373086db8 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/deployment/dev.destroy.sh @@ -0,0 +1,88 @@ +#!/bin/bash + +load_env_variables() { + if [[ -f .env ]]; then + echo "Loading environment variables from .env file" + source .env + else + echo ".env file not found!" + exit 1 + fi +} + + +check_resource_group_name() { + if [[ -z "${resource_group_name}" ]]; then + echo "Resource group name is not defined in .env file!" + exit 1 + fi +} + + +check_resource_group() { + az group exists --name "$1" +} + + +delete_resource_group() { + echo "Checking if Azure resource group exists: ${resource_group_name}" + + resource_group_exists=$(check_resource_group "${resource_group_name}") + + if [[ "${resource_group_exists}" == "true" ]]; then + echo "Deleting Azure resource group: ${resource_group_name}" + if ! az group delete --name "${resource_group_name}" --yes --no-wait; then + echo "Failed to delete resource group: ${resource_group_name}" + fi + wait_for_deletion + else + echo "Resource group ${resource_group_name} does not exist. Skipping deletion." + fi +} + + +wait_for_deletion() { + echo "Waiting for resource group deletion to complete..." 
+ start_time=$(date +%s) + + while [[ "$(check_resource_group "${resource_group_name}")" == "true" ]]; do + current_time=$(date +%s) + elapsed_time=$((current_time - start_time)) + echo "Resource group ${resource_group_name} still exists. Time elapsed: ${elapsed_time} seconds. Checking again in 10 seconds..." + sleep 10 + done + + total_time=$(( $(date +%s) - start_time )) + echo "Resource group ${resource_group_name} deleted successfully in ${total_time} seconds." +} + +cleanup_terraform_states() { + local modules=( + "../modules/adb-workspace" + "../modules/metastore-and-users" + "../modules/adb-unity-catalog" + ) + + for module in "${modules[@]}"; do + if [[ -d "${module}" ]]; then + echo "Cleaning up Terraform state files in module: ${module}" + rm -f "${module}/terraform.tfstate" "${module}/terraform.tfstate.backup" + rm -rf "${module}/.terraform" + echo "State files cleaned up in module: ${module}" + else + echo "Module path does not exist: ${module}" + fi + done +} + + +main() { + load_env_variables + check_resource_group_name + delete_resource_group + cleanup_terraform_states + echo "All tasks completed successfully." +} + + +main diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/locals.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/locals.tf new file mode 100644 index 000000000..e493c2c6a --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/locals.tf @@ -0,0 +1,36 @@ +locals { + # data layers with names for storage containers and external locations + data_layers = [ + { + name = "landing" + storage_container = "landing" + external_location = "landing" + }, + { + name = "bronze" + storage_container = "bronze" + external_location = "bronze" + }, + { + name = "silver" + storage_container = "silver" + external_location = "silver" + }, + { + name = "gold" + storage_container = "gold" + external_location = "gold" + }, + { + name = "checkpoints" + storage_container = "checkpoints" + external_location = "checkpoints" + } + ] + + # Catalog and environment configuration + catalog_name = "${var.environment}_catalog" + environment = var.environment + merged_user_sp = merge(var.databricks_users, var.databricks_sps) + aad_groups = toset(var.aad_groups) +} diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/main.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/main.tf new file mode 100644 index 000000000..0acc5b00b --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/main.tf @@ -0,0 +1,221 @@ +# Generate a random string to append to the managed identity name +resource "random_string" "mi_suffix" { + length = 4 + upper = false + special = false +} + +# Storage credentials for external locations +resource "databricks_storage_credential" "external_mi" { + name = "${local.environment}-mi-credential-${random_string.mi_suffix.result}" + + azure_managed_identity { + access_connector_id = var.azurerm_databricks_access_connector_id + } + + owner = "account_unity_admin" + comment = "Storage credential for all external locations" +} + +# Create storage containers explicitly for each data layer +resource "azurerm_storage_container" "landing" { + name = local.data_layers[0].storage_container + storage_account_id = var.azurerm_storage_account_unity_catalog_id + container_access_type = "private" +} + +resource "azurerm_storage_container" 
"bronze" { + name = local.data_layers[1].storage_container + storage_account_id = var.azurerm_storage_account_unity_catalog_id + container_access_type = "private" +} + +resource "azurerm_storage_container" "silver" { + name = local.data_layers[2].storage_container + storage_account_id = var.azurerm_storage_account_unity_catalog_id + container_access_type = "private" +} + +resource "azurerm_storage_container" "gold" { + name = local.data_layers[3].storage_container + storage_account_id = var.azurerm_storage_account_unity_catalog_id + container_access_type = "private" +} + +resource "azurerm_storage_container" "checkpoint" { + name = local.data_layers[4].storage_container + storage_account_id = var.azurerm_storage_account_unity_catalog_id + container_access_type = "private" +} + +resource "time_sleep" "wait_seconds" { + depends_on = [azurerm_storage_container.landing, + azurerm_storage_container.bronze, + azurerm_storage_container.silver, + azurerm_storage_container.gold, + azurerm_storage_container.checkpoint + ] + create_duration = "30s" +} + +# Create external locations linked to the storage containers +resource "databricks_external_location" "landing" { + name = local.data_layers[0].external_location + url = format("abfss://%s@%s.dfs.core.windows.net/", local.data_layers[0].storage_container, var.azure_storage_account_name) + credential_name = databricks_storage_credential.external_mi.id + owner = "account_unity_admin" + comment = "External location for landing container" + + depends_on = [ time_sleep.wait_seconds ] +} + +resource "databricks_external_location" "bronze" { + name = local.data_layers[1].external_location + url = format("abfss://%s@%s.dfs.core.windows.net/", local.data_layers[1].storage_container, var.azure_storage_account_name) + credential_name = databricks_storage_credential.external_mi.id + owner = "account_unity_admin" + comment = "External location for bronze container" + + depends_on = [ time_sleep.wait_seconds ] +} + +resource "databricks_external_location" "silver" { + name = local.data_layers[2].external_location + url = format("abfss://%s@%s.dfs.core.windows.net/", local.data_layers[2].storage_container, var.azure_storage_account_name) + credential_name = databricks_storage_credential.external_mi.id + owner = "account_unity_admin" + comment = "External location for silver container" + + depends_on = [ time_sleep.wait_seconds ] +} + +resource "databricks_external_location" "gold" { + name = local.data_layers[3].external_location + url = format("abfss://%s@%s.dfs.core.windows.net/", local.data_layers[3].storage_container, var.azure_storage_account_name) + credential_name = databricks_storage_credential.external_mi.id + owner = "account_unity_admin" + comment = "External location for gold container" + + depends_on = [ time_sleep.wait_seconds ] +} + +resource "databricks_external_location" "checkpoint" { + name = local.data_layers[4].external_location + url = format("abfss://%s@%s.dfs.core.windows.net/", local.data_layers[4].storage_container, var.azure_storage_account_name) + credential_name = databricks_storage_credential.external_mi.id + owner = "account_unity_admin" + comment = "External location for checkpoint container" + + depends_on = [ time_sleep.wait_seconds ] +} + +# Create a catalog associated with the landing external location +resource "databricks_catalog" "environment" { + metastore_id = var.metastore_id + name = local.catalog_name + comment = "Catalog for ${local.environment} environment" + owner = "account_unity_admin" + + storage_root = 
replace(databricks_external_location.landing.url, "/$", "") + + properties = { + purpose = var.environment + } +} + +# Apply catalog-level grants +resource "databricks_grants" "environment_catalog" { + catalog = databricks_catalog.environment.name + + # Standard grants for all roles + grant { + principal = "data_engineer" + privileges = ["USE_CATALOG"] + } + + grant { + principal = "data_scientist" + privileges = ["USE_CATALOG"] + } + + grant { + principal = "data_analyst" + privileges = ["USE_CATALOG"] + } +} + +# Create schemas explicitly for each data layer +# Bronze, Silver, Gold +resource "databricks_schema" "bronze_schema" { + catalog_name = databricks_catalog.environment.id + name = local.data_layers[1].name + owner = "account_unity_admin" + comment = "Schema for bronze layer in ${local.catalog_name}" +} + +resource "databricks_schema" "silver_schema" { + catalog_name = databricks_catalog.environment.id + name = local.data_layers[2].name + owner = "account_unity_admin" + comment = "Schema for silver layer in ${local.catalog_name}" +} + +resource "databricks_schema" "gold_schema" { + catalog_name = databricks_catalog.environment.id + name = local.data_layers[3].name + owner = "account_unity_admin" + comment = "Schema for gold layer in ${local.catalog_name}" +} + +# Grant permissions on each schema +# Bronze SIlver Gold +resource "databricks_grants" "bronze_schema_permissions" { + schema = databricks_schema.bronze_schema.id + + # Standard grants for bronze schema + grant { + principal = "data_engineer" + privileges = ["USE_SCHEMA", "CREATE_FUNCTION", "CREATE_TABLE", "EXECUTE", "MODIFY", "SELECT"] + } + + grant { + principal = "data_scientist" + privileges = ["USE_SCHEMA", "SELECT"] + } +} + +resource "databricks_grants" "silver_schema_permissions" { + schema = databricks_schema.silver_schema.id + + # Standard grants for silver schema + grant { + principal = "data_engineer" + privileges = ["USE_SCHEMA", "CREATE_FUNCTION", "CREATE_TABLE", "EXECUTE", "MODIFY", "SELECT"] + } + + grant { + principal = "data_scientist" + privileges = ["USE_SCHEMA", "SELECT"] + } +} + +resource "databricks_grants" "gold_schema_permissions" { + schema = databricks_schema.gold_schema.id + + # Standard grants for gold schema + grant { + principal = "data_engineer" + privileges = ["USE_SCHEMA", "CREATE_FUNCTION", "CREATE_TABLE", "EXECUTE", "MODIFY", "SELECT"] + } + + grant { + principal = "data_scientist" + privileges = ["USE_SCHEMA", "SELECT"] + } + + # Additional grants for data_analyst on the gold schema + grant { + principal = "data_analyst" + privileges = ["USE_SCHEMA", "SELECT"] + } +} \ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/providers.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/providers.tf new file mode 100644 index 000000000..891d3c02b --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/providers.tf @@ -0,0 +1,26 @@ +terraform { + required_providers { + azurerm = { + source = "hashicorp/azurerm" + } + databricks = { + source = "databricks/databricks" + } + } +} + +provider "azurerm" { + subscription_id = var.subscription_id + features {} +} + +provider "databricks" { + alias = "azure_account" + host = "https://accounts.azuredatabricks.net" + account_id = var.account_id + auth_type = "azure-cli" +} + +provider "databricks" { + host = var.databricks_workspace_host_url +} \ No newline at end of file diff --git 
a/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/variables.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/variables.tf new file mode 100644 index 000000000..f974f3419 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-unity-catalog/variables.tf @@ -0,0 +1,75 @@ +variable "region" { + description = "Azure region" + type = string +} + +variable "account_id" { + description = "Azure databricks account id" +} + +variable "subscription_id" { + description = "Azure subscription id" +} + +variable "aad_groups" { + description = "List of AAD groups that you want to add to Databricks account" + type = list(string) +} + +variable "environment" { + description = "The environment to deploy (dev, stg, prod)" + type = string +} + +variable "databricks_groups" { + description = "Map of AAD group object id to Databricks group id" + type = map(string) +} + +variable "databricks_users" { + description = "Map of AAD user object id to Databricks user id" + type = map(string) +} + +variable "databricks_sps" { + description = "Map of AAD service principal object id to Databricks service principal id" + type = map(string) +} + +variable "databricks_workspace_id" { + description = "Azure databricks workspace id" + type = string +} + +variable "azurerm_databricks_access_connector_id" { + description = "Azure databricks access connector id" + type = string +} + +variable "metastore_id" { + description = "Azure databricks metastore id" + type = string +} + +variable "azurerm_storage_account_unity_catalog_id" { + description = "Azure storage account for Unity catalog" +} + +variable "databricks_workspace_host_url" { + description = "Databricks workspace host url" + type = string + +} + +variable "azure_storage_account_name" { + description = "Azure storage account name" + type = string +} + +data "azuread_group" "this" { + for_each = local.aad_groups + display_name = each.value +} + +data "azurerm_client_config" "current" { +} \ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/locals.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/locals.tf new file mode 100644 index 000000000..d2c204aa7 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/locals.tf @@ -0,0 +1,3 @@ +locals { + prefix = "managed-databricks${var.environment}" +} \ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/main.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/main.tf new file mode 100644 index 000000000..8a330d949 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/main.tf @@ -0,0 +1,13 @@ +resource "azurerm_resource_group" "this" { + name = var.resource_group_name + location = var.region +} + +resource "azurerm_databricks_workspace" "this" { + name = var.resource_group_name + resource_group_name = azurerm_resource_group.this.name + location = azurerm_resource_group.this.location + sku = "premium" + depends_on = [ azurerm_resource_group.this ] +} + diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/outputs.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/outputs.tf new file mode 100644 index 000000000..ccaffed24 --- /dev/null +++ 
b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/outputs.tf @@ -0,0 +1,15 @@ +output "databricks_workspace_host_url" { + value = "https://${azurerm_databricks_workspace.this.workspace_url}/" +} + +output "databricks_workspace_name" { + value = azurerm_databricks_workspace.this.name +} + +output "resource_group" { + value = azurerm_resource_group.this.name +} + +output "databricks_workspace_id" { + value = azurerm_databricks_workspace.this.workspace_id +} \ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/providers.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/providers.tf new file mode 100644 index 000000000..a88caba55 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/providers.tf @@ -0,0 +1,15 @@ +terraform { + required_providers { + azurerm = { + source = "hashicorp/azurerm" + } + databricks = { + source = "databricks/databricks" + } + } +} + +provider "azurerm" { + features {} + subscription_id = var.subscription_id +} diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/variables.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/variables.tf new file mode 100644 index 000000000..ed2460a47 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/adb-workspace/variables.tf @@ -0,0 +1,19 @@ +variable "region" { + description = "Azure Region" + type = string +} + +variable "environment" { + description = "Environment" + type = string +} + +variable "subscription_id" { + description = "Azure Subscription ID" + type = string +} + +variable "resource_group_name" { + description = "Azure Resource Group Name" + type = string +} \ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/locals.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/locals.tf new file mode 100644 index 000000000..f0066d11e --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/locals.tf @@ -0,0 +1,21 @@ +locals { + aad_groups = toset(var.aad_groups) + + all_members = toset(flatten([for group in values(data.azuread_group.this) : group.members])) + + all_users = { + for user in data.azuread_users.users.users : user.object_id => user + } + + all_spns = { + for sp in data.azuread_service_principals.spns.service_principals : sp.object_id => sp + } + + account_admin_members = toset(flatten([for group in values(data.azuread_group.this) : [group.display_name == "account_unity_admin" ? group.members : []]])) + + all_account_admin_users = { + for user in data.azuread_users.account_admin_users.users : user.object_id => user + } + + metastore_id = length(data.databricks_metastore.existing.id) > 0 ? 
data.databricks_metastore.existing.id : databricks_metastore.this[0].id +} diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/main.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/main.tf new file mode 100644 index 000000000..f059e672a --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/main.tf @@ -0,0 +1,135 @@ +// Create azure managed identity to be used by unity catalog metastore +resource "azurerm_databricks_access_connector" "unity" { + name = "${var.prefix}-databricks-mi" + resource_group_name = data.azurerm_resource_group.this.name + location = data.azurerm_resource_group.this.location + identity { + type = "SystemAssigned" + } +} + +# Generate a random string for uniqueness +resource "random_string" "unique" { + length = 6 + special = false + upper = false +} + +// Create a storage account to be used by unity catalog metastore as root storage +resource "azurerm_storage_account" "unity_catalog" { + name = substr(lower(replace("databrickssa${random_string.unique.result}", "/[^a-z0-9]/", "")), 0, 24) + resource_group_name = var.resource_group + location = var.region + account_tier = "Standard" + account_replication_type = "LRS" + + is_hns_enabled = true + shared_access_key_enabled = false + default_to_oauth_authentication = true + + identity { + type = "SystemAssigned" + } +} + +// Create a container in storage account to be used by unity catalog metastore as root storage +resource "azurerm_storage_container" "unity_catalog" { + name = "${var.prefix}-container" + storage_account_name = azurerm_storage_account.unity_catalog.name + container_access_type = "private" +} + +// Assign the Storage Blob Data Contributor role to managed identity to allow unity catalog to access the storage +resource "azurerm_role_assignment" "mi_data_contributor" { + scope = azurerm_storage_account.unity_catalog.id + role_definition_name = "Storage Blob Data Contributor" + principal_id = azurerm_databricks_access_connector.unity.identity[0].principal_id +} + +// Use existing metastore or create one if does not exist +resource "databricks_metastore" "this" { + count = length(data.databricks_metastore.existing.id) == 0 ? 1 : 0 + + name = var.metastore_name + storage_root = format("abfss://%s@%s.dfs.core.windows.net/", + azurerm_storage_container.unity_catalog.name, + azurerm_storage_account.unity_catalog.name) + force_destroy = true + owner = "account_unity_admin" +} + +# Introduce a delay after metastore creation +resource "time_sleep" "wait_30_seconds" { + count = length(data.databricks_metastore.existing.id) == 0 ? 1 : 0 + create_duration = "30s" + + depends_on = [databricks_metastore.this] +} + +// Assign managed identity to metastore skip if already assigned +resource "databricks_metastore_data_access" "first" { + count = length(data.databricks_metastore.existing.id) == 0 ? 
1 : 0 + + metastore_id = local.metastore_id + name = "the-metastore-key" + azure_managed_identity { + access_connector_id = azurerm_databricks_access_connector.unity.id + } + is_default = true + + depends_on = [time_sleep.wait_30_seconds] +} + +// Attach the databricks workspace to the metastore +resource "databricks_metastore_assignment" "this" { + workspace_id = var.databricks_workspace_id + metastore_id = local.metastore_id + default_catalog_name = "hive_metastore" +} + +// Add groups to databricks account +resource "databricks_group" "this" { + provider = databricks.azure_account + for_each = data.azuread_group.this + display_name = each.key + external_id = data.azuread_group.this[each.key].object_id + force = true +} + +// All governed by AzureAD, create or remove users to/from databricks account +resource "databricks_user" "this" { + provider = databricks.azure_account + for_each = local.all_users + user_name = lower(local.all_users[each.key]["user_principal_name"]) + display_name = local.all_users[each.key]["display_name"] + active = local.all_users[each.key]["account_enabled"] + external_id = each.key + force = true + disable_as_user_deletion = true # default behavior + + // Review warning before deactivating or deleting users from databricks account + // https://learn.microsoft.com/en-us/azure/databricks/administration-guide/users-groups/scim/#add-users-and-groups-to-your-azure-databricks-account-using-azure-active-directory-azure-ad + lifecycle { + prevent_destroy = false + } +} + +// All governed by AzureAD, create or remove service to/from databricks account +resource "databricks_service_principal" "sp" { + provider = databricks.azure_account + for_each = local.all_spns + application_id = local.all_spns[each.key]["application_id"] + display_name = local.all_spns[each.key]["display_name"] + active = local.all_spns[each.key]["account_enabled"] + external_id = each.key + force = true +} + +// Making all users on account_unity_admin group as databricks account admin +resource "databricks_user_role" "account_admin" { + provider = databricks.azure_account + for_each = local.all_account_admin_users + user_id = databricks_user.this[each.key].id + role = "account_admin" + depends_on = [databricks_group.this, databricks_user.this, databricks_service_principal.sp] +} diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/outputs.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/outputs.tf new file mode 100644 index 000000000..48c7eb3e2 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/outputs.tf @@ -0,0 +1,28 @@ +output "databricks_groups" { + value = { + for group in databricks_group.this : group.external_id => group.id + } +} +output "databricks_users" { + value = { + for user in databricks_user.this : user.external_id => user.id + } +} +output "databricks_sps" { + value = { + for sp in databricks_service_principal.sp : sp.external_id => sp.id + } +} + +output "azurerm_storage_account_unity_catalog" { + value = azurerm_storage_account.unity_catalog + sensitive = true +} + +output "azurerm_databricks_access_connector_id" { + value = azurerm_databricks_access_connector.unity.id +} + +output "metastore_id"{ + value = local.metastore_id +} \ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/provider.tf 
b/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/provider.tf new file mode 100644 index 000000000..b2acc11cf --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/provider.tf @@ -0,0 +1,29 @@ +terraform { + required_providers { + azurerm = { + source = "hashicorp/azurerm" + } + databricks = { + source = "databricks/databricks" + } + } +} + +provider "azurerm" { + subscription_id = var.subscription_id + features {} + storage_use_azuread = true +} + +// Provider for databricks workspace +provider "databricks" { + host = var.databricks_workspace_host_url +} + +// Initialize provider at Azure account-level +provider "databricks" { + alias = "azure_account" + host = "https://accounts.azuredatabricks.net" + account_id = var.account_id + auth_type = "azure-cli" +} \ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/variables.tf b/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/variables.tf new file mode 100644 index 000000000..da28efe0c --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Infra/modules/metastore-and-users/variables.tf @@ -0,0 +1,81 @@ +variable "databricks_workspace_name" { + description = "Azure databricks workspace name" +} +variable "environment" { + description = "The environment to deploy (dev, stg, prod)" + type = string +} + +variable "subscription_id" { + description = "Azure subscription id" +} + +variable "resource_group" { + description = "Azure resource group" +} + +variable "region" { + description = "Azure region" + type = string + +} +variable "aad_groups" { + description = "List of AAD groups that you want to add to Databricks account" + type = list(string) +} +variable "account_id" { + description = "Azure databricks account id" +} +variable "prefix" { + description = "Prefix to be used with resource names" +} + +variable "databricks_workspace_host_url" { + description = "Databricks workspace host url" +} + +variable "databricks_workspace_id" { + description = "Databricks workspace id" +} + +variable "metastore_name" { + description = "Name of the metastore" +} + +data "azurerm_resource_group" "this" { + name = var.resource_group +} + +data "azurerm_databricks_workspace" "this" { + name = var.databricks_workspace_name + resource_group_name = var.resource_group +} + +// Read group members of given groups from AzureAD every time Terraform is started +data "azuread_group" "this" { + for_each = local.aad_groups + display_name = each.value +} + +// Extract information about real users +data "azuread_users" "users" { + ignore_missing = true + object_ids = local.all_members +} + +// Extract information about service principals +data "azuread_service_principals" "spns" { + object_ids = toset(setsubtract(local.all_members, data.azuread_users.users.object_ids)) +} + +# Extract information about real account admin users +data "azuread_users" "account_admin_users" { + ignore_missing = true + object_ids = local.account_admin_members +} + +# Get Databricks Metastore +data "databricks_metastore" "existing" { + provider = databricks.azure_account + name = var.metastore_name +} diff --git a/single_tech_samples/databricks/databricks_terraform/README.md b/single_tech_samples/databricks/databricks_terraform/README.md new file mode 100644 index 000000000..e84a5fd3c --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/README.md @@ -0,0 +1,48 
@@ +# Overview + +This sample contains a Databricks deployment using Terraform and a Databricks Asset Bundle deployment using GitHub Actions. + +## Prerequisites + +- VS Code with the Dev Containers extension installed (optional). If you want to use a devcontainer, check the instructions in the [Devcontainer Option](#devcontainer-option) section before you proceed. +- Azure resources deployed. See how to deploy the required resources [here](./Infra/README.md). + +## Databricks Asset Bundle Deployment + +In this sample, we deploy resources from the sandbox environment to the development environment using Databricks Asset Bundles. + +![Asset Bundle Deployment Pipeline](./images/databricks-asset-bundle-deploymeny-pipeline.png) + +[Visio drawing](https://microsoft.sharepoint.com/:u:/t/ExternalEcolabKitchenOS/EWM3kB69NGBBiy2s563pjJ0BeKWy1qgtgEznRvvufiseFg?e=s0Qohq) + +### Folder Structure + +The working directory `./single_tech_samples/databricks/databricks_terraform` comprises the following components: + +- `./Infra` - Terraform code to deploy Azure resources. +- `tests` - Contains a simple greeting Python script and a test script. These are used to create Databricks Workflows that run during the GitHub Actions CI/CD pipeline. +- `utils` - Contains the `generate-databricks-workflows.sh` script, which generates Databricks Workflows from the `tests` folder. +- `workflows` - Contains Databricks Workflows generated from the `tests` folder. +- `Taskfile.yml` - Contains two tasks: + - `collect-tests` - Collects all the test workflows during the CI/CD pipeline run. + - `run-tests` - Runs all the collected tests in the Databricks workspace during the CI/CD pipeline run. + +### Pipelines + +- `.github/workflows/adb-asset-bundle-linting.yml` - Pipeline to lint Databricks Python notebooks and workflows. +- `.github/workflows/adb-asset-bundle-sandbox-deployment.yml` - Pipeline to validate Databricks assets during `pull_request` and deploy them to the sandbox environment once the PR is merged. Tests are run as part of the deployment to the sandbox environment. +- `.github/workflows/adb-asset-bundle-dev-deployment.yml` - Pipeline to validate, deploy, and run the same test workflows in the development environment. It is triggered when the sandbox deployment succeeds. + +### Devcontainer Option + +You can use a devcontainer to simplify the development environment setup. The repository includes a `.devcontainer` folder with the necessary configuration files. To use the devcontainer: + +1. Open the repository in VS Code. +2. When prompted, reopen the repository in the container. + +### Steps to see it all in action + +1. Create a new branch from the `main` branch. +2. Create a pull request to merge the new branch into the `main` branch with some small changes. You will see the `adb-asset-bundle-linting` pipeline running and the `adb-asset-bundle-sandbox-deployment` pipeline validating the Databricks assets. +3. Once the PR is merged into the `main` branch, the `adb-asset-bundle-sandbox-deployment` pipeline will deploy the Databricks assets to the sandbox environment. You will observe the test workflows running in the Databricks Workflows within the sandbox environment. Once the test workflows are successful, the pipeline will complete successfully. +4. Once the sandbox deployment is successful, the `adb-asset-bundle-dev-deployment` pipeline will be triggered. This pipeline will deploy the Databricks assets to the development environment.
You will observe the test workflows running in the Databricks Workflows within the development environment. Once the test workflows are successful, the pipeline will complete successfully. diff --git a/single_tech_samples/databricks/databricks_terraform/Taskfile.yml b/single_tech_samples/databricks/databricks_terraform/Taskfile.yml new file mode 100644 index 000000000..45e650f12 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/Taskfile.yml @@ -0,0 +1,35 @@ +version: "3" + +tasks: + collect-tests: + desc: "Collect test workflows" + dir: "single_tech_samples/databricks/databricks_terraform" + cmds: + - | + echo "Collecting test workflows" + TEST_FLOWS=$(find ./workflows -name "*_test.job.yml" -exec basename {} .job.yml \; | tr '\n' ',' | sed 's/,$//') + if [ -z "$TEST_FLOWS" ]; then + echo "No test workflows found." + exit 0 + fi + echo "Test flows found: $TEST_FLOWS" + echo "test_flows=$TEST_FLOWS" >> $GITHUB_ENV + + run-tests: + desc: "Run Databricks test workflows" + dir: "single_tech_samples/databricks/databricks_terraform" + cmds: + - | + # Read test flows into a variable + TEST_FLOWS="{{.test_flows}}" + + # Set the Internal Field Separator to comma + IFS=',' + + # Loop through each flow and run it separately + for flow in $TEST_FLOWS; do + if [ -n "$flow" ]; then + echo "Running test flow: $flow" + databricks bundle run -t {{.ENV}} "$flow" + fi + done diff --git a/single_tech_samples/databricks/databricks_terraform/databricks.yml b/single_tech_samples/databricks/databricks_terraform/databricks.yml new file mode 100644 index 000000000..26d6663c3 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/databricks.yml @@ -0,0 +1,53 @@ +# Name of the bundle +bundle: + name: modern-data-warehouse-databricks-asset-bundles + +# Including the workflows to be used in the bundle. +# This will deploy the workflows to the Databricks workspace and can be used to run the workflows as part of CI/CD pipeline. +# In this case we are creating test workflows and running test in databricks workspace as part of CI/CD pipeline. +include: + - single_tech_samples/databricks/databricks_terraform/workflows/*.yml + +# Target Environment Configuration +# Each environment has its own resources in Azure. 
+targets: + # Sandbox + sandbox: + presets: + name_prefix: "sandbox_" + workspace: + host: + root_path: /Workspace/sandbox/${workspace.current_user.userName}/${bundle.name}/${bundle.target} + run_as: + service_principal_name: ${workspace.current_user.userName} + + dev: + presets: + name_prefix: "dev_" + default: true + workspace: + host: + root_path: /Workspace/dev/${workspace.current_user.userName}/${bundle.name}/${bundle.target} + run_as: + service_principal_name: ${workspace.current_user.userName} + + stg: + presets: + name_prefix: "stg_" + # default: true # only one target can be marked as default; dev is the default target + workspace: + host: + root_path: /Workspace/stg/${workspace.current_user.userName}/${bundle.name}/${bundle.target} + run_as: + service_principal_name: ${workspace.current_user.userName} + + prod: + presets: + name_prefix: "prod_" + # default: true # only one target can be marked as default; dev is the default target + workspace: + host: + root_path: /Workspace/prod/${workspace.current_user.userName}/${bundle.name}/${bundle.target} + run_as: + service_principal_name: ${workspace.current_user.userName} + diff --git a/single_tech_samples/databricks/databricks_terraform/images/architecture.png b/single_tech_samples/databricks/databricks_terraform/images/architecture.png new file mode 100644 index 000000000..19b33f4cd Binary files /dev/null and b/single_tech_samples/databricks/databricks_terraform/images/architecture.png differ diff --git a/single_tech_samples/databricks/databricks_terraform/images/databricks-asset-bundle-deploymeny-pipeline.png b/single_tech_samples/databricks/databricks_terraform/images/databricks-asset-bundle-deploymeny-pipeline.png new file mode 100644 index 000000000..2c646f0a4 Binary files /dev/null and b/single_tech_samples/databricks/databricks_terraform/images/databricks-asset-bundle-deploymeny-pipeline.png differ diff --git a/single_tech_samples/databricks/databricks_terraform/requirements-dev.txt b/single_tech_samples/databricks/databricks_terraform/requirements-dev.txt new file mode 100644 index 000000000..0ffbf6aed --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/requirements-dev.txt @@ -0,0 +1,29 @@ +## requirements-dev.txt: dependencies for local development. +## +## For defining dependencies used by jobs in Databricks Workflows, see +## https://docs.databricks.com/dev-tools/bundles/library-dependencies.html + +## Add code completion support for DLT +databricks-dlt + +## pytest is the default package used for testing +pytest + +## Dependencies for building wheel files +setuptools +wheel + +## databricks-connect can be used to run parts of this project locally. +## See https://docs.databricks.com/dev-tools/databricks-connect.html. +## +## databricks-connect is automatically installed if you're using Databricks +## extension for Visual Studio Code +## (https://docs.databricks.com/dev-tools/vscode-ext/dev-tasks/databricks-connect.html). +## +## To manually install databricks-connect, either follow the instructions +## at https://docs.databricks.com/dev-tools/databricks-connect.html +## to install the package system-wide. Or uncomment the line below to install a +## version of db-connect that corresponds to the Databricks Runtime version used +## for this project. +# +# databricks-connect>=15.4,<15.5 diff --git a/single_tech_samples/databricks/databricks_terraform/ruff.toml b/single_tech_samples/databricks/databricks_terraform/ruff.toml new file mode 100644 index 000000000..96955d98b --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/ruff.toml @@ -0,0 +1,84 @@ +# Exclude a variety of commonly ignored directories. 
+exclude = [ + ".bzr", + ".direnv", + ".eggs", + ".git", + ".git-rewrite", + ".hg", + ".ipynb_checkpoints", + ".mypy_cache", + ".nox", + ".pants.d", + ".pyenv", + ".pytest_cache", + ".pytype", + ".ruff_cache", + ".svn", + ".tox", + ".venv", + ".vscode", + "__pypackages__", + "_build", + "buck-out", + "build", + "dist", + "node_modules", + "site-packages", + "venv", +] + +# Same as Black. +line-length = 88 +indent-width = 4 + +# Assume Python 3.8 +target-version = "py38" + +[lint] +# Enable Pyflakes (`F`) and a subset of the pycodestyle (`E`) codes by default. +# Unlike Flake8, Ruff doesn't enable pycodestyle warnings (`W`) or +# McCabe complexity (`C901`) by default. +select = ["E4", "E7", "E9", "F"] +ignore = [ + "F401", # Ignore "imported but unused" (useful for module imports that are side-effectful) + "E402", # Module level import not top of file +] + +# Allow fix for all enabled rules (when `--fix` is provided). +fixable = ["ALL"] +unfixable = [] + +# Allow unused variables when underscore-prefixed. +dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$" + +[format] +# Like Black, use double quotes for strings. +quote-style = "double" + +# Like Black, indent with spaces, rather than tabs. +indent-style = "space" + +# Like Black, respect magic trailing commas. +skip-magic-trailing-comma = false + +# Like Black, automatically detect the appropriate line ending. +line-ending = "auto" + +# Enable auto-formatting of code examples in docstrings. Markdown, +# reStructuredText code/literal blocks and doctests are all supported. +# +# This is currently disabled by default, but it is planned for this +# to be opt-out in the future. +docstring-code-format = false + +# Set the line length limit used when formatting code snippets in +# docstrings. +# +# This only has an effect when the `docstring-code-format` setting is +# enabled. +docstring-code-line-length = "dynamic" + +[lint.per-file-ignores] +"__init__.py" = ["E402"] +"*.py" = ["F821"] \ No newline at end of file diff --git a/single_tech_samples/databricks/databricks_terraform/tests/hello_test.py b/single_tech_samples/databricks/databricks_terraform/tests/hello_test.py new file mode 100644 index 000000000..9bd129a82 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/tests/hello_test.py @@ -0,0 +1,18 @@ +# Databricks notebook source +# COMMAND ---------- + +import unittest + +class Greeter: + def __init__(self): + self.message = "Hello Test Message from Dummy File!" + +class TestGreeter(unittest.TestCase): + def test_greeter_message(self): + greeter = Greeter() + self.assertEqual(greeter.message, "Hello Test Message from Dummy File!", "The message should be 'Hello Test Message from Dummy File!'") + +if __name__ == "__main__": + unittest.main(argv=['first-arg-is-ignored'], exit=False) + +# COMMAND ---------- diff --git a/single_tech_samples/databricks/databricks_terraform/utils/generate-databricks-workflows.sh b/single_tech_samples/databricks/databricks_terraform/utils/generate-databricks-workflows.sh new file mode 100755 index 000000000..3a9768769 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/utils/generate-databricks-workflows.sh @@ -0,0 +1,57 @@ +#!/bin/bash + +### +# The below script generates a Databricks job YAML file for each test file in the tests directory. +# The script iterates over all the test files and generates a YAML file for each of them under the workflows directory. +### + +# Change to the root directory of the project +cd "$(dirname "$0")/.." 
|| exit + +# Set the directory where YAML files will be generated +OUTPUT_DIR="workflows/" +TEST_FOLDER_PATH="single_tech_samples/databricks/databricks_terraform/tests" + +mkdir -p "$OUTPUT_DIR" + +# Find all _test.py files from the root directory and iterate over them +for test_file in $(find ./tests -type f -name "*_test.py"); do + # Extract the base filename without extension + base_name=$(basename "$test_file" .py) + + # Define the path to the output YAML file + output_file="${OUTPUT_DIR}/${base_name}.job.yml" + + # Generate the YAML content via a heredoc written to the output file + cat <<EOF > "$output_file" +resources: + jobs: + ${base_name}: + name: ${base_name} + tasks: + - task_key: ${base_name} + notebook_task: + notebook_path: ${TEST_FOLDER_PATH}/${base_name} + base_parameters: + env: \${bundle.target} + source: GIT + + git_source: + git_url: https://github.com/Azure-Samples/modern-data-warehouse-dataops + git_provider: gitHub + git_branch: main + queue: + enabled: true + + job_clusters: + - job_cluster_key: job_cluster + new_cluster: + spark_version: 15.4.x-scala2.12 + node_type_id: Standard_D4ds_v5 + autoscale: + min_workers: 1 + max_workers: 4 +EOF + + echo "Generated YAML job template for: $base_name -> $output_file" +done diff --git a/single_tech_samples/databricks/databricks_terraform/workflows/hello_test.job.yml b/single_tech_samples/databricks/databricks_terraform/workflows/hello_test.job.yml new file mode 100644 index 000000000..eb8e78db5 --- /dev/null +++ b/single_tech_samples/databricks/databricks_terraform/workflows/hello_test.job.yml @@ -0,0 +1,27 @@ +resources: + jobs: + hello_test: + name: hello_test + tasks: + - task_key: hello_test + notebook_task: + notebook_path: single_tech_samples/databricks/databricks_terraform/tests/hello_test + base_parameters: + env: ${bundle.target} + source: GIT + + git_source: + git_url: https://github.com/Azure-Samples/modern-data-warehouse-dataops + git_provider: gitHub + git_branch: main + queue: + enabled: true + + job_clusters: + - job_cluster_key: job_cluster + new_cluster: + spark_version: 15.4.x-scala2.12 + node_type_id: Standard_D4ds_v5 + autoscale: + min_workers: 1 + max_workers: 4
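The files above define the infrastructure modules, the asset bundle targets, the Taskfile tasks, and the generated `hello_test` workflow, but the diff does not show how to exercise them locally. The sketch below is one possible end-to-end sequence under stated assumptions: it is run from the repository root, the Azure CLI and Databricks CLI are already authenticated, a root Terraform configuration under `./Infra` wires up the `adb-workspace`, `metastore-and-users`, and `adb-unity-catalog` modules, and the required Terraform variables (subscription ID, region, environment, AAD groups, and so on) are supplied in a `terraform.tfvars` of your own. The `sandbox` target and the `hello_test` job names come from `databricks.yml` and `workflows/hello_test.job.yml` in this diff; everything else is an assumption, not the sample's documented procedure.

```bash
# Provision the Azure and Databricks infrastructure (assumes a root module under ./Infra
# that consumes the modules shown above, plus a terraform.tfvars with subscription_id,
# region, environment, aad_groups, ...).
terraform -chdir=single_tech_samples/databricks/databricks_terraform/Infra init
terraform -chdir=single_tech_samples/databricks/databricks_terraform/Infra apply

# Validate and deploy the asset bundle to the sandbox target defined in databricks.yml.
cd single_tech_samples/databricks/databricks_terraform
databricks bundle validate -t sandbox
databricks bundle deploy -t sandbox

# Run the generated test workflow directly ...
databricks bundle run -t sandbox hello_test

# ... or mimic what the CI pipelines do through the Taskfile. GITHUB_ENV must point at a
# writable file outside GitHub Actions, and the --taskfile path and working directory may
# need adjusting to match your checkout.
cd ../../..
export GITHUB_ENV=/tmp/github.env
task --taskfile single_tech_samples/databricks/databricks_terraform/Taskfile.yml collect-tests
task --taskfile single_tech_samples/databricks/databricks_terraform/Taskfile.yml run-tests ENV=sandbox test_flows=hello_test
```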