
[azure-ml] Serverless Spark compute fails with missing Synapse cluster identifier #39646

Open
aagatavanade opened this issue Feb 10, 2025 · 1 comment

@aagatavanade

  • Package Name: azure-ml
  • Package Version: 2.34.0
  • Operating System:
  • Python Version: 3.12.8

Describe the bug
I am trying to execute a Python script on serverless Spark compute in Azure Machine Learning.
I have attached a user-assigned managed identity to the AML workspace and am defining the Spark component accordingly.
I submitted the pipeline and its Spark job using the CLI.
Soon after the component starts running, I get the following native error:

Traceback (most recent call last):
  File "data_prep/data_prep.py", line 75, in <module>
    run()
  File "data_prep/data_prep.py", line 57, in run
    df = table.to_pandas_dataframe()
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_loggerfactory.py", line 279, in wrapper
    return func(*args, **kwargs)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/mltable/mltable.py", line 1311, in to_pandas_dataframe
    raise e
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/mltable/mltable.py", line 1308, in to_pandas_dataframe
    raise _reclassify_rslex_error(e)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/mltable/_validation_and_error_handler.py", line 90, in _reclassify_rslex_error
    raise err
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/mltable/mltable.py", line 1300, in to_pandas_dataframe
    return get_dataframe_reader().to_pandas_dataframe(self._dataflow)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 355, in to_pandas_dataframe
    return _execute(
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 266, in _execute
    raise e
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 246, in _execute
    return rslex_execute()
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 177, in rslex_execute
    (batches, num_partitions, stream_columns) = executor.execute_dataflow(dataflow,
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_rslex_executor.py", line 26, in execute_dataflow
    (batches, num_partitions, stream_columns) = Executor().execute_dataflow(script,
azureml.dataprep.api.errorhandlers.ExecutionError: 
Error Code: ScriptExecution.StreamAccess.Unexpected
Native Error: Dataflow visit error: ExecutionError(StreamError(Unknown("An unexpected error occurred while resolving AccessToken: Missing env var: 'AZUREML_SYNAPSE_CLUSTER_IDENTIFIER'", None)))
	VisitError(ExecutionError(StreamError(Unknown("An unexpected error occurred while resolving AccessToken: Missing env var: 'AZUREML_SYNAPSE_CLUSTER_IDENTIFIER'", None))))
=> Failed with execution error: error in streaming from input data sources
	ExecutionError(StreamError(Unknown("An unexpected error occurred while resolving AccessToken: Missing env var: 'AZUREML_SYNAPSE_CLUSTER_IDENTIFIER'", None)))
Error Message: Got unexpected error: An unexpected error occurred while resolving AccessToken: Missing env var: 'AZUREML_SYNAPSE_CLUSTER_IDENTIFIER'. | session_id=xxxxxxx

Exiting content uploader.

I haven't seen any mention of setting the environment variable AZUREML_SYNAPSE_CLUSTER_IDENTIFIER for the Spark component.
The managed identity has the AI Developer role assigned.
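For reference, a minimal diagnostic sketch that can be dropped at the top of the entry script to confirm whether the variable is actually set on the serverless cluster (AZUREML_SYNAPSE_CLUSTER_IDENTIFIER is the name taken from the error above; the prefix filter is just for inspection):

import os

# Print every AzureML/Synapse-related variable visible to the driver process.
# AZUREML_SYNAPSE_CLUSTER_IDENTIFIER is the variable named in the error;
# anything else that shows up here is incidental.
for key, value in sorted(os.environ.items()):
    if key.startswith(("AZUREML", "SYNAPSE")):
        print(f"{key}={value!r}")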

To Reproduce
Steps to reproduce the behavior:

  1. Create a datastore with parquet files as input for the Spark component.
  2. Define Spark component:
$schema: http://azureml/sdk-2-0/SparkComponent.json
type: spark

name: model
display_name: Data preparation for model
description: Read input data, preprocess it, and split it into train and test sets
version: 0.0.0
is_deterministic: true

code: ../../src/components
entry:
  file: data_prep/data_prep.py 

inputs:
  data:
    type: uri_folder
    mode: direct
  test_train_ratio:
    type: number

outputs:
  train_data:
    type: mltable
    mode: direct
  test_data:
    type: mltable
    mode: direct

identity:
  type: managed
resources:
  instance_type: "Standard_E4S_V3"
  runtime_version: "3.4.0"

conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 1
  spark.dynamicAllocation.enabled: True
  spark.dynamicAllocation.minExecutors: 1
  spark.dynamicAllocation.maxExecutors: 4
  spark.hadoop.aml.enable_cache: True
  spark.aml.internal.system.job: True
  spark.synapse.library.python.env: |
    channels:
      - defaults
    dependencies:
      - python=3.10
      - pip:
        - scipy~=1.10.0
        - mltable~=1.6.1
        - azureml-fsspec
        - fsspec~=2023.4.0

    name: momo-base-spark

args: >-
  --data ${{inputs.data}} 
  --test_train_ratio ${{inputs.test_train_ratio}}
  --train_data ${{outputs.train_data}}
  --test_data ${{outputs.test_data}}
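As a sanity check on the component YAML above, a minimal sketch that loads and validates it locally with the azure-ai-ml SDK (the relative path is assumed from the "component:" reference in the pipeline definition below):

from azure.ai.ml import load_component

# Load the Spark component definition from YAML; this performs local schema
# validation before any job is submitted. The path mirrors the pipeline's
# "component: ../components/data_prep.yml" reference and may need adjusting.
component = load_component(source="../components/data_prep.yml")
print(component.name, component.version)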
  3. Define the Spark component script (data_prep/data_prep.py):
import argparse

import mltable
import pandas as pd
from sklearn.model_selection import train_test_split

from shared_utilities.constants import PORT_APPLICATION_MAP, APPS_OF_INTEREST
from shared_utilities.utils import init_spark, save_spark_df_as_mltable

def run():
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="input path to data")
    parser.add_argument("--test_train_ratio", type=float, default=0.2, required=False, help="test train ratio")
    parser.add_argument("--train_data", type=str, help="output path to train data")
    parser.add_argument("--test_data", type=str, help="output path to test data")
    args = parser.parse_args()

    # Load the data
    # df = pd.read_parquet(args.data)
    paths = [
        {"pattern": f"{args.data}processDate=*/*.parquet"},
    ]
    table = mltable.from_parquet_files(paths=paths)
    print('Paths: ', paths)
    partition_format = "processDate={PartitionDate:yyyy-MM-dd}/*.parquet"
    table = table.extract_columns_from_partition_format(partition_format)
    print('Table: ', table)
    df = table.to_pandas_dataframe()  # <- pipeline fails here
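A possible workaround here may be to bypass mltable's pandas materialization (the point where rslex tries to resolve the access token) and read the Parquet files with the Spark session directly. This is only a sketch, assuming the direct input mode hands the script an abfss:// folder URI in args.data:

from pyspark.sql import SparkSession

# Reuse the Spark session provided by the serverless compute.
spark = SparkSession.builder.getOrCreate()

# Read the partitioned Parquet files with Spark instead of
# mltable.to_pandas_dataframe(); "basePath" lets Spark recover the
# processDate=... partition column from the folder layout.
df = (
    spark.read
    .option("basePath", args.data)
    .parquet(f"{args.data}processDate=*/*.parquet")
)
df.printSchema()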
  4. Define the AML pipeline:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

experiment_name: model-training
display_name: model-training
description: ML model training pipeline

settings:
  default_compute: azureml:serverless

inputs:
  pipeline_job_input_data: 
    type: uri_folder
    path:  XXXXXXX
    mode: direct
  pipeline_job_test_train_ratio: 0.2
  pipeline_job_n_estimators: 400
  pipeline_job_max_depth: 40
  pipeline_job_min_samples_split: 10
  pipeline_job_random_state: 42
  pipeline_job_registered_model_name: 'network_traffic_tagging_model'


jobs:
  data_prep_job:
    type: spark
    component: ../components/data_prep.yml
    identity:
      type: managed
    resources:
      instance_type: "Standard_E4S_V3"
      runtime_version: "3.4.0"
    inputs:
      data: ${{parent.inputs.pipeline_job_input_data}}
      test_train_ratio: ${{parent.inputs.pipeline_job_test_train_ratio}}
    outputs:
      train_data: 
        type: mltable
        mode: direct
      test_data: 
        type: mltable
        mode: direct
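For completeness, the submission itself, sketched with the azure-ai-ml Python SDK as an equivalent of the CLI command I use (the subscription, resource group, workspace, and file names below are placeholders):

from azure.ai.ml import MLClient, load_job
from azure.identity import DefaultAzureCredential

# Placeholders: substitute the real subscription, resource group, and
# workspace names.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Load the pipeline YAML shown above and submit it as a job.
pipeline_job = load_job(source="pipeline.yml")
returned_job = ml_client.jobs.create_or_update(pipeline_job)
print(returned_job.studio_url)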

Expected behavior
I expected the component to run its code using a managed identity.

I am not sure why the component is failing. I assumed that when using serverless Spark compute, there would be no need to specify a Synapse cluster manually.
Any thoughts or help will be deeply appreciated.


Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/azure-ml-sdk @azureml-github.
