
[azure-ml] Serverless Spark compute fails with missing Synapse cluster identifier #39646

Open
aagatavanade opened this issue Feb 10, 2025 · 1 comment

@aagatavanade

  • Package Name: azure-ml
  • Package Version: 2.34.0
  • Operating System:
  • Python Version: 3.12.8

Describe the bug
I am trying to execute a Python script on serverless Spark compute in Azure Machine Learning.
I have attached a user-assigned managed identity to the AML workspace and am defining the Spark component accordingly.
I submitted the pipeline and its Spark job using the CLI.
Soon after the component starts running, I get the following native error:

Traceback (most recent call last):
  File "data_prep/data_prep.py", line 75, in <module>
    run()
  File "data_prep/data_prep.py", line 57, in run
    df = table.to_pandas_dataframe()
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_loggerfactory.py", line 279, in wrapper
    return func(*args, **kwargs)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/mltable/mltable.py", line 1311, in to_pandas_dataframe
    raise e
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/mltable/mltable.py", line 1308, in to_pandas_dataframe
    raise _reclassify_rslex_error(e)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/mltable/_validation_and_error_handler.py", line 90, in _reclassify_rslex_error
    raise err
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/mltable/mltable.py", line 1300, in to_pandas_dataframe
    return get_dataframe_reader().to_pandas_dataframe(self._dataflow)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 355, in to_pandas_dataframe
    return _execute(
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 266, in _execute
    raise e
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 246, in _execute
    return rslex_execute()
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_dataframereader.py", line 177, in rslex_execute
    (batches, num_partitions, stream_columns) = executor.execute_dataflow(dataflow,
  File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/azureml/dataprep/api/_rslex_executor.py", line 26, in execute_dataflow
    (batches, num_partitions, stream_columns) = Executor().execute_dataflow(script,
azureml.dataprep.api.errorhandlers.ExecutionError: 
Error Code: ScriptExecution.StreamAccess.Unexpected
Native Error: Dataflow visit error: ExecutionError(StreamError(Unknown("An unexpected error occurred while resolving AccessToken: Missing env var: 'AZUREML_SYNAPSE_CLUSTER_IDENTIFIER'", None)))
	VisitError(ExecutionError(StreamError(Unknown("An unexpected error occurred while resolving AccessToken: Missing env var: 'AZUREML_SYNAPSE_CLUSTER_IDENTIFIER'", None))))
=> Failed with execution error: error in streaming from input data sources
	ExecutionError(StreamError(Unknown("An unexpected error occurred while resolving AccessToken: Missing env var: 'AZUREML_SYNAPSE_CLUSTER_IDENTIFIER'", None)))
Error Message: Got unexpected error: An unexpected error occurred while resolving AccessToken: Missing env var: 'AZUREML_SYNAPSE_CLUSTER_IDENTIFIER'. | session_id=xxxxxxx

Exiting content uploader.

I haven't seen any mention of setting the environment variable AZUREML_SYNAPSE_CLUSTER_IDENTIFIER for the Spark component.
The managed identity has the AI Developer role assigned.
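For reference, a minimal diagnostic sketch that can be dropped at the top of the entry script to confirm whether the variable is actually set on the serverless cluster (AZUREML_SYNAPSE_CLUSTER_IDENTIFIER is the name taken from the error above; the prefix filter is just for inspection):

import os

# Print every AzureML/Synapse-related variable visible to the driver process.
# AZUREML_SYNAPSE_CLUSTER_IDENTIFIER is the variable named in the error;
# anything else that shows up here is incidental.
for key, value in sorted(os.environ.items()):
    if key.startswith(("AZUREML", "SYNAPSE")):
        print(f"{key}={value!r}")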

To Reproduce
Steps to reproduce the behavior:

  1. Create a datastore with parquet files as input for the Spark component.
  2. Define Spark component:
$schema: http://azureml/sdk-2-0/SparkComponent.json
type: spark

name: model
display_name: Data preparation for model
description: Read input data, preprocess it, and split it into train and test sets
version: 0.0.0
is_deterministic: true

code: ../../src/components
entry:
  file: data_prep/data_prep.py 

inputs:
  data:
    type: uri_folder
    mode: direct
  test_train_ratio:
    type: number

outputs:
  train_data:
    type: mltable
    mode: direct
  test_data:
    type: mltable
    mode: direct

identity:
  type: managed
resources:
  instance_type: "Standard_E4S_V3"
  runtime_version: "3.4.0"

conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 1
  spark.dynamicAllocation.enabled: True
  spark.dynamicAllocation.minExecutors: 1
  spark.dynamicAllocation.maxExecutors: 4
  spark.hadoop.aml.enable_cache: True
  spark.aml.internal.system.job: True
  spark.synapse.library.python.env: |
    channels:
      - defaults
    dependencies:
      - python=3.10
      - pip:
        - scipy~=1.10.0
        - mltable~=1.6.1
        - azureml-fsspec
        - fsspec~=2023.4.0

    name: momo-base-spark

args: >-
  --data ${{inputs.data}} 
  --test_train_ratio ${{inputs.test_train_ratio}}
  --train_data ${{outputs.train_data}}
  --test_data ${{outputs.test_data}}
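As a sanity check on the component YAML above, a minimal sketch that loads and validates it locally with the azure-ai-ml SDK (the relative path is assumed from the "component:" reference in the pipeline definition below):

from azure.ai.ml import load_component

# Load the Spark component definition from YAML; this performs local schema
# validation before any job is submitted. The path mirrors the pipeline's
# "component: ../components/data_prep.yml" reference and may need adjusting.
component = load_component(source="../components/data_prep.yml")
print(component.name, component.version)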
  3. Define the Spark component script (data_prep/data_prep.py):
import argparse

import mltable
import pandas as pd
from sklearn.model_selection import train_test_split

from shared_utilities.constants import PORT_APPLICATION_MAP, APPS_OF_INTEREST
from shared_utilities.utils import init_spark, save_spark_df_as_mltable

def run():
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="input path to data")
    parser.add_argument("--test_train_ratio", type=float, default=0.2, required=False, help="test train ratio")
    parser.add_argument("--train_data", type=str, help="output path to train data")
    parser.add_argument("--test_data", type=str, help="output path to test data")
    args = parser.parse_args()

    # Load the data
    # df = pd.read_parquet(args.data)
    paths = [
        {"pattern": f"{args.data}processDate=*/*.parquet"},
    ]
    table = mltable.from_parquet_files(paths=paths)
    print('Paths: ', paths)
    partition_format = "processDate={PartitionDate:yyyy-MM-dd}/*.parquet"
    table = table.extract_columns_from_partition_format(partition_format)
    print('Table: ', table)
    df = table.to_pandas_dataframe()  # <- pipeline fails here
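A possible workaround here may be to bypass mltable's pandas materialization (the point where rslex tries to resolve the access token) and read the Parquet files with the Spark session directly. This is only a sketch, assuming the direct input mode hands the script an abfss:// folder URI in args.data:

from pyspark.sql import SparkSession

# Reuse the Spark session provided by the serverless compute.
spark = SparkSession.builder.getOrCreate()

# Read the partitioned Parquet files with Spark instead of
# mltable.to_pandas_dataframe(); "basePath" lets Spark recover the
# processDate=... partition column from the folder layout.
df = (
    spark.read
    .option("basePath", args.data)
    .parquet(f"{args.data}processDate=*/*.parquet")
)
df.printSchema()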
  4. Define the AML pipeline:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

experiment_name: model-training
display_name: model-training
description: ML model training pipeline

settings:
  default_compute: azureml:serverless

inputs:
  pipeline_job_input_data: 
    type: uri_folder
    path:  XXXXXXX
    mode: direct
  pipeline_job_test_train_ratio: 0.2
  pipeline_job_n_estimators: 400
  pipeline_job_max_depth: 40
  pipeline_job_min_samples_split: 10
  pipeline_job_random_state: 42
  pipeline_job_registered_model_name: 'network_traffic_tagging_model'


jobs:
  data_prep_job:
    type: spark
    component: ../components/data_prep.yml
    identity:
      type: managed
    resources:
      instance_type: "Standard_E4S_V3"
      runtime_version: "3.4.0"
    inputs:
      data: ${{parent.inputs.pipeline_job_input_data}}
      test_train_ratio: ${{parent.inputs.pipeline_job_test_train_ratio}}
    outputs:
      train_data: 
        type: mltable
        mode: direct
      test_data: 
        type: mltable
        mode: direct
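For completeness, the submission itself, sketched with the azure-ai-ml Python SDK as an equivalent of the CLI command I use (the subscription, resource group, workspace, and file names below are placeholders):

from azure.ai.ml import MLClient, load_job
from azure.identity import DefaultAzureCredential

# Placeholders: substitute the real subscription, resource group, and
# workspace names.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Load the pipeline YAML shown above and submit it as a job.
pipeline_job = load_job(source="pipeline.yml")
returned_job = ml_client.jobs.create_or_update(pipeline_job)
print(returned_job.studio_url)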

Expected behavior
I expected the component to run its code using a managed identity.

I am not sure why the component is failing. I assumed that when using serverless Spark compute, there would be no need to specify a Synapse cluster manually.
Any thoughts or help will be deeply appreciated.


Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/azure-ml-sdk @azureml-github.
