Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HuggingFaceProcessor with ProcessingStep results in import errors (similar to issues/2656) #4802

Open
solanki-ravi opened this issue Jul 25, 2024 · 5 comments
Labels
component: processing Relates to the SageMaker Processing Platform type: question

Comments

@solanki-ravi
Copy link

solanki-ravi commented Jul 25, 2024

Describe the bug
Using the HuggingFaceProcessor with ProcessingStep results in import errors similar to: #2656

2024-07-25T15:47:57.615Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 2: SAGEMAKER_INPUTS_DIR: command not found
2024-07-25T15:47:57.615Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 3: SAGEMAKER_OUTPUTS_DIR: command not found
2024-07-25T15:47:57.615Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 5: import: command not found
2024-07-25T15:47:57.615Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 6: import: command not found
2024-07-25T15:47:57.615Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 7: import: command not found
2024-07-25T15:47:57.616Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 8: import: command not found
...

To reproduce
Define the HuggingFace processor
huggingface_processor = HuggingFaceProcessor(
role=role,
transformers_version='4.4',
pytorch_version='1.6.0',
instance_count=1,
instance_type=f'{SAGEMAKER_GPU_INSTANCE_TYPE}',
command=["python3"]
)

step_evaluate = ProcessingStep(
name="...",
processor=huggingface_processor,
inputs=[
...
],
outputs=[
...
],
code="src/evaluate.py"
)

pipeline = Pipeline(
name="...",
steps=[ step_evaluate]
)

Execute the pipeline
pipeline.upsert(role_arn=role)
execution = pipeline.start()
execution.wait()

Expected behavior
Error should not happen, and evalute.py should be invoked by the container.

Screenshots or logs
Attached

System information
A description of your system. Please provide:

  • SageMaker Python SDK version:
    $ pip show sagemaker
    Name: sagemaker
    Version: 2.226.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans):
    $ pip show transformers
    Name: transformers
    Version: 4.42.4

$ pip show torch
Name: torch
Version: 2.1.2
log-events-viewer-result.csv

Additional context
related? #2656

@svia3 svia3 added the component: pipelines Relates to the SageMaker Pipeline Platform label Aug 1, 2024
@qidewenwhen
Copy link
Member

Hi @solanki-ravi , thanks for reaching out!

Given this job logs:

2024-07-25T15:47:57.615Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 2: SAGEMAKER_INPUTS_DIR: command not found
2024-07-25T15:47:57.615Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 3: SAGEMAKER_OUTPUTS_DIR: command not found
2024-07-25T15:47:57.615Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 5: import: command not found
2024-07-25T15:47:57.615Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 6: import: command not found
2024-07-25T15:47:57.615Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 7: import: command not found
2024-07-25T15:47:57.616Z | /opt/ml/processing/input/entrypoint/evaluate.py: line 8: import: command not found

The command not found error was thrown from your src/evaluate.py script. As this is a customer script, without it, we are not able to reproduce the same issue.

Could you check your script to see what commands are used in line 2-8 in src/evaluate.py?

@solanki-ravi
Copy link
Author

solanki-ravi commented Aug 2, 2024

Thanks @qidewenwhen. evaluate.py and path invocation script is inline here.

--- evaluate.py---

#!/usr/bin/env python3
SAGEMAKER_INPUTS_DIR = '/opt/ml/processing/input'
SAGEMAKER_OUTPUTS_DIR = '/opt/ml/processing/output'

import subprocess

#Install the required libraries before imports
subprocess.check_call(["pip", "install", "-r", "/opt/ml/processing/input/requirements/requirements.txt"])

import argparse
import os
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import json
import logging
import tarfile
import numpy as np

logging.basicConfig(level=logging.INFO)

class CustomDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

def compute_metrics(pred, labels):
    preds = pred.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_dir", type=str, default=f'{SAGEMAKER_INPUTS_DIR}/model')
    parser.add_argument("--test_file", type=str, default=f'{SAGEMAKER_INPUTS_DIR}/test/test')
    parser.add_argument("--output_dir", type=str, default=f'{SAGEMAKER_OUTPUTS_DIR}')
    parser.add_argument("--max_length", type=int, default=512)
    args = parser.parse_args()

    # Load dataset
    logging.info(f'Reading test data from: {args.test_file}.')
    test_df = pd.read_csv(args.test_file, header=None)
    logging.info(f'Successfully read test data from: {args.test_file}.')

    test_texts, test_labels = test_df.iloc[:, 1].tolist(), test_df.iloc[:, 0].tolist()

    logging.info(f'Test texts sample: {test_texts[:5]}')
    logging.info(f'Test labels sample: {test_labels[:5]}')
    logging.info(f'Number of test samples: {len(test_texts)}')

    # Extract the tar.gz model file
    model_tar_path = os.path.join(args.model_dir, "model.tar.gz")
    model_extracted_dir = os.path.join(args.model_dir, "extracted")
    logging.info(f'Extracting model.tar.gz, model_tar_path: {model_tar_path}, model_extracted_dir: {model_extracted_dir}')
    os.makedirs(model_extracted_dir, exist_ok=True)
    
    with tarfile.open(model_tar_path) as tar:
        tar.extractall(path=model_extracted_dir)
    
    # Tokenize data
    tokenizer = BertTokenizer.from_pretrained(model_extracted_dir)
    test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=args.max_length)

    # Create dataset
    test_dataset = CustomDataset(test_encodings, test_labels)

    # Load model
    model = BertForSequenceClassification.from_pretrained(model_extracted_dir)
    model.eval()

    # Create data loader
    test_loader = DataLoader(test_dataset, batch_size=16)

    # Perform evaluation
    all_preds = []
    all_labels = []

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    with torch.no_grad():
        for batch in test_loader:
            inputs = {key: val.cuda() for key, val in batch.items() if key != 'labels'}
            labels = batch['labels'].cuda()
            outputs = model(**inputs)
            logits = outputs.logits
            all_preds.append(logits.cpu().numpy())
            all_labels.append(labels.cpu().numpy())

    all_preds = np.concatenate(all_preds, axis=0)
    all_labels = np.concatenate(all_labels, axis=0)

    # Compute metrics
    metrics = compute_metrics(all_preds, all_labels)
    logging.info(f"Evaluation metrics: {metrics}")

    # Save metrics
    os.makedirs(args.output_dir, exist_ok=True)
    with open(os.path.join(args.output_dir, "evaluation_metrics.json"), "w") as f:
        json.dump(metrics, f)


and invocation code:

from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.workflow.steps import ProcessingInput, ProcessingOutput, ProcessingStep
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.image_uris import retrieve
from sagemaker.processing import ScriptProcessor
from sagemaker import get_execution_role

# Define the HuggingFace processor
# https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-processor

huggingface_processor = HuggingFaceProcessor(
    role=role,
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    #py_version='py36',
    instance_count=1,
    instance_type=SAGEMAKER_GPU_INSTANCE_TYPE,
    code_location = 'src/',
    command = ["python3", "evaluate.py"]
)

step_evaluate_bert = ProcessingStep(
    name="Newsgroups20-BERT-Evaluate",
    processor=huggingface_processor,
    inputs=[
        ProcessingInput(
            source='s3://<s3 bucket path ...>/model.tar.gz',
            #source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/input/model"
        ),
        ProcessingInput(
            source='s3://<s3 bucket path ...>/output/test/test',
            #source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/input/test"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="evaluation",
            source="/opt/ml/processing/output",
            destination=f'{SAGEMAKER_S3_BUCKET}/output/evaluation/'
        )
    ],
    code="src/evaluate.py"
)

# Update the pipeline to include the evaluation step
pipeline = Pipeline(
    name="Newsgroups20Pipeline-Bert-Process-Train-Evaluate",
    #steps=[step_process, step_train, step_evaluate]
    steps=[step_evaluate_bert]
)

@solanki-ravi
Copy link
Author

In order to make troubleshooting simple, I created this Notebook (attached) that results in the same error (no dependency on my code). @qidewenwhen

evaluate.py

import os

if __name__ == '__main__':

    # Print all environment variables
    for key, value in os.environ.items():
        print(f'{key}: {value}')

output logs:

2024-08-05T18:41:42.365-04:00/opt/ml/processing/input/entrypoint/evaluate.py: line 1: import: command not found | /opt/ml/processing/input/entrypoint/evaluate.py: line 1: import: command not found
2024-08-05T18:41:42.365-04:00 | /opt/ml/processing/input/entrypoint/evaluate.py: line 6: syntax error near unexpected token `value'
2024-08-05T18:41:42.365-04:00/opt/ml/processing/input/entrypoint/evaluate.py: line 6: `    for key, value in os.environ.items():'

Reproduce#4802.ipynb.zip

@qidewenwhen
Copy link
Member

Hi @solanki-ravi , sorry for the long delay. Did not receive notification.

And thanks so much for the reproduce notebook, which is very helpful!

Given the error, seems the evaluate.py is being interpreted as a shell script rather than a python script.
Thus, it's complaining "import: command not found"

I checked the pipeline definition in the notebook.

    "Steps": [{
        "Name": "Evaluate",
        "Type": "Processing",
        "Arguments": {
            "AppSpecification": {
                "ContainerEntrypoint": ["/bin/bash", "/opt/ml/processing/input/entrypoint/evaluate.py"]
            },
...

The ContainerEntrypoint starts with /bin/bash. Even if you've explicitly called out python3 in command

huggingface_processor = HuggingFaceProcessor(
    role=role,
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    #py_version='py36',
    instance_count=1,
    instance_type=SAGEMAKER_GPU_INSTANCE_TYPE,
    code_location = 'src/',
    command = ["python3", "evaluate.py"] <<<<<<<<<<
)

@qidewenwhen
Copy link
Member

I looked into the Processor code and found that the HuggingFaceProcessor extends the FrameworkProcessor, which hard-codes the entrypoint command to be /bin/bash

class FrameworkProcessor(ScriptProcessor):
"""Handles Amazon SageMaker processing tasks for jobs using a machine learning framework."""
framework_entrypoint_command = ["/bin/bash"]

In the _set_entrypoint method, the bash command is used and it also mentions that the user input command is ignored in favor of the self.framework_entrypoint_command

def _set_entrypoint(self, command, user_script_name):
"""Framework processor override for setting processing job entrypoint.
Args:
command ([str]): Ignored in favor of self.framework_entrypoint_command
user_script_name (str): A filename with an extension.
"""
user_script_location = str(
pathlib.PurePosixPath(
self._CODE_CONTAINER_BASE_PATH, self._CODE_CONTAINER_INPUT_NAME, user_script_name
)
)
self.entrypoint = self.framework_entrypoint_command + [user_script_location]

Given the above, seems the framework containers including HuggingFace expects bash scripts only.

As my team does not own these processors, will need the corresponding team for further assistance if you hope to expand the usage for python code.
I'll relabel this issue so that it can be redirected to the right team.

@qidewenwhen qidewenwhen added component: processing Relates to the SageMaker Processing Platform and removed component: pipelines Relates to the SageMaker Pipeline Platform labels Nov 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: processing Relates to the SageMaker Processing Platform type: question
Projects
None yet
Development

No branches or pull requests

3 participants