JARK Stack - Error while launching training step in the dogbooth Jupyter notebook #537

Open
1 task done
rivasdam opened this issue May 20, 2024 · 4 comments

@rivasdam

Description

After successfully deploying the JARK stack and connecting to the dogbooth Jupyter notebook to follow the execution steps, the launch training step (step 15) raises an error and training stops immediately.

Notebook dreambooth training step:

# Launch the training and push the output model to huggingface
! accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of [v]dog" \
  --resolution=768 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=1e-6 \
  --lr_scheduler="constant" \
  --enable_xformers_memory_efficient_attention \
  --use_8bit_adam \
  --lr_warmup_steps=0 \
  --max_train_steps=800 \
  --push_to_hub

Error output:

Steps: 0%| | 0/800 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/jovyan/diffusers/examples/dreambooth/train_dreambooth.py", line 1443, in <module>
    main(args)
  File "/home/jovyan/diffusers/examples/dreambooth/train_dreambooth.py", line 1224, in main
    for step, batch in enumerate(train_dataloader):
  File "/opt/conda/lib/python3.10/site-packages/accelerate/data_loader.py", line 454, in __iter__
    current_batch = next(dataloader_iter)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jovyan/diffusers/examples/dreambooth/train_dreambooth.py", line 673, in __getitem__
    instance_image = Image.open(self.instance_images_path[index % self.num_instance_images])
  File "/opt/conda/lib/python3.10/site-packages/PIL/Image.py", line 3227, in open
    fp = builtins.open(filename, "rb")
IsADirectoryError: [Errno 21] Is a directory: '/home/jovyan/diffusers/examples/dreambooth/dog/.huggingface'
Steps: 0%| | 0/800 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
    simple_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 688, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3.10', 'train_dreambooth.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1', '--instance_data_dir=dog', '--output_dir=dogbooth', '--instance_prompt=a photo of [v]dog', '--resolution=768', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=1e-6', '--lr_scheduler=constant', '--enable_xformers_memory_efficient_attention', '--use_8bit_adam', '--lr_warmup_steps=0', '--max_train_steps=800', '--push_to_hub']' returned non-zero exit status 1.
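
The innermost frame shows PIL being asked to open `dog/.huggingface`, which is a directory rather than an image. That suggests the training script treats every entry under `--instance_data_dir` as an image file. The following is a minimal diagnostic sketch under that assumption; `find_non_image_entries` and the extension set are hypothetical helpers for illustration, not part of diffusers or the notebook.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the actual dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def find_non_image_entries(instance_dir):
    """Return entries directly under instance_dir that are not regular image files.

    Subdirectories (including hidden ones such as '.huggingface') and files
    with an unexpected extension are both reported.
    """
    offenders = []
    for entry in sorted(Path(instance_dir).iterdir()):
        if not entry.is_file() or entry.suffix.lower() not in IMAGE_SUFFIXES:
            offenders.append(entry)
    return offenders
```

Calling `find_non_image_entries("dog")` from the notebook's working directory before the training step would list anything that could trip up the data loader.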

  • ✋ I have searched the open/closed issues and my issue is not listed.

Versions

  • Module version [Required]:

  • Terraform version:
    Terraform v1.8.3

  • Provider version(s):
    Terraform v1.8.3

Reproduction Code [Required]

Steps to reproduce the behavior:

  1. Login to Jupyter Hub
  2. Open dogbooth.ipynb notebook
  3. Run each notebook step from 1 to 15

Expected behavior

The dreambooth training process completes and the model is created, so that inference can proceed.

Actual behavior

The training fails.

Terminal Output screenshots

image
image

@BinaryKevin

I ran into the same problem: IsADirectoryError: [Errno 21] Is a directory: '/content/diffusers/examples/dreambooth/dog/.huggingface'. Have you solved it?

@shivkumr

Deleting the "/content/diffusers/examples/dreambooth/dog/.huggingface" directory resolved the issue, but I'm not sure how it was created.
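
The workaround above can be scripted in a notebook cell that runs before the training step. This is a minimal sketch, assuming the only offending entries are hidden directories directly under the instance directory; `remove_hidden_dirs` is a hypothetical helper, not part of the notebook or of diffusers. The thread does not establish where the directory comes from. One unconfirmed possibility is that some huggingface_hub releases write download metadata into a `.huggingface` folder inside the download target.

```python
import shutil
from pathlib import Path


def remove_hidden_dirs(instance_dir):
    """Delete hidden subdirectories (names starting with '.') directly under instance_dir.

    Regular files and symlinks are left untouched.
    Returns the names of the directories that were removed.
    """
    removed = []
    for entry in sorted(Path(instance_dir).iterdir()):
        if entry.name.startswith(".") and entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed
```

For the setup in this issue, the call would be `remove_hidden_dirs("dog")`, run from the directory that contains `train_dreambooth.py`.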

Contributor

This issue has been automatically marked as stale because it has been open 30 days
with no activity. Remove stale label or comment or this issue will be closed in 10 days

@github-actions github-actions bot added the stale label Jul 25, 2024
@rivasdam
Author

Deleting the "/content/diffusers/examples/dreambooth/dog/.huggingface" directory resolved the issue, but I'm not sure how it was created.

I will re-deploy and check this, will update soon.

@github-actions github-actions bot removed the stale label Jul 27, 2024