System Info / 系統信息

When running single-GPU LoRA fine-tuning of CogVideoX on one A100, we encountered the following errors related to the LR scheduler:
(ResearchProject) root@n469zpgtu9:/notebooks/CogVideo1/finetune# bash finetune_single_rank.sh
Traceback (most recent call last):
File "/notebooks/miniconda3/envs/ResearchProject/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1084, in launch_command
args, defaults, mp_from_config_flag = _validate_launch_command(args)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/launch.py", line 921, in _validate_launch_command
defaults = load_config_from_file(args.config_file)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/config/config_args.py", line 72, in load_config_from_file
return config_class.from_yaml_file(yaml_file=config_file)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/config/config_args.py", line 152, in from_yaml_file
raise ValueError(
ValueError: The config file at accelerate_config_machine_single.yaml had unknown keys (['lr_scheduler']), please try upgrading your `accelerate` version or fix (and potentially remove) these keys from your config file.
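The first failure is accelerate's config parser rejecting a top-level lr_scheduler key in accelerate_config_machine_single.yaml; that key is not part of accelerate's launcher config schema (the LR scheduler is set through the training script's --lr_scheduler flag, not the launcher config). For reference, a minimal single-GPU DeepSpeed launcher config without the offending key might look like the sketch below. This is an illustrative reconstruction matched to the ds_config printed by the second run, not our exact file:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false

With the offending key removed, a second run gets past config parsing but then fails inside accelerator.prepare():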
(ResearchProject) root@n469zpgtu9:/notebooks/CogVideo1/finetune# bash finetune_single_rank.sh
[2024-11-13 23:55:44,456] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-13 23:55:49,040] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-13 23:55:50,103] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-13 23:55:50,103] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W1113 23:55:50.229374271 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
11/13/2024 23:55:50 - INFO - __main__ - Distributed environment: DEEPSPEED Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Downloading shards: 100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 11199.74it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████| 2/2 [00:02<00:00, 1.05s/it]
{'use_learned_positional_embeddings'} was not found in config. Values will be initialized to default values.
11/13/2024 23:55:59 - WARNING - __main__ - use_8bit_adam is ignored when optimizer is not set to 'Adam' or 'AdamW'. Optimizer was set to adamw
[rank0]: Traceback (most recent call last):
[rank0]: File "/notebooks/CogVideo1/finetune/train_cogvideox_lora.py", line 1543, in <module>
[rank0]: main(args)
[rank0]: File "/notebooks/CogVideo1/finetune/train_cogvideox_lora.py", line 1238, in main
[rank0]: transformer, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
[rank0]: File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank0]: result = self._prepare_deepspeed(*args)
[rank0]: File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/accelerator.py", line 1686, in _prepare_deepspeed
[rank0]: raise ValueError(
[rank0]: ValueError: Either specify a scheduler in the config file or pass in the `lr_scheduler_callable` parameter when using `accelerate.utils.DummyScheduler`.
E1113 23:59:33.424000 140255595804480 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2471) of binary: /notebooks/miniconda3/envs/ResearchProject/bin/python
Traceback (most recent call last):
File "/notebooks/miniconda3/envs/ResearchProject/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1091, in launch_command
deepspeed_launcher(args)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/launch.py", line 787, in deepspeed_launcher
distrib_run.run(args)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_cogvideox_lora.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-13_23:59:33
host : n469zpgtu9
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2471)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
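The second failure comes from accelerate's DeepSpeed integration: judging by the message, the scheduler object reaching accelerator.prepare() is an accelerate.utils.DummyScheduler, while the DeepSpeed config printed above has no "scheduler" block and no lr_scheduler_callable was supplied, so accelerate has no way to tell DeepSpeed how to build the LR schedule. Below is a minimal sketch of the workaround the error message itself suggests; it is not the CogVideo training code, the toy optimizer and max_train_steps are placeholders, and build_lr_scheduler is an illustrative name that simply mirrors the CLI flags in finetune_single_rank.sh:

import torch
from accelerate.utils import DummyScheduler
from diffusers.optimization import get_scheduler

max_train_steps = 1000  # placeholder; the real script derives this from epochs and dataloader length
params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for the LoRA parameters
optimizer = torch.optim.AdamW(params, lr=1e-3, betas=(0.9, 0.95))

def build_lr_scheduler(opt):
    # cosine_with_restarts with 200 warmup steps and 1 cycle, matching the command-line flags
    return get_scheduler(
        "cosine_with_restarts",
        optimizer=opt,
        num_warmup_steps=200,
        num_training_steps=max_train_steps,
        num_cycles=1,
    )

# With a callable attached, DeepSpeed builds the scheduler itself during
# accelerator.prepare(), so no "scheduler" entry is needed in the DeepSpeed config.
lr_scheduler = DummyScheduler(optimizer, lr_scheduler_callable=build_lr_scheduler)

The alternative the message mentions, adding a "scheduler" block to the DeepSpeed config file, would switch the run to one of DeepSpeed's built-in schedulers, which as far as we know do not include a cosine-with-restarts variant.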
Reproduction / 复现过程

Finetune single rank script (finetune_single_rank.sh):

#!/bin/bash
export MODEL_PATH="THUDM/CogVideoX-2b"
export CACHE_PATH="~/.cache"
export DATASET_PATH="/notebooks/Disney-VideoGeneration-Dataset"
export OUTPUT_PATH="cogvideox-lora-single-node"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
# If you are not using 8 GPUs, change num_processes in `accelerate_config_machine_single.yaml` to match your GPU count
accelerate launch --config_file accelerate_config_machine_single.yaml \
train_cogvideox_lora.py \
--gradient_checkpointing \
--pretrained_model_name_or_path $MODEL_PATH \
--cache_dir $CACHE_PATH \
--enable_tiling \
--enable_slicing \
--instance_data_root $DATASET_PATH \
--caption_column prompt.txt \
--video_column videos.txt \
--validation_prompt "DISNEY A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions:::A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance" \
--validation_prompt_separator ::: \
--num_validation_videos 1 \
--validation_epochs 100 \
--seed 42 \
--rank 128 \
--lora_alpha 64 \
--mixed_precision bf16 \
--output_dir $OUTPUT_PATH \
--height 480 \
--width 720 \
--fps 8 \
--max_num_frames 49 \
--skip_frames_start 0 \
--skip_frames_end 0 \
--train_batch_size 1 \
--num_train_epochs 30 \
--checkpointing_steps 1000 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-3 \
--lr_scheduler cosine_with_restarts \
--lr_warmup_steps 200 \
--lr_num_cycles 1 \
--optimizer AdamW \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--max_grad_norm 1.0 \
--allow_tf32 \
--report_to wandb \
--use_8bit_adam
bash finetune_single_rank.sh
Expected behavior / 期待表现
We expect the fine-tuning run to start and proceed past accelerator.prepare() without the scheduler errors above.