System Info / 系統信息

When running single-GPU LoRA fine-tuning of CogVideoX on one A100, we encountered the following errors related to the LR scheduler:
(ResearchProject) root@n469zpgtu9:/notebooks/CogVideo1/finetune# bash finetune_single_rank.sh
Traceback (most recent call last):
File "/notebooks/miniconda3/envs/ResearchProject/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1084, in launch_command
args, defaults, mp_from_config_flag = _validate_launch_command(args)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/launch.py", line 921, in _validate_launch_command
defaults = load_config_from_file(args.config_file)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/config/config_args.py", line 72, in load_config_from_file
return config_class.from_yaml_file(yaml_file=config_file)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/config/config_args.py", line 152, in from_yaml_file
raise ValueError(
ValueError: The config file at accelerate_config_machine_single.yaml had unknown keys (['lr_scheduler']), please try upgrading your `accelerate` version or fix (and potentially remove) these keys from your config file.
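The first failure is accelerate's config parser rejecting a top-level lr_scheduler key in accelerate_config_machine_single.yaml; that key is not part of accelerate's launcher config schema (the LR scheduler is set through the training script's --lr_scheduler flag, not the launcher config). For reference, a minimal single-GPU DeepSpeed launcher config without the offending key might look like the sketch below. This is an illustrative reconstruction matched to the ds_config printed by the second run, not our exact file:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false

With the offending key removed, a second run gets past config parsing but then fails inside accelerator.prepare():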
(ResearchProject) root@n469zpgtu9:/notebooks/CogVideo1/finetune# bash finetune_single_rank.sh
[2024-11-13 23:55:44,456] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-13 23:55:49,040] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-13 23:55:50,103] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-13 23:55:50,103] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W1113 23:55:50.229374271 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
11/13/2024 23:55:50 - INFO - __main__ - Distributed environment: DEEPSPEED Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Downloading shards: 100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 11199.74it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████| 2/2 [00:02<00:00, 1.05s/it]
{'use_learned_positional_embeddings'} was not found in config. Values will be initialized to default values.
11/13/2024 23:55:59 - WARNING - __main__ - use_8bit_adam is ignored when optimizer is not set to 'Adam' or 'AdamW'. Optimizer was set to adamw
[rank0]: Traceback (most recent call last):
[rank0]: File "/notebooks/CogVideo1/finetune/train_cogvideox_lora.py", line 1543, in <module>
[rank0]: main(args)
[rank0]: File "/notebooks/CogVideo1/finetune/train_cogvideox_lora.py", line 1238, in main
[rank0]: transformer, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
[rank0]: File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank0]: result = self._prepare_deepspeed(*args)
[rank0]: File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/accelerator.py", line 1686, in _prepare_deepspeed
[rank0]: raise ValueError(
[rank0]: ValueError: Either specify a scheduler in the config file or pass in the `lr_scheduler_callable` parameter when using `accelerate.utils.DummyScheduler`.
E1113 23:59:33.424000 140255595804480 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2471) of binary: /notebooks/miniconda3/envs/ResearchProject/bin/python
Traceback (most recent call last):
File "/notebooks/miniconda3/envs/ResearchProject/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1091, in launch_command
deepspeed_launcher(args)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/accelerate/commands/launch.py", line 787, in deepspeed_launcher
distrib_run.run(args)
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/notebooks/miniconda3/envs/ResearchProject/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_cogvideox_lora.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-13_23:59:33
host : n469zpgtu9
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2471)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
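The second failure comes from accelerate's DeepSpeed integration: judging by the message, the scheduler object reaching accelerator.prepare() is an accelerate.utils.DummyScheduler, while the DeepSpeed config printed above has no "scheduler" block and no lr_scheduler_callable was supplied, so accelerate has no way to tell DeepSpeed how to build the LR schedule. Below is a minimal sketch of the workaround the error message itself suggests; it is not the CogVideo training code, the toy optimizer and max_train_steps are placeholders, and build_lr_scheduler is an illustrative name that simply mirrors the CLI flags in finetune_single_rank.sh:

import torch
from accelerate.utils import DummyScheduler
from diffusers.optimization import get_scheduler

max_train_steps = 1000  # placeholder; the real script derives this from epochs and dataloader length
params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for the LoRA parameters
optimizer = torch.optim.AdamW(params, lr=1e-3, betas=(0.9, 0.95))

def build_lr_scheduler(opt):
    # cosine_with_restarts with 200 warmup steps and 1 cycle, matching the command-line flags
    return get_scheduler(
        "cosine_with_restarts",
        optimizer=opt,
        num_warmup_steps=200,
        num_training_steps=max_train_steps,
        num_cycles=1,
    )

# With a callable attached, DeepSpeed builds the scheduler itself during
# accelerator.prepare(), so no "scheduler" entry is needed in the DeepSpeed config.
lr_scheduler = DummyScheduler(optimizer, lr_scheduler_callable=build_lr_scheduler)

The alternative the message mentions, adding a "scheduler" block to the DeepSpeed config file, would switch the run to one of DeepSpeed's built-in schedulers, which as far as we know do not include a cosine-with-restarts variant.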
Reproduction / 复现过程

Finetune single rank script (finetune_single_rank.sh):

#!/bin/bash
export MODEL_PATH="THUDM/CogVideoX-2b"
export CACHE_PATH="~/.cache"
export DATASET_PATH="/notebooks/Disney-VideoGeneration-Dataset"
export OUTPUT_PATH="cogvideox-lora-single-node"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
# If you are not using 8 GPUs, change num_processes in `accelerate_config_machine_single.yaml` to match your GPU count
accelerate launch --config_file accelerate_config_machine_single.yaml \
train_cogvideox_lora.py \
--gradient_checkpointing \
--pretrained_model_name_or_path $MODEL_PATH \
--cache_dir $CACHE_PATH \
--enable_tiling \
--enable_slicing \
--instance_data_root $DATASET_PATH \
--caption_column prompt.txt \
--video_column videos.txt \
--validation_prompt "DISNEY A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions:::A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance" \
--validation_prompt_separator ::: \
--num_validation_videos 1 \
--validation_epochs 100 \
--seed 42 \
--rank 128 \
--lora_alpha 64 \
--mixed_precision bf16 \
--output_dir $OUTPUT_PATH \
--height 480 \
--width 720 \
--fps 8 \
--max_num_frames 49 \
--skip_frames_start 0 \
--skip_frames_end 0 \
--train_batch_size 1 \
--num_train_epochs 30 \
--checkpointing_steps 1000 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-3 \
--lr_scheduler cosine_with_restarts \
--lr_warmup_steps 200 \
--lr_num_cycles 1 \
--optimizer AdamW \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--max_grad_norm 1.0 \
--allow_tf32 \
--report_to wandb \
--use_8bit_adam
bash finetune_single_rank.sh
Expected behavior / 期待表现
We expect the fine-tuning run to start and proceed past accelerator.prepare() without the scheduler errors above.