[2024-06-12 19:36:07,800] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:09,648] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-06-12 19:36:09,648] [INFO] [runner.py:568:main] cmd = anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None LMFlow/examples/finetune.py --model_name_or_path huggingface/hub/Meta-Llama-3-70B --trust_remote_code 0 --dataset_path LMFlow/data/alpaca/train_conversation --output_dir output_models/finetune --overwrite_output_dir --conversation_template llama3 --num_train_epochs 0.01 --learning_rate 2e-5 --disable_group_texts 1 --block_size 256 --per_device_train_batch_size 1 --deepspeed LMFlow/configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2024-06-12 19:36:11,661] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:12,366] [INFO] [launch.py:138:main] 0 TORCH_NCCL_BLOCKING_WAIT=1
[2024-06-12 19:36:12,366] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-06-12 19:36:12,366] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-06-12 19:36:12,366] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-06-12 19:36:12,366] [INFO] [launch.py:163:main] dist_world_size=2
[2024-06-12 19:36:12,366] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-06-12 19:36:12,419] [INFO] [launch.py:253:main] process 40472 spawned with command: ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1']
[2024-06-12 19:36:12,466] [INFO] [launch.py:253:main] process 40473 spawned with command: ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=1', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1']
[2024-06-12 19:36:17,298] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:17,298] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-06-12 19:36:20,965] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-12 19:36:20,965] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-12 19:36:20,965] [INFO] [comm.py:637:init_distributed] cdb=None
[rank1]: Traceback (most recent call last):
[rank1]: File "LMFlow/examples/finetune.py", line 61, in <module>
[rank1]: main()
[rank1]: File "LMFlow/examples/finetune.py", line 44, in main
[rank1]: model_args, data_args, pipeline_args = parser.parse_args_into_dataclasses()
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
[rank1]: obj = dtype(**inputs)
[rank1]: File "", line 135, in __init__
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 1641, in __post_init__
[rank1]: and (self.device.type == "cpu" and not is_torch_greater_or_equal_than_2_3)
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 2149, in device
[rank1]: return self._setup_devices
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/utils/generic.py", line 59, in __get__
[rank1]: cached = self.fget(obj)
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 2077, in _setup_devices
[rank1]: self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/accelerate/state.py", line 280, in __init__
[rank1]: self.set_device()
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/accelerate/state.py", line 790, in set_device
[rank1]: torch.cuda.set_device(self.device)
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/cuda/__init__.py", line 399, in set_device
[rank1]: torch._C._cuda_setDevice(device)
[rank1]: RuntimeError: CUDA error: invalid device ordinal
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
06/12/2024 19:36:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
warnings.warn(
[2024-06-12 19:36:22,477] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40472
[2024-06-12 19:36:22,531] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40473
[2024-06-12 19:36:22,531] [ERROR] [launch.py:322:sigkill_handler] ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=1', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
This is the launch script I used:

#!/bin/bash
# Please run this script under ${project_id} in project directory of
# https://github.com/shizhediao/llm-ft
#   COMMIT: d5fecf30ba8011067b10cf51fede53a5ab6574e4

export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_BLOCKING_WAIT=1
export CUDA_LAUNCH_BLOCKING=1
export TORCH_USE_CUDA_DSA=1

# Parses arguments
model_name_or_path=huggingface/hub/Meta-Llama-3-70B
dataset_path=LMFlow/data/alpaca/train_conversation
output_dir=output_models/finetune
deepspeed_args="--num_gpus=2 --master_port=11000"
conversation_template=llama3

# Safety related arguments
trust_remote_code=0

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --conversation_template)
      conversation_template="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    --trust_remote_code)
      trust_remote_code="$2"
      shift
      ;;
    *)
      echo "error: unknown option "${key}"" 1>&2
      exit 1
  esac
  shift
done

# Finetune
exp_id=finetune
project_dir=$(cd "$(dirname $0)"/..; pwd)
log_dir=${project_dir}/log/${exp_id}
mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  LMFlow/examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --conversation_template ${conversation_template} \
    --num_train_epochs 0.01 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 256 \
    --per_device_train_batch_size 1 \
    --deepspeed LMFlow/configs/ds_config_zero3.json \
    --fp16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
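For context, a quick way to check how many CUDA devices are actually visible before launching (generic commands, not part of the original report; the output depends on the machine):

nvidia-smi -L                                                    # physical GPUs on the node
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<not set>}"   # any restriction inherited from the shell
python -c "import torch; print(torch.cuda.device_count())"       # devices PyTorch can address in this environment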
How can I fix this problem?
This looks like a CUDA device mismatch issue.
[rank1]: RuntimeError: CUDA error: invalid device ordinal
I guess you've set CUDA_VISIBLE_DEVICES somewhere else accidentally, which leads to a mismatch. Maybe look at: https://stackoverflow.com/questions/64334033/how-to-solve-runtimeerror-cuda-error-invalid-device-ordinal Or, try changing:
deepspeed_args="--num_gpus=2 --master_port=11000"
to
deepspeed_args="--include localhost:x,x --master_port=11000"
Thanks, that solved my problem.