(lmflow_train) root@duxact:/data/projects/lmflow/LMFlow# ./scripts/run_finetune.sh \
  --model_name_or_path /data/guihunmodel8.8B \
  --dataset_path /data/projects/lmflow/case_report_data \
  --output_model_path /data/projects/lmflow/guihun_fintune_model
[2024-05-22 15:23:02,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:05,346] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-05-22 15:23:05,346] [INFO] [runner.py:555:main] cmd = /root/anaconda3/envs/lmflow_train/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path /data/guihunmodel8.8B --trust_remote_code 0 --dataset_path /data/projects/lmflow/case_report_data --output_dir /data/projects/lmflow/guihun_fintune_model --overwrite_output_dir --conversation_template llama2 --num_train_epochs 0.01 --learning_rate 2e-5 --disable_group_texts 1 --block_size 256 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2024-05-22 15:23:07,178] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:08,889] [INFO] [launch.py:138:main] 0 NCCL_P2P_DISABLE=1
[2024-05-22 15:23:08,889] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=1
[2024-05-22 15:23:08,889] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2024-05-22 15:23:08,889] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=3, node_rank=0
[2024-05-22 15:23:08,889] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2024-05-22 15:23:08,889] [INFO] [launch.py:163:main] dist_world_size=3
[2024-05-22 15:23:08,889] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2024-05-22 15:23:12,326] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:12,845] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:12,878] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-05-22 15:23:15,313] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,313] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,317] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,318] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,368] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,368] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,368] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1,distributed training: True, 16-bits training: True
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1,distributed training: True, 16-bits training: True
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
[WARNING|logging.py:314] 2024-05-22 15:23:18,032 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-05-22 15:23:18,186 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-05-22 15:23:18,236 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-05-22 15:23:20,000] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 8.03B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00, 3.00s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00, 3.00s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00, 3.06s/it]
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu121/cpu_adam...
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/lmflow_train/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -march=native -fopenmp -D__AVX512__ -D__DISABLE_CUDA__ -c /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/2] c++ cpu_adam.o -shared -fopenmp -L/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.286750555038452 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.286848306655884 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.370280504226685 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2024-05-22 15:36:23,345] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806929
[2024-05-22 15:36:23,707] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806930
[2024-05-22 15:36:28,465] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806931
[2024-05-22 15:36:33,281] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/lmflow_train/bin/python', '-u', 'examples/finetune.py', '--local_rank=2', '--model_name_or_path', '/data/guihunmodel8.8B', '--trust_remote_code', '0', '--dataset_path', '/data/projects/lmflow/case_report_data', '--output_dir', '/data/projects/lmflow/guihun_fintune_model', '--overwrite_output_dir', '--conversation_template', 'llama2', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -9
Thanks for your interest in LMFlow! It seems that your system-installed CUDA and the CUDA version your torch build was compiled against do not match, which is why DeepSpeed falls back to a CPU-only build of the cpu_adam op in your log.
You may refer to: microsoft/DeepSpeed#3613
Feel free to leave a comment if you need further help.
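For a quick sanity check, you can compare the CUDA toolkit visible on the system with the CUDA version your torch wheel was built against (a minimal sketch; whether nvcc and ds_report are on your PATH depends on your setup):

# CUDA toolkit version seen by the system compiler
nvcc --version | grep release
# CUDA version torch was compiled against, and whether torch can see the GPUs
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
# DeepSpeed's built-in environment/compatibility report
ds_report

If the two CUDA versions disagree, JIT-compiled ops such as cpu_adam can only be built without CUDA support, as shown in your log. Note also that "exits with return code = -9" means the rank processes were killed with SIGKILL, which is frequently the kernel OOM killer; running "dmesg | grep -i 'killed process'" on the host can help confirm whether host memory pressure (e.g. from ZeRO-3 CPU offload of an 8B model) also contributed.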