System Info / 系統信息
CUDA: 12.1
PyTorch: 2.3.1
Python: 3.10
GPU: 4 × A800 (4 × 80 GB)
Ubuntu: 22.04
apex is OK
Who can help? / 谁可以帮助到您?
No response
Information / 问题信息
The official example scripts / 官方的示例脚本
My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
I am fine-tuning with my own data and have updated dataset.py.
My dataset.py:
import os
import logging
import random
import jsonlines
import json
from io import BytesIO
from PIL import Image
from torch.utils.data import Dataset
from sat.helpers import print_rank0

captions_file = '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/captions.json'

# Load the captions.json file
with open(captions_file, 'r', encoding='utf-8') as file:
    captions = json.load(file)

# Look up and return the caption for a given image filename
def find_caption_by_filename(filename, captions_dict):
    # Check whether the filename is in captions_dict
    if filename in captions_dict:
        # Return the matching caption
        return captions_dict[filename]
    else:
        # If the filename is not present, return None or an error message
        return None  # or "Description not found for this filename."

def find_all_files(path, suffix=".jpg"):
    target_files = []
    for cur_dir, _, files in os.walk(path, followlinks=True):
        for f in files:
            if f.endswith(suffix):
                target_files.append(os.path.join(cur_dir, f))
    print_rank0(f'find {len(target_files)} files...')
    return target_files

class ItemDataset(Dataset):
    def __init__(self, image_processor, text_processor, args, data_dirs, cross_image_processor=None, **kwargs):
        super().__init__()
        self.data = self.load_data(data_dirs)
        self.image_processor, self.text_processor, self.cross_image_processor = image_processor, text_processor, cross_image_processor
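For reference, a minimal sanity check of the helpers above (just a sketch; it assumes it is run from the finetune_demo working directory and that captions.json is keyed by bare filenames, neither of which is confirmed here):

# Illustrative sketch only: verify that images are discovered and a caption can be looked up.
# "./archive_split/train" is the same relative path passed as --train-data in the script below.
if __name__ == "__main__":
    files = find_all_files("./archive_split/train", suffix=".jpg")
    print(f"discovered {len(files)} .jpg files")
    if files:
        name = os.path.basename(files[0])  # assumes captions.json keys are bare filenames
        print(name, "->", find_caption_by_filename(name, captions))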
My script (finetune_cogvlm_lora.sh):
#! /bin/bash
export PATH=/GLOBALFS/dhu_mbzhao_1/cuda/bin:$PATH
export LD_LIBRARY_PATH=/GLOBALFS/dhu_mbzhao_1/cuda/lib64:$LD_LIBRARY_PATH

NUM_GPUS_PER_WORKER=4
MP_SIZE=1

script_path=$(realpath $0)
script_dir=$(dirname $script_path)
main_dir=$(dirname $script_dir)

MODEL_TYPE="cogvlm-chat-v1.1"
VERSION="base"
MODEL_ARGS="--from_pretrained $MODEL_TYPE \
    --max_length 1288 \
    --lora_rank 10 \
    --use_lora \
    --local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5 \
    --version $VERSION"
# Tips: If training models of resolution 244, you can set --max_length smaller
OPTIONS_SAT="SAT_HOME=/GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models"
OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=$NUM_GPUS_PER_WORKER"
HOST_FILE_PATH="hostfile"

train_data="./archive_split/train"
valid_data="./archive_split/valid"

gpt_options=" \
    --experiment-name finetune-$MODEL_TYPE \
    --model-parallel-size ${MP_SIZE} \
    --mode finetune \
    --train-iters 800 \
    --resume-dataloader \
    $MODEL_ARGS \
    --train-data ${train_data} \
    --valid-data ${valid_data} \
    --distributed-backend nccl \
    --lr-decay-style cosine \
    --warmup .02 \
    --checkpoint-activations \
    --vit_checkpoint_activations \
    --save-interval 200 \
    --eval-interval 200 \
    --save "./checkpoints" \
    --eval-iters 10 \
    --eval-batch-size 1 \
    --split 1. \
    --deepspeed_config test_config_bf16.json \
    --skip-init \
    --seed 2023 \
"

run_cmd="${OPTIONS_NCCL} ${OPTIONS_SAT} deepspeed --master_port 16666 --hostfile ${HOST_FILE_PATH} finetune_cogvlm_demo.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}

set +x
Below is the log:
(cogvlm) dhu_mbzhao_1@deeplearning-v191204-deeplearn:~/CogVLM-main/finetune_demo$ sh finetune_cogvlm_lora.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=4 SAT_HOME=/GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models deepspeed --master_port 16666 --hostfile hostfile finetune_cogvlm_demo.py --experiment-name finetune-cogvlm-chat-v1.1 --model-parallel-size 1 --mode finetune --train-iters 800 --resume-dataloader --from_pretrained cogvlm-chat-v1.1 --max_length 1288 --lora_rank 10 --use_lora --local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5 --version base --train-data ./archive_split/train --valid-data ./archive_split/valid --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --vit_checkpoint_activations --save-interval 200 --eval-interval 200 --save ./checkpoints --eval-iters 10 --eval-batch-size 1 --split 1. --deepspeed_config test_config_bf16.json --skip-init --seed 2023
[2024-07-18 15:03:39,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-18 15:03:40,797] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-07-18 15:03:40,797] [INFO] [runner.py:568:main] cmd = /GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_cogvlm_demo.py --experiment-name finetune-cogvlm-chat-v1.1 --model-parallel-size 1 --mode finetune --train-iters 800 --resume-dataloader --from_pretrained cogvlm-chat-v1.1 --max_length 1288 --lora_rank 10 --use_lora --local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5 --version base --train-data ./archive_split/train --valid-data ./archive_split/valid --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --vit_checkpoint_activations --save-interval 200 --eval-interval 200 --save ./checkpoints --eval-iters 10 --eval-batch-size 1 --split 1. --deepspeed_config test_config_bf16.json --skip-init --seed 2023
[2024-07-18 15:03:42,018] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_DEBUG=info
[2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_IB_DISABLE=0
[2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_NET_GDR_LEVEL=2
[2024-07-18 15:03:43,636] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-07-18 15:03:43,636] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-07-18 15:03:43,636] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-07-18 15:03:43,636] [INFO] [launch.py:164:main] dist_world_size=4
[2024-07-18 15:03:43,636] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56061 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=0', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56062 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=1', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56063 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=2', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:43,638] [INFO] [launch.py:256:main] process 56064 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=3', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:44,906] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-18 15:03:44,968] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-18 15:03:44,971] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-18 15:03:44,972] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-18 15:03:49,192] [INFO] using world size: 4 and model-parallel size: 1
[2024-07-18 15:03:49,192] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
[2024-07-18 15:03:49,192] [INFO] Will override arguments with manually specified deepspeed_config!
[2024-07-18 15:03:49,326] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,331] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,353] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,361] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-07-18 15:03:49,363] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,366] [INFO] [checkpointing.py:1048:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2024-07-18 15:03:49,369] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 4741 and data parallel seed: 2023
[2024-07-18 15:03:49,372] [INFO] [RANK 0] building FineTuneTrainCogVLMModel model ...
[2024-07-18 15:03:59,465] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 17639685376
[2024-07-18 15:04:54,090] [INFO] [RANK 0] global rank 0 is loading checkpoint /GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models/cogvlm-chat-v1.1/1/mp_rank_00_model_states.pt
[2024-07-18 15:05:43,077] [INFO] [RANK 0] > successfully loaded /GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models/cogvlm-chat-v1.1/1/mp_rank_00_model_states.pt
[2024-07-18 15:05:44,114] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-07-18 15:05:44,864] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-07-18 15:05:45,654] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-07-18 15:05:46,351] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-07-18 15:05:47,077] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-07-18 15:05:47,871] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-07-18 15:05:48,692] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-07-18 15:05:49,551] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-07-18 15:05:50,375] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-07-18 15:05:51,153] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-07-18 15:05:51,949] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-07-18 15:05:52,892] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-07-18 15:05:53,677] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-07-18 15:05:54,587] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-07-18 15:05:55,295] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-07-18 15:05:56,079] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-07-18 15:05:56,938] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-07-18 15:05:57,762] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-07-18 15:05:58,654] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-07-18 15:05:59,468] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-07-18 15:06:00,300] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-07-18 15:06:01,055] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-07-18 15:06:02,043] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-07-18 15:06:02,786] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-07-18 15:06:03,570] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-07-18 15:06:04,406] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-07-18 15:06:05,249] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-07-18 15:06:06,080] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-07-18 15:06:06,862] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-07-18 15:06:08,048] [INFO] [RANK 0] replacing layer 29 attention with lora
[2024-07-18 15:06:08,829] [INFO] [RANK 0] replacing layer 30 attention with lora
[2024-07-18 15:06:09,577] [INFO] [RANK 0] replacing layer 31 attention with lora
[2024-07-18 15:06:10,367] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-07-18 15:06:10,480] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-07-18 15:06:10,589] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-07-18 15:06:10,832] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-07-18 15:06:11,036] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-07-18 15:06:11,243] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-07-18 15:06:11,437] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-07-18 15:06:11,644] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-07-18 15:06:11,851] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-07-18 15:06:12,125] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-07-18 15:06:12,333] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-07-18 15:06:12,469] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-07-18 15:06:12,655] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-07-18 15:06:12,857] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-07-18 15:06:13,064] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-07-18 15:06:13,325] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-07-18 15:06:13,541] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-07-18 15:06:13,763] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-07-18 15:06:14,028] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-07-18 15:06:14,241] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-07-18 15:06:14,443] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-07-18 15:06:14,642] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-07-18 15:06:14,843] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-07-18 15:06:15,035] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-07-18 15:06:15,226] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-07-18 15:06:15,443] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-07-18 15:06:15,626] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-07-18 15:06:15,832] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-07-18 15:06:15,997] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-07-18 15:06:16,190] [INFO] [RANK 0] replacing layer 29 attention with lora
[2024-07-18 15:06:16,437] [INFO] [RANK 0] replacing layer 30 attention with lora
[2024-07-18 15:06:16,639] [INFO] [RANK 0] replacing layer 31 attention with lora
[2024-07-18 15:06:16,846] [INFO] [RANK 0] replacing layer 32 attention with lora
[2024-07-18 15:06:17,052] [INFO] [RANK 0] replacing layer 33 attention with lora
[2024-07-18 15:06:17,250] [INFO] [RANK 0] replacing layer 34 attention with lora
[2024-07-18 15:06:17,453] [INFO] [RANK 0] replacing layer 35 attention with lora
[2024-07-18 15:06:17,652] [INFO] [RANK 0] replacing layer 36 attention with lora
[2024-07-18 15:06:17,926] [INFO] [RANK 0] replacing layer 37 attention with lora
[2024-07-18 15:06:18,139] [INFO] [RANK 0] replacing layer 38 attention with lora
[2024-07-18 15:06:18,348] [INFO] [RANK 0] replacing layer 39 attention with lora
[2024-07-18 15:06:18,540] [INFO] [RANK 0] replacing layer 40 attention with lora
[2024-07-18 15:06:18,741] [INFO] [RANK 0] replacing layer 41 attention with lora
[2024-07-18 15:06:18,934] [INFO] [RANK 0] replacing layer 42 attention with lora
[2024-07-18 15:06:19,126] [INFO] [RANK 0] replacing layer 43 attention with lora
[2024-07-18 15:06:19,346] [INFO] [RANK 0] replacing layer 44 attention with lora
[2024-07-18 15:06:19,545] [INFO] [RANK 0] replacing layer 45 attention with lora
[2024-07-18 15:06:19,745] [INFO] [RANK 0] replacing layer 46 attention with lora
[2024-07-18 15:06:19,930] [INFO] [RANK 0] replacing layer 47 attention with lora
[2024-07-18 15:06:20,122] [INFO] [RANK 0] replacing layer 48 attention with lora
[2024-07-18 15:06:20,327] [INFO] [RANK 0] replacing layer 49 attention with lora
[2024-07-18 15:06:20,534] [INFO] [RANK 0] replacing layer 50 attention with lora
[2024-07-18 15:06:20,733] [INFO] [RANK 0] replacing layer 51 attention with lora
[2024-07-18 15:06:20,970] [INFO] [RANK 0] replacing layer 52 attention with lora
[2024-07-18 15:06:21,163] [INFO] [RANK 0] replacing layer 53 attention with lora
[2024-07-18 15:06:21,424] [INFO] [RANK 0] replacing layer 54 attention with lora
[2024-07-18 15:06:21,643] [INFO] [RANK 0] replacing layer 55 attention with lora
[2024-07-18 15:06:21,842] [INFO] [RANK 0] replacing layer 56 attention with lora
[2024-07-18 15:06:22,030] [INFO] [RANK 0] replacing layer 57 attention with lora
[2024-07-18 15:06:22,230] [INFO] [RANK 0] replacing layer 58 attention with lora
[2024-07-18 15:06:22,433] [INFO] [RANK 0] replacing layer 59 attention with lora
[2024-07-18 15:06:22,580] [INFO] [RANK 0] replacing layer 60 attention with lora
[2024-07-18 15:06:22,780] [INFO] [RANK 0] replacing layer 61 attention with lora
[2024-07-18 15:06:23,041] [INFO] [RANK 0] replacing layer 62 attention with lora
[2024-07-18 15:06:23,776] [INFO] [RANK 0] find 0 files...
[2024-07-18 15:06:23,776] [INFO] [RANK 0] find 0 samples in all...
[rank3]: Traceback (most recent call last):
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in <module>
[rank3]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank3]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank3]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank3]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank3]: ZeroDivisionError: integer division or modulo by zero
[rank0]: Traceback (most recent call last):
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in <module>
[rank0]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank0]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank0]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank0]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank0]: ZeroDivisionError: integer division or modulo by zero
[rank2]: Traceback (most recent call last):
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in <module>
[rank2]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank2]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank2]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank2]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank2]: ZeroDivisionError: integer division or modulo by zero
[rank1]: Traceback (most recent call last):
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in <module>
[rank1]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank1]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank1]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank1]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank1]: ZeroDivisionError: integer division or modulo by zero
[2024-07-18 15:06:25,946] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56061
[2024-07-18 15:06:25,949] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56062
[2024-07-18 15:06:25,952] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56063
[2024-07-18 15:06:25,952] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56064
[2024-07-18 15:06:25,954] [ERROR] [launch.py:325:sigkill_handler] ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=3', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023'] exits with return code = 1
The divisor here ends up being 0: the dataset comes back empty (the log above shows "find 0 files..." and "find 0 samples in all..."), so len(ds) is 0 and the division fails. I don't know how to solve this.
Could someone help me?
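For context, here is the failing computation from sat/data_utils/configure_data.py (line 139 in the traceback) reproduced in isolation. This is only a sketch: train_iters and world_size come from the command line and log above, len_ds is 0 per the "find 0 samples in all..." message, while batch_size and gradient accumulation are placeholder assumptions since test_config_bf16.json is not shown.

# Isolated reproduction of the failing line; only illustrates why an empty dataset crashes.
train_iters = 800   # --train-iters
world_size = 4      # 4 GPUs, model-parallel size 1
batch_size = 4      # assumption: the real value comes from test_config_bf16.json
grad_accum = 1      # assumption
len_ds = 0          # "find 0 samples in all..."
scale = max(200, 1 + (train_iters * batch_size * grad_accum * world_size) // len_ds)
# -> ZeroDivisionError: integer division or modulo by zero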
Expected behavior / 期待表现
Fine-tuning completes successfully.