Trouble finetuning chatglm: ZeroDivisionError: integer division or modulo by zero #509

Open
Originlightwkp opened this issue Jul 18, 2024 · 1 comment

Originlightwkp commented Jul 18, 2024

System Info

CUDA: 12.1
PyTorch: 2.3.1
Python: 3.10
GPU: 4× A800 (4 × 80 GB)
Ubuntu: 22.04
apex is OK

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Reproduction

I am using my own data for finetuning, and I updated dataset.py.

My dataset.py:
import os
import logging
import random
import json
import jsonlines
from io import BytesIO
from PIL import Image
from torch.utils.data import Dataset
from sat.helpers import print_rank0

captions_file = '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/captions.json'

# Load the captions.json file
with open(captions_file, 'r', encoding='utf-8') as file:
    captions = json.load(file)

# Look up and return the caption for a given image filename
def find_caption_by_filename(filename, captions_dict):
    # Check whether the filename is in captions_dict
    if filename in captions_dict:
        # Return the corresponding caption
        return captions_dict[filename]
    else:
        # If the filename is not present, return None (or an error message)
        return None  # or "Description not found for this filename."

def find_all_files(path, suffix=".jpg"):
    target_files = []
    for cur_dir, _, files in os.walk(path, followlinks=True):
        for f in files:
            if f.endswith(suffix):
                target_files.append(os.path.join(cur_dir, f))
    print_rank0(f'find {len(target_files)} files...')
    return target_files

class ItemDataset(Dataset):
    def __init__(self, image_processor, text_processor, args, data_dirs, cross_image_processor=None, **kwargs):
        super().__init__()
        self.data = self.load_data(data_dirs)
        self.image_processor, self.text_processor, self.cross_image_processor = image_processor, text_processor, cross_image_processor

    def process_img(self, img):
        img_dict = {'vision': self.image_processor(img)}
        if self.cross_image_processor:
            img_dict.update({'cross': self.cross_image_processor(img)})
        return img_dict

    def process_text(self, answer, prompt):
        return self.text_processor(answer, prompt)

    def load_data(self, data_dir):
        all_files = find_all_files(data_dir, suffix=".jpg")
        print_rank0(f"find {len(all_files)} samples in all...")
        return all_files

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        data = self.data[index]
        # img
        try:
            img = Image.open(data).convert('RGB')
        except Exception as e:
            print_rank0(e, level=logging.WARNING)
            return {}
        img_dict = self.process_img(img)
        # text
        # label = data.split('/')[-1].split('.')[0]
        label = find_caption_by_filename(data, captions)
        # uni_key = label  # unique id
        uni_key = random.randint(0, 100000)  # use a random number instead (dataset expanded 2x)
        text_dict = self.process_text(label, "CLOTH:")
        if text_dict is None:
            print_rank0(f"Process text failed. Please check the max_target_length & max_source_length.\n The data is {data}", level=logging.WARNING)
            return {}
        # other attr
        ret = {**img_dict, **text_dict, "question_id": uni_key}
        return ret
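
One behavioral detail of this dataset, illustrated with a made-up filename and caption: __getitem__ passes the full path returned by find_all_files into find_caption_by_filename, so the keys in captions.json have to use the same form, otherwise the lookup returns None. A small sketch (assumes find_caption_by_filename from the file above is importable):

# Hypothetical example of the lookup performed in __getitem__ (filename and caption are made up)
demo_captions = {"./archive_split/train/img_001.jpg": "a red dress"}
print(find_caption_by_filename("./archive_split/train/img_001.jpg", demo_captions))  # -> "a red dress"
print(find_caption_by_filename("img_001.jpg", demo_captions))                        # -> None (key form must match)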

My script:
#! /bin/bash
export PATH=/GLOBALFS/dhu_mbzhao_1/cuda/bin:$PATH
export LD_LIBRARY_PATH=/GLOBALFS/dhu_mbzhao_1/cuda/lib64:$LD_LIBRARY_PATH

NUM_GPUS_PER_WORKER=4
MP_SIZE=1

script_path=$(realpath $0)
script_dir=$(dirname $script_path)
main_dir=$(dirname $script_dir)
MODEL_TYPE="cogvlm-chat-v1.1"
VERSION="base"
MODEL_ARGS="--from_pretrained $MODEL_TYPE
--max_length 1288
--lora_rank 10
--use_lora
--local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5
--version $VERSION"

# Tips: If training models of resolution 244, you can set --max_length smaller

OPTIONS_SAT="SAT_HOME=/GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models"
OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=$NUM_GPUS_PER_WORKER"
HOST_FILE_PATH="hostfile"

train_data="./archive_split/train"
valid_data="./archive_split/valid"

gpt_options="
--experiment-name finetune-$MODEL_TYPE
--model-parallel-size ${MP_SIZE}
--mode finetune
--train-iters 800
--resume-dataloader
$MODEL_ARGS
--train-data ${train_data}
--valid-data ${valid_data}
--distributed-backend nccl
--lr-decay-style cosine
--warmup .02
--checkpoint-activations
--vit_checkpoint_activations
--save-interval 200
--eval-interval 200
--save "./checkpoints"
--eval-iters 10
--eval-batch-size 1
--split 1.
--deepspeed_config test_config_bf16.json
--skip-init
--seed 2023
"

run_cmd="${OPTIONS_NCCL} ${OPTIONS_SAT} deepspeed --master_port 16666 --hostfile ${HOST_FILE_PATH} finetune_cogvlm_demo.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}

set +x
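
One thing worth checking before launching: --train-data above is a relative path, so it has to resolve from the directory the script is launched from (finetune_demo in the log below). A minimal standalone check (a sketch, not part of the repo) that mirrors the os.walk/suffix logic of find_all_files in dataset.py:

import os

# Hypothetical sanity check: count the .jpg files the dataset loader would see.
train_dir = "./archive_split/train"  # same value as --train-data in the script above
count = sum(
    1
    for cur_dir, _, files in os.walk(train_dir, followlinks=True)
    for f in files
    if f.endswith(".jpg")
)
print(f"{count} .jpg files found under {os.path.abspath(train_dir)}")

If this prints 0 when run from the launch directory, the training job will see an empty dataset.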

Below is the log:

(cogvlm) dhu_mbzhao_1@deeplearning-v191204-deeplearn:~/CogVLM-main/finetune_demo$ sh finetune_cogvlm_lora.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=4 SAT_HOME=/GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models deepspeed --master_port 16666 --hostfile hostfile finetune_cogvlm_demo.py --experiment-name finetune-cogvlm-chat-v1.1 --model-parallel-size 1 --mode finetune --train-iters 800 --resume-dataloader --from_pretrained cogvlm-chat-v1.1 --max_length 1288 --lora_rank 10 --use_lora --local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5 --version base --train-data ./archive_split/train --valid-data ./archive_split/valid --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --vit_checkpoint_activations --save-interval 200 --eval-interval 200 --save ./checkpoints --eval-iters 10 --eval-batch-size 1 --split 1. --deepspeed_config test_config_bf16.json --skip-init --seed 2023
[2024-07-18 15:03:39,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-18 15:03:40,797] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-07-18 15:03:40,797] [INFO] [runner.py:568:main] cmd = /GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_cogvlm_demo.py --experiment-name finetune-cogvlm-chat-v1.1 --model-parallel-size 1 --mode finetune --train-iters 800 --resume-dataloader --from_pretrained cogvlm-chat-v1.1 --max_length 1288 --lora_rank 10 --use_lora --local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5 --version base --train-data ./archive_split/train --valid-data ./archive_split/valid --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --vit_checkpoint_activations --save-interval 200 --eval-interval 200 --save ./checkpoints --eval-iters 10 --eval-batch-size 1 --split 1. --deepspeed_config test_config_bf16.json --skip-init --seed 2023
[2024-07-18 15:03:42,018] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_DEBUG=info
[2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_IB_DISABLE=0
[2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_NET_GDR_LEVEL=2
[2024-07-18 15:03:43,636] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-07-18 15:03:43,636] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-07-18 15:03:43,636] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-07-18 15:03:43,636] [INFO] [launch.py:164:main] dist_world_size=4
[2024-07-18 15:03:43,636] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56061 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=0', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56062 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=1', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56063 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=2', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:43,638] [INFO] [launch.py:256:main] process 56064 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=3', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:44,906] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-18 15:03:44,968] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-18 15:03:44,971] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-18 15:03:44,972] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-18 15:03:49,192] [INFO] using world size: 4 and model-parallel size: 1
[2024-07-18 15:03:49,192] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
[2024-07-18 15:03:49,192] [INFO] Will override arguments with manually specified deepspeed_config!
[2024-07-18 15:03:49,326] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,331] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,353] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,361] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-07-18 15:03:49,363] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,366] [INFO] [checkpointing.py:1048:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2024-07-18 15:03:49,369] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 4741 and data parallel seed: 2023
[2024-07-18 15:03:49,372] [INFO] [RANK 0] building FineTuneTrainCogVLMModel model ...
[2024-07-18 15:03:59,465] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 17639685376
[2024-07-18 15:04:54,090] [INFO] [RANK 0] global rank 0 is loading checkpoint /GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models/cogvlm-chat-v1.1/1/mp_rank_00_model_states.pt
[2024-07-18 15:05:43,077] [INFO] [RANK 0] > successfully loaded /GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models/cogvlm-chat-v1.1/1/mp_rank_00_model_states.pt
[2024-07-18 15:05:44,114] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-07-18 15:05:44,864] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-07-18 15:05:45,654] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-07-18 15:05:46,351] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-07-18 15:05:47,077] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-07-18 15:05:47,871] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-07-18 15:05:48,692] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-07-18 15:05:49,551] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-07-18 15:05:50,375] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-07-18 15:05:51,153] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-07-18 15:05:51,949] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-07-18 15:05:52,892] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-07-18 15:05:53,677] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-07-18 15:05:54,587] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-07-18 15:05:55,295] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-07-18 15:05:56,079] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-07-18 15:05:56,938] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-07-18 15:05:57,762] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-07-18 15:05:58,654] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-07-18 15:05:59,468] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-07-18 15:06:00,300] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-07-18 15:06:01,055] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-07-18 15:06:02,043] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-07-18 15:06:02,786] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-07-18 15:06:03,570] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-07-18 15:06:04,406] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-07-18 15:06:05,249] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-07-18 15:06:06,080] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-07-18 15:06:06,862] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-07-18 15:06:08,048] [INFO] [RANK 0] replacing layer 29 attention with lora
[2024-07-18 15:06:08,829] [INFO] [RANK 0] replacing layer 30 attention with lora
[2024-07-18 15:06:09,577] [INFO] [RANK 0] replacing layer 31 attention with lora
[2024-07-18 15:06:10,367] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-07-18 15:06:10,480] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-07-18 15:06:10,589] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-07-18 15:06:10,832] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-07-18 15:06:11,036] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-07-18 15:06:11,243] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-07-18 15:06:11,437] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-07-18 15:06:11,644] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-07-18 15:06:11,851] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-07-18 15:06:12,125] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-07-18 15:06:12,333] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-07-18 15:06:12,469] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-07-18 15:06:12,655] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-07-18 15:06:12,857] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-07-18 15:06:13,064] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-07-18 15:06:13,325] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-07-18 15:06:13,541] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-07-18 15:06:13,763] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-07-18 15:06:14,028] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-07-18 15:06:14,241] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-07-18 15:06:14,443] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-07-18 15:06:14,642] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-07-18 15:06:14,843] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-07-18 15:06:15,035] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-07-18 15:06:15,226] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-07-18 15:06:15,443] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-07-18 15:06:15,626] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-07-18 15:06:15,832] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-07-18 15:06:15,997] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-07-18 15:06:16,190] [INFO] [RANK 0] replacing layer 29 attention with lora
[2024-07-18 15:06:16,437] [INFO] [RANK 0] replacing layer 30 attention with lora
[2024-07-18 15:06:16,639] [INFO] [RANK 0] replacing layer 31 attention with lora
[2024-07-18 15:06:16,846] [INFO] [RANK 0] replacing layer 32 attention with lora
[2024-07-18 15:06:17,052] [INFO] [RANK 0] replacing layer 33 attention with lora
[2024-07-18 15:06:17,250] [INFO] [RANK 0] replacing layer 34 attention with lora
[2024-07-18 15:06:17,453] [INFO] [RANK 0] replacing layer 35 attention with lora
[2024-07-18 15:06:17,652] [INFO] [RANK 0] replacing layer 36 attention with lora
[2024-07-18 15:06:17,926] [INFO] [RANK 0] replacing layer 37 attention with lora
[2024-07-18 15:06:18,139] [INFO] [RANK 0] replacing layer 38 attention with lora
[2024-07-18 15:06:18,348] [INFO] [RANK 0] replacing layer 39 attention with lora
[2024-07-18 15:06:18,540] [INFO] [RANK 0] replacing layer 40 attention with lora
[2024-07-18 15:06:18,741] [INFO] [RANK 0] replacing layer 41 attention with lora
[2024-07-18 15:06:18,934] [INFO] [RANK 0] replacing layer 42 attention with lora
[2024-07-18 15:06:19,126] [INFO] [RANK 0] replacing layer 43 attention with lora
[2024-07-18 15:06:19,346] [INFO] [RANK 0] replacing layer 44 attention with lora
[2024-07-18 15:06:19,545] [INFO] [RANK 0] replacing layer 45 attention with lora
[2024-07-18 15:06:19,745] [INFO] [RANK 0] replacing layer 46 attention with lora
[2024-07-18 15:06:19,930] [INFO] [RANK 0] replacing layer 47 attention with lora
[2024-07-18 15:06:20,122] [INFO] [RANK 0] replacing layer 48 attention with lora
[2024-07-18 15:06:20,327] [INFO] [RANK 0] replacing layer 49 attention with lora
[2024-07-18 15:06:20,534] [INFO] [RANK 0] replacing layer 50 attention with lora
[2024-07-18 15:06:20,733] [INFO] [RANK 0] replacing layer 51 attention with lora
[2024-07-18 15:06:20,970] [INFO] [RANK 0] replacing layer 52 attention with lora
[2024-07-18 15:06:21,163] [INFO] [RANK 0] replacing layer 53 attention with lora
[2024-07-18 15:06:21,424] [INFO] [RANK 0] replacing layer 54 attention with lora
[2024-07-18 15:06:21,643] [INFO] [RANK 0] replacing layer 55 attention with lora
[2024-07-18 15:06:21,842] [INFO] [RANK 0] replacing layer 56 attention with lora
[2024-07-18 15:06:22,030] [INFO] [RANK 0] replacing layer 57 attention with lora
[2024-07-18 15:06:22,230] [INFO] [RANK 0] replacing layer 58 attention with lora
[2024-07-18 15:06:22,433] [INFO] [RANK 0] replacing layer 59 attention with lora
[2024-07-18 15:06:22,580] [INFO] [RANK 0] replacing layer 60 attention with lora
[2024-07-18 15:06:22,780] [INFO] [RANK 0] replacing layer 61 attention with lora
[2024-07-18 15:06:23,041] [INFO] [RANK 0] replacing layer 62 attention with lora
[2024-07-18 15:06:23,776] [INFO] [RANK 0] find 0 files...
[2024-07-18 15:06:23,776] [INFO] [RANK 0] find 0 samples in all...
[rank3]: Traceback (most recent call last):
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in
[rank3]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank3]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank3]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank3]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank3]: ZeroDivisionError: integer division or modulo by zero
[rank0]: Traceback (most recent call last):
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in
[rank0]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank0]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank0]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank0]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank0]: ZeroDivisionError: integer division or modulo by zero
[rank2]: Traceback (most recent call last):
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in
[rank2]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank2]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank2]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank2]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank2]: ZeroDivisionError: integer division or modulo by zero
[rank1]: Traceback (most recent call last):
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in
[rank1]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank1]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank1]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank1]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank1]: ZeroDivisionError: integer division or modulo by zero
[2024-07-18 15:06:25,946] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56061
[2024-07-18 15:06:25,949] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56062
[2024-07-18 15:06:25,952] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56063
[2024-07-18 15:06:25,952] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56064
[2024-07-18 15:06:25,954] [ERROR] [launch.py:325:sigkill_handler] ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=3', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023'] exits with return code = 1

The divisor here is 0, and I don't know how to fix it. Could someone help me?
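
For reference, the failing line in sat/data_utils/configure_data.py (quoted in the traceback) integer-divides by len(ds), and the log above shows "find 0 files..." / "find 0 samples in all...", so the train dataset is empty. A minimal sketch of that arithmetic with this run's values (batch size and gradient accumulation are placeholder numbers; only the zero-length dataset matters):

# Illustrative values only; the formula is copied from the traceback above.
train_iters = 800                 # --train-iters
batch_size = 4                    # hypothetical per-GPU batch size from the deepspeed config
gradient_accumulation_steps = 1   # hypothetical
world_size = 4                    # 4 GPUs
ds_len = 0                        # "find 0 samples in all..." -> empty ItemDataset

scale = max(200, 1 + (train_iters * batch_size * gradient_accumulation_steps * world_size) // ds_len)
# -> ZeroDivisionError: integer division or modulo by zero

The ZeroDivisionError itself should disappear once --train-data points at a directory in which find_all_files actually matches .jpg files.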

Expected behavior

Finetuning runs successfully.

@Shawnzheng011019

I have the same problem and I don't know how to solve it either.
