Trouble finetuning chatglm: ZeroDivisionError: integer division or modulo by zero #509

Open
Originlightwkp opened this issue Jul 18, 2024 · 1 comment

Originlightwkp commented Jul 18, 2024

System Info

CUDA: 12.1
PyTorch: 2.3.1
Python: 3.10
GPU: 4× A800 (4 × 80 GB)
Ubuntu: 22.04
apex is OK

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Reproduction

I am using my own data for finetuning, and I updated dataset.py.

My dataset.py:
import os
import logging
import random
import json
import jsonlines
from io import BytesIO
from PIL import Image
from torch.utils.data import Dataset
from sat.helpers import print_rank0

captions_file = '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/captions.json'

# Load the captions.json file
with open(captions_file, 'r', encoding='utf-8') as file:
    captions = json.load(file)

# Look up and return the caption for a given image filename
def find_caption_by_filename(filename, captions_dict):
    # Check whether the filename is in captions_dict
    if filename in captions_dict:
        # Return the corresponding caption
        return captions_dict[filename]
    else:
        # If the filename is not present, return None (or an error message)
        return None  # or "Description not found for this filename."

def find_all_files(path, suffix=".jpg"):
    target_files = []
    for cur_dir, _, files in os.walk(path, followlinks=True):
        for f in files:
            if f.endswith(suffix):
                target_files.append(os.path.join(cur_dir, f))
    print_rank0(f'find {len(target_files)} files...')
    return target_files

class ItemDataset(Dataset):
    def __init__(self, image_processor, text_processor, args, data_dirs, cross_image_processor=None, **kwargs):
        super().__init__()
        self.data = self.load_data(data_dirs)
        self.image_processor, self.text_processor, self.cross_image_processor = image_processor, text_processor, cross_image_processor

    def process_img(self, img):
        img_dict = {'vision': self.image_processor(img)}
        if self.cross_image_processor:
            img_dict.update({'cross': self.cross_image_processor(img)})
        return img_dict

    def process_text(self, answer, prompt):
        return self.text_processor(answer, prompt)

    def load_data(self, data_dir):
        all_files = find_all_files(data_dir, suffix=".jpg")
        print_rank0(f"find {len(all_files)} samples in all...")
        return all_files

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        data = self.data[index]
        # img
        try:
            img = Image.open(data).convert('RGB')
        except Exception as e:
            print_rank0(e, level=logging.WARNING)
            return {}
        img_dict = self.process_img(img)
        # text
        # label = data.split('/')[-1].split('.')[0]
        label = find_caption_by_filename(data, captions)
        # uni_key = label  # unique id
        uni_key = random.randint(0, 100000)  # use a random number instead (dataset expanded 2x)
        text_dict = self.process_text(label, "CLOTH:")
        if text_dict is None:
            print_rank0(f"Process text failed. Please check the max_target_length & max_source_length.\n The data is {data}", level=logging.WARNING)
            return {}
        # other attr
        ret = {**img_dict, **text_dict, "question_id": uni_key}
        return ret
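
One behavioral detail of this dataset, illustrated with a made-up filename and caption: __getitem__ passes the full path returned by find_all_files into find_caption_by_filename, so the keys in captions.json have to use the same form, otherwise the lookup returns None. A small sketch (assumes find_caption_by_filename from the file above is importable):

# Hypothetical example of the lookup performed in __getitem__ (filename and caption are made up)
demo_captions = {"./archive_split/train/img_001.jpg": "a red dress"}
print(find_caption_by_filename("./archive_split/train/img_001.jpg", demo_captions))  # -> "a red dress"
print(find_caption_by_filename("img_001.jpg", demo_captions))                        # -> None (key form must match)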

My script:
#! /bin/bash
export PATH=/GLOBALFS/dhu_mbzhao_1/cuda/bin:$PATH
export LD_LIBRARY_PATH=/GLOBALFS/dhu_mbzhao_1/cuda/lib64:$LD_LIBRARY_PATH

NUM_GPUS_PER_WORKER=4
MP_SIZE=1

script_path=$(realpath $0)
script_dir=$(dirname $script_path)
main_dir=$(dirname $script_dir)
MODEL_TYPE="cogvlm-chat-v1.1"
VERSION="base"
MODEL_ARGS="--from_pretrained $MODEL_TYPE
--max_length 1288
--lora_rank 10
--use_lora
--local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5
--version $VERSION"

# Tips: If training models of resolution 244, you can set --max_length smaller

OPTIONS_SAT="SAT_HOME=/GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models"
OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=$NUM_GPUS_PER_WORKER"
HOST_FILE_PATH="hostfile"

train_data="./archive_split/train"
valid_data="./archive_split/valid"

gpt_options="
--experiment-name finetune-$MODEL_TYPE
--model-parallel-size ${MP_SIZE}
--mode finetune
--train-iters 800
--resume-dataloader
$MODEL_ARGS
--train-data ${train_data}
--valid-data ${valid_data}
--distributed-backend nccl
--lr-decay-style cosine
--warmup .02
--checkpoint-activations
--vit_checkpoint_activations
--save-interval 200
--eval-interval 200
--save "./checkpoints"
--eval-iters 10
--eval-batch-size 1
--split 1.
--deepspeed_config test_config_bf16.json
--skip-init
--seed 2023
"

run_cmd="${OPTIONS_NCCL} ${OPTIONS_SAT} deepspeed --master_port 16666 --hostfile ${HOST_FILE_PATH} finetune_cogvlm_demo.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}

set +x
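
One thing worth checking before launching: --train-data above is a relative path, so it has to resolve from the directory the script is launched from (finetune_demo in the log below). A minimal standalone check (a sketch, not part of the repo) that mirrors the os.walk/suffix logic of find_all_files in dataset.py:

import os

# Hypothetical sanity check: count the .jpg files the dataset loader would see.
train_dir = "./archive_split/train"  # same value as --train-data in the script above
count = sum(
    1
    for cur_dir, _, files in os.walk(train_dir, followlinks=True)
    for f in files
    if f.endswith(".jpg")
)
print(f"{count} .jpg files found under {os.path.abspath(train_dir)}")

If this prints 0 when run from the launch directory, the training job will see an empty dataset.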

Below is the log:

(cogvlm) dhu_mbzhao_1@deeplearning-v191204-deeplearn:~/CogVLM-main/finetune_demo$ sh finetune_cogvlm_lora.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=4 SAT_HOME=/GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models deepspeed --master_port 16666 --hostfile hostfile finetune_cogvlm_demo.py --experiment-name finetune-cogvlm-chat-v1.1 --model-parallel-size 1 --mode finetune --train-iters 800 --resume-dataloader --from_pretrained cogvlm-chat-v1.1 --max_length 1288 --lora_rank 10 --use_lora --local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5 --version base --train-data ./archive_split/train --valid-data ./archive_split/valid --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --vit_checkpoint_activations --save-interval 200 --eval-interval 200 --save ./checkpoints --eval-iters 10 --eval-batch-size 1 --split 1. --deepspeed_config test_config_bf16.json --skip-init --seed 2023
[2024-07-18 15:03:39,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-18 15:03:40,797] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-07-18 15:03:40,797] [INFO] [runner.py:568:main] cmd = /GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_cogvlm_demo.py --experiment-name finetune-cogvlm-chat-v1.1 --model-parallel-size 1 --mode finetune --train-iters 800 --resume-dataloader --from_pretrained cogvlm-chat-v1.1 --max_length 1288 --lora_rank 10 --use_lora --local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5 --version base --train-data ./archive_split/train --valid-data ./archive_split/valid --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --vit_checkpoint_activations --save-interval 200 --eval-interval 200 --save ./checkpoints --eval-iters 10 --eval-batch-size 1 --split 1. --deepspeed_config test_config_bf16.json --skip-init --seed 2023
[2024-07-18 15:03:42,018] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_DEBUG=info
[2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_IB_DISABLE=0
[2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_NET_GDR_LEVEL=2
[2024-07-18 15:03:43,636] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-07-18 15:03:43,636] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-07-18 15:03:43,636] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-07-18 15:03:43,636] [INFO] [launch.py:164:main] dist_world_size=4
[2024-07-18 15:03:43,636] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56061 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=0', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56062 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=1', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56063 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=2', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:43,638] [INFO] [launch.py:256:main] process 56064 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=3', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023']
[2024-07-18 15:03:44,906] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-18 15:03:44,968] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-18 15:03:44,971] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-18 15:03:44,972] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-18 15:03:49,192] [INFO] using world size: 4 and model-parallel size: 1
[2024-07-18 15:03:49,192] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
[2024-07-18 15:03:49,192] [INFO] Will override arguments with manually specified deepspeed_config!
[2024-07-18 15:03:49,326] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,331] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,353] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,361] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-07-18 15:03:49,363] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-18 15:03:49,366] [INFO] [checkpointing.py:1048:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2024-07-18 15:03:49,369] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 4741 and data parallel seed: 2023
[2024-07-18 15:03:49,372] [INFO] [RANK 0] building FineTuneTrainCogVLMModel model ...
[2024-07-18 15:03:59,465] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 17639685376
[2024-07-18 15:04:54,090] [INFO] [RANK 0] global rank 0 is loading checkpoint /GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models/cogvlm-chat-v1.1/1/mp_rank_00_model_states.pt
[2024-07-18 15:05:43,077] [INFO] [RANK 0] > successfully loaded /GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models/cogvlm-chat-v1.1/1/mp_rank_00_model_states.pt
[2024-07-18 15:05:44,114] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-07-18 15:05:44,864] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-07-18 15:05:45,654] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-07-18 15:05:46,351] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-07-18 15:05:47,077] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-07-18 15:05:47,871] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-07-18 15:05:48,692] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-07-18 15:05:49,551] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-07-18 15:05:50,375] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-07-18 15:05:51,153] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-07-18 15:05:51,949] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-07-18 15:05:52,892] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-07-18 15:05:53,677] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-07-18 15:05:54,587] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-07-18 15:05:55,295] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-07-18 15:05:56,079] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-07-18 15:05:56,938] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-07-18 15:05:57,762] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-07-18 15:05:58,654] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-07-18 15:05:59,468] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-07-18 15:06:00,300] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-07-18 15:06:01,055] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-07-18 15:06:02,043] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-07-18 15:06:02,786] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-07-18 15:06:03,570] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-07-18 15:06:04,406] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-07-18 15:06:05,249] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-07-18 15:06:06,080] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-07-18 15:06:06,862] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-07-18 15:06:08,048] [INFO] [RANK 0] replacing layer 29 attention with lora
[2024-07-18 15:06:08,829] [INFO] [RANK 0] replacing layer 30 attention with lora
[2024-07-18 15:06:09,577] [INFO] [RANK 0] replacing layer 31 attention with lora
[2024-07-18 15:06:10,367] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-07-18 15:06:10,480] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-07-18 15:06:10,589] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-07-18 15:06:10,832] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-07-18 15:06:11,036] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-07-18 15:06:11,243] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-07-18 15:06:11,437] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-07-18 15:06:11,644] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-07-18 15:06:11,851] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-07-18 15:06:12,125] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-07-18 15:06:12,333] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-07-18 15:06:12,469] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-07-18 15:06:12,655] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-07-18 15:06:12,857] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-07-18 15:06:13,064] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-07-18 15:06:13,325] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-07-18 15:06:13,541] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-07-18 15:06:13,763] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-07-18 15:06:14,028] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-07-18 15:06:14,241] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-07-18 15:06:14,443] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-07-18 15:06:14,642] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-07-18 15:06:14,843] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-07-18 15:06:15,035] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-07-18 15:06:15,226] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-07-18 15:06:15,443] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-07-18 15:06:15,626] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-07-18 15:06:15,832] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-07-18 15:06:15,997] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-07-18 15:06:16,190] [INFO] [RANK 0] replacing layer 29 attention with lora
[2024-07-18 15:06:16,437] [INFO] [RANK 0] replacing layer 30 attention with lora
[2024-07-18 15:06:16,639] [INFO] [RANK 0] replacing layer 31 attention with lora
[2024-07-18 15:06:16,846] [INFO] [RANK 0] replacing layer 32 attention with lora
[2024-07-18 15:06:17,052] [INFO] [RANK 0] replacing layer 33 attention with lora
[2024-07-18 15:06:17,250] [INFO] [RANK 0] replacing layer 34 attention with lora
[2024-07-18 15:06:17,453] [INFO] [RANK 0] replacing layer 35 attention with lora
[2024-07-18 15:06:17,652] [INFO] [RANK 0] replacing layer 36 attention with lora
[2024-07-18 15:06:17,926] [INFO] [RANK 0] replacing layer 37 attention with lora
[2024-07-18 15:06:18,139] [INFO] [RANK 0] replacing layer 38 attention with lora
[2024-07-18 15:06:18,348] [INFO] [RANK 0] replacing layer 39 attention with lora
[2024-07-18 15:06:18,540] [INFO] [RANK 0] replacing layer 40 attention with lora
[2024-07-18 15:06:18,741] [INFO] [RANK 0] replacing layer 41 attention with lora
[2024-07-18 15:06:18,934] [INFO] [RANK 0] replacing layer 42 attention with lora
[2024-07-18 15:06:19,126] [INFO] [RANK 0] replacing layer 43 attention with lora
[2024-07-18 15:06:19,346] [INFO] [RANK 0] replacing layer 44 attention with lora
[2024-07-18 15:06:19,545] [INFO] [RANK 0] replacing layer 45 attention with lora
[2024-07-18 15:06:19,745] [INFO] [RANK 0] replacing layer 46 attention with lora
[2024-07-18 15:06:19,930] [INFO] [RANK 0] replacing layer 47 attention with lora
[2024-07-18 15:06:20,122] [INFO] [RANK 0] replacing layer 48 attention with lora
[2024-07-18 15:06:20,327] [INFO] [RANK 0] replacing layer 49 attention with lora
[2024-07-18 15:06:20,534] [INFO] [RANK 0] replacing layer 50 attention with lora
[2024-07-18 15:06:20,733] [INFO] [RANK 0] replacing layer 51 attention with lora
[2024-07-18 15:06:20,970] [INFO] [RANK 0] replacing layer 52 attention with lora
[2024-07-18 15:06:21,163] [INFO] [RANK 0] replacing layer 53 attention with lora
[2024-07-18 15:06:21,424] [INFO] [RANK 0] replacing layer 54 attention with lora
[2024-07-18 15:06:21,643] [INFO] [RANK 0] replacing layer 55 attention with lora
[2024-07-18 15:06:21,842] [INFO] [RANK 0] replacing layer 56 attention with lora
[2024-07-18 15:06:22,030] [INFO] [RANK 0] replacing layer 57 attention with lora
[2024-07-18 15:06:22,230] [INFO] [RANK 0] replacing layer 58 attention with lora
[2024-07-18 15:06:22,433] [INFO] [RANK 0] replacing layer 59 attention with lora
[2024-07-18 15:06:22,580] [INFO] [RANK 0] replacing layer 60 attention with lora
[2024-07-18 15:06:22,780] [INFO] [RANK 0] replacing layer 61 attention with lora
[2024-07-18 15:06:23,041] [INFO] [RANK 0] replacing layer 62 attention with lora
[2024-07-18 15:06:23,776] [INFO] [RANK 0] find 0 files...
[2024-07-18 15:06:23,776] [INFO] [RANK 0] find 0 samples in all...
[rank3]: Traceback (most recent call last):
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in
[rank3]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank3]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank3]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank3]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank3]: ZeroDivisionError: integer division or modulo by zero
[rank0]: Traceback (most recent call last):
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in
[rank0]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank0]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank0]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank0]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank0]: ZeroDivisionError: integer division or modulo by zero
[rank2]: Traceback (most recent call last):
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in
[rank2]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank2]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank2]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank2]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank2]: ZeroDivisionError: integer division or modulo by zero
[rank1]: Traceback (most recent call last):
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in
[rank1]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank1]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank1]: train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank1]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank1]: ZeroDivisionError: integer division or modulo by zero
[2024-07-18 15:06:25,946] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56061
[2024-07-18 15:06:25,949] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56062
[2024-07-18 15:06:25,952] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56063
[2024-07-18 15:06:25,952] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56064
[2024-07-18 15:06:25,954] [ERROR] [launch.py:325:sigkill_handler] ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=3', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023'] exits with return code = 1

The divisor here is 0, and I don't know how to fix it. Could someone help me?
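
For reference, the failing line in sat/data_utils/configure_data.py (quoted in the traceback) integer-divides by len(ds), and the log above shows "find 0 files..." / "find 0 samples in all...", so the train dataset is empty. A minimal sketch of that arithmetic with this run's values (batch size and gradient accumulation are placeholder numbers; only the zero-length dataset matters):

# Illustrative values only; the formula is copied from the traceback above.
train_iters = 800                 # --train-iters
batch_size = 4                    # hypothetical per-GPU batch size from the deepspeed config
gradient_accumulation_steps = 1   # hypothetical
world_size = 4                    # 4 GPUs
ds_len = 0                        # "find 0 samples in all..." -> empty ItemDataset

scale = max(200, 1 + (train_iters * batch_size * gradient_accumulation_steps * world_size) // ds_len)
# -> ZeroDivisionError: integer division or modulo by zero

The ZeroDivisionError itself should disappear once --train-data points at a directory in which find_all_files actually matches .jpg files.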

Expected behavior

Finetuning runs successfully.

@Shawnzheng011019

I have the same problem and I don't know how to solve it either.
