
v1.3 fine tuning duration too short #516

Open
junsukha opened this issue Oct 29, 2024 · 4 comments

Comments

@junsukha

junsukha commented Oct 29, 2024

Hi,

I'm fine-tuning v1.3 any93x640x640 (https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/any93x640x640) with 352x640 (height, width) videos at 16 fps.

I see that 1 epoch (93 steps) takes only around 4 minutes. Is this expected? That seems too short to me.
I'm using 2 A100 GPUs with a batch size of 1 per GPU.

Below is part of the JSON file that describes the video data.

[
    {
        "path": "/gpfs/vision/drag_video/0_datasets/open-sora-plan/videos/car-centric/encoded/모닝1-seg11.mp4",
        "cap": "A family is seen standing together outdoors, followed by a sleek white car driving smoothly across a modern bridge. The car is highlighted as the \"All New Smart Compact Morning.\"",
        "resolution": {
            "height": 352,
            "width": 640
        },
        "num_frames": 66,
        "fps": 16
    },
    {
        "path": "/gpfs/vision/drag_video/0_datasets/open-sora-plan/videos/car-centric/encoded/베뉴1-seg48.mp4",
        "cap": "Two cars are seen driving on a dimly lit road, with one car passing the other. The scene transitions to a wide shot of a car driving towards a city skyline at dusk, highlighting the vehicle's rear design and branding.",
        "resolution": {
            "height": 352,
            "width": 640
        },
        "num_frames": 78,
        "fps": 16
    },
    {
        "path": "/gpfs/vision/drag_video/0_datasets/open-sora-plan/videos/car-centric/encoded/모닝1-seg05.mp4",
        "cap": "The commercial showcases a sleek, white Kia Morning car, highlighting its modern design and stylish features as it drives through an urban environment. The tagline \"happy new morning\" emphasizes a fresh and positive start with this vehicle.",
        "resolution": {
            "height": 352,
            "width": 640
        },
        "num_frames": 53,
        "fps": 16
    },
    {
        "path": "/gpfs/vision/drag_video/0_datasets/open-sora-plan/videos/car-centric/encoded/k7_1-seg5.mp4",
        "cap": "A sleek, dark-colored sedan is showcased driving smoothly on a modern bridge, highlighting its elegant design and emphasizing its award for being ranked first in the 2014 J.D. Power Initial Quality Study for large cars.",
        "resolution": {
            "height": 352,
            "width": 640
        },
        "num_frames": 36,
        "fps": 16
    },
...
]

Below is part of the terminal output during training.

too_long: 25, too_short: 50
cnt_img_res_mismatch_stride: 0, cnt_vid_res_mismatch_stride: 0
cnt_img_res_too_small: 0, cnt_vid_res_too_small: 0
cnt_img_aspect_mismatch: 0, cnt_vid_aspect_mismatch: 0
cnt_filter_minority: 0
Counter(sample_size): Counter({'33x352x640': 170, '29x352x640': 17})


10/29/2024 02:26:11 - INFO - __main__ -   Num examples = 187
10/29/2024 02:26:11 - INFO - __main__ -   Num Epochs = 1000
10/29/2024 02:26:11 - INFO - __main__ -   Instantaneous batch size per device = 1
10/29/2024 02:26:11 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 2
10/29/2024 02:26:11 - INFO - __main__ -   Gradient Accumulation steps = 1
10/29/2024 02:26:11 - INFO - __main__ -   Total optimization steps = 93000
10/29/2024 02:26:11 - INFO - __main__ -   Total optimization steps (num_update_steps_per_epoch) = 93
10/29/2024 02:26:11 - INFO - __main__ -   Total training parameters = 2.7719816 B
10/29/2024 02:26:11 - INFO - __main__ -   AutoEncoder = WFVAEModel_D8_4x8x8; Dtype = torch.bfloat16; Parameters = 0.147347724 B
10/29/2024 02:26:11 - INFO - __main__ -   Text_enc_1 = /mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl; Dtype = torch.bfloat16; Parameters = 5.65517312 B
Checkpoint 'latest' does not exist. Starting a new training run.

Below are the arguments I used:

            "args": [
                "--config_file",
                "scripts/accelerate_configs/deepspeed_zero2_config.yaml",
                "opensora/train/train_t2v_diffusers.py",
                "--model=OpenSoraT2V_v1_3-2B/122",
                "--text_encoder_name_1=/mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl",
                "--cache_dir=../../cache_dir/",
                "--dataset=t2v",
                "--data=/mnt/singularity_home/jsha/repos/Open-Sora-Plan/fine_tuning/data.txt",
                "--ae=WFVAEModel_D8_4x8x8",
                "--ae_path",
                "/gpfs/vision/drag_video/HF_downloads/Open-Sora-Plan-v1.3.0/vae",
                "--sample_rate",
                "1",
                "--num_frames",
                "33",
                "--max_height",
                "352",
                "--max_width",
                "640",
                "--interpolation_scale_t",
                "1.0",
                "--interpolation_scale_h",
                "1.0",
                "--interpolation_scale_w",
                "1.0",
                "--gradient_checkpointing",
                "--train_batch_size",
                "1",
                "--dataloader_num_workers",
                "16",
                "--gradient_accumulation_steps",
                "1",
                // "--max_train_steps","100" ,
                "--learning_rate",
                "1e-5",
                "--lr_scheduler",
                "constant",
                "--lr_warmup_steps",
                "0",
                "--mixed_precision=bf16",
                "--report_to=tensorboard",
                "--checkpointing_steps=500",
                "--allow_tf32",
                "--model_max_length",
                "512",
                "--use_ema",
                "--ema_start_step",
                "0",
                "--cfg",
                " 0.1",
                "--resume_from_checkpoint=latest",
                "--speed_factor",
                "1.0",
                "--ema_decay",
                " 0.9999",
                "--drop_short_ratio",
                "0.0",
                // "--pretrained",
                // "",
                "--hw_stride",
                "32",
                "--sparse1d",
                "--sparse_n",
                "4",
                "--train_fps",
                "16",
                "--seed",
                "1234",
                "--trained_data_global_step",
                "0",
                "--group_data",
                "--use_decord",
                "--prediction_type",
                "v_prediction",
                "--snr_gamma",
                "5.0",
                "--force_resolution",
                "--rescale_betas_zero_snr",
                "--output_dir",
                "/mnt/singularity_home/jsha/repos/Open-Sora-Plan/output/fine_tuning/encoded-videos",
                "--pretrained=/mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/Open-Sora-Plan-v1.3.0/any93x640x640",
                "--num_train_epochs=1000",
                "--checkpoints_total_limit=10"
                // "--sp_size=2", 
                // "--train_sp_batch_size=1"
            ],
@LinB203
Member

LinB203 commented Oct 29, 2024

I think that's normal. Normally it takes 4s for 93x352x640, and your videos are much shorter.

@junsukha
Author

junsukha commented Oct 29, 2024

@LinB203 thx for the reply!
4s per step for 93x352x640? I see.

In my case, one step in the training phase means processing two videos (data samples) simultaneously, since I'm using two GPUs with a batch size of 1 per GPU. So it takes 2.x secs per step, or per video (240 secs / 93 steps ≈ 2.x secs/step, because 1 epoch of 93 steps takes only around 4 minutes, as I mentioned before).
(If this doesn't make sense, just ignore it; I think I explained it poorly. Or please correct me if I'm wrong.)

My question is:
When I sample a video using the config below, it normally takes around 1-2 minutes to generate one video. Why does inference take so long compared to the training phase, where I think it takes 2.x seconds per video?

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --master_port 29514 \
    -m opensora.sample.sample \
    --model_path path_to_check_point_model_ema \
    --version v1_3 \
    --num_frames 33 \
    --height 352 \
    --width 640 \
    --cache_dir "../cache_dir" \
    --text_encoder_name_1 "/storage/ongoing/new/Open-Sora-Plan/cache_dir/mt5-xxl" \
    --text_prompt "examples/prompt.txt" \
    --ae WFVAEModel_D8_4x8x8 \
    --ae_path "/storage/lcm/WF-VAE/results/latent8" \
    --save_img_path "./train_1_3_nomotion_fps18" \
    --fps 16 \
    --guidance_scale 7.5 \
    --num_sampling_steps 100 \
    --max_sequence_length 512 \
    --sample_method EulerAncestralDiscrete \
    --seed 1234 \
    --num_samples_per_prompt 1 \
    --rescale_betas_zero_snr \
    --prediction_type "v_prediction" 

@LinB203
Member

LinB203 commented Oct 30, 2024

You can see --num_sampling_steps 100, which means it uses 100 steps to generate each video.
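
A rough back-of-the-envelope (the per-forward split below is an assumption, not a measurement from this setup) of why 100 sampling steps lands around 1-2 minutes per video:

# Only num_sampling_steps and guidance_scale come from the command above; the rest is assumed.
train_step_s = 240 / 93                # ~2.6 s per optimization step, from the 4-minute epoch reported above
denoiser_fwd_s = train_step_s / 4      # assumption: a plain forward is roughly a quarter of forward+backward+update
cfg_factor = 2                         # guidance_scale 7.5 -> conditional + unconditional forward per sampling step
num_sampling_steps = 100
print(f"~{num_sampling_steps * cfg_factor * denoiser_fwd_s:.0f} s + text encoding + VAE decode per video")
# -> ~130 s, i.e. on the order of 1-2 minutes, versus one forward/backward per training step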

@junsukha
Author

junsukha commented Oct 30, 2024

You can see --num_sampling_steps 100, which means it uses 100 steps to generate each video.

@LinB203 thx for the reply!

num_sampling_steps is, I believe, the number of denoising steps. But doesn't a step in the training phase have a different meaning? The step I'm referring to in the training phase is this:
[image: the fine-tuning progress bar]

So you're saying one step in the training phase (the progress bar in the image above) is basically one denoising step, just like a sampling step (parameter --num_sampling_steps) in the inference phase?

The output Total optimization steps (num_update_steps_per_epoch) = 93 means, I think, that it takes 93 steps to go through all the input videos once for training. If "step" here refers to a denoising step, that doesn't quite make sense to me, because it would mean only 93 denoising steps are needed to use all the input videos once for training?

UPDATE

Oh, I think I got it. You're right.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps) from train_t2v_diffusers.py says num_update_steps_per_epoch is just the number of batches (len(train_dataloader)), since I'm using 1 for args.gradient_accumulation_steps.
Since I'm using 2 GPUs with a batch size of 1 per GPU, my total batch size is 2. So num_examples (187) divided by the total batch size (2) gives 93 (after rounding down), so num_update_steps_per_epoch is 93.
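
As a quick sanity check, here's that bookkeeping with the numbers from the log above (how the grouped sampler handles the leftover sample is my assumption, not something I verified in the code):

import math

num_examples = 187                  # Counter above: 170 + 17 samples after filtering
total_train_batch_size = 2          # 2 GPUs x batch size 1 per GPU x 1 grad-accum step
gradient_accumulation_steps = 1

batches_per_epoch = num_examples // total_train_batch_size              # 93; assumes the odd sample is dropped/grouped away
num_update_steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)   # 93, matching the log
total_optimization_steps = num_update_steps_per_epoch * 1000            # 93,000 with --num_train_epochs=1000, matching the log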

Also, there's one training step per batch according to the code (I think), which makes 93 steps in total per epoch (93 batches in an epoch). One step here is a denoising step for a single sampled timestep. So one step in the training phase is basically the same amount of work as one step in the sampling phase (--num_sampling_steps).
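
A minimal sketch of the two loops (hypothetical function names and signatures, not the actual code from this repo), just to make the comparison concrete: one training step is roughly one denoiser forward (+ backward) at a single random timestep, while generating one video is num_sampling_steps denoiser forwards plus a VAE decode:

import torch

def training_step(model, scheduler, latents, text_emb):
    # One optimization step = one denoiser forward (+ backward) at a single random timestep per sample.
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    target = scheduler.get_velocity(latents, noise, t)      # --prediction_type v_prediction
    loss = torch.nn.functional.mse_loss(model(noisy, t, text_emb), target)
    loss.backward()
    return loss

@torch.no_grad()
def sample(model, scheduler, shape, text_emb, num_sampling_steps=100):
    # One generated video = num_sampling_steps denoiser forwards (input scaling, CFG, and the VAE decode omitted).
    latents = torch.randn(shape)
    scheduler.set_timesteps(num_sampling_steps)
    for t in scheduler.timesteps:                            # 100 iterations with --num_sampling_steps 100
        latents = scheduler.step(model(latents, t, text_emb), t, latents).prev_sample
    return latents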
