About parameter settings during training #21

Open
Annmixiu opened this issue Jul 22, 2023 · 4 comments

Comments

@Annmixiu

Hello, and thank you for the method you have provided. When I reproduce it on a regression model I see a large performance drop. May I ask you a few questions?
(1) After loading the pretrained model, the algorithm compresses while it trains. The training/validation performance and loss during this process are shown in the screenshot below; in normal model training these curves would be considered abnormal. Is this also abnormal for the compression stage?
[screenshot: training/validation performance and loss curves during compression]
(2) In the example you provide, once the target compression ratio is reached, training still continues until the configured epoch. Does that continued training recover performance on top of the compression?
Thank you very much for reading and replying. I wish you all the best in work and life!

@gdh1995
Collaborator

gdh1995 commented Jul 23, 2023

(1) It looks like the pruning broke the model. I suggest trying a longer pruning interval: for example, the prune_interval parameter of the resrep method in prune.py means pruning once after that many iterations, not epochs, so when the dataset is large I recommend increasing it.

(2) With resrep? The iteration at which pruning happens loses some accuracy, and continued training afterwards usually recovers part of it.
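
(For illustration only, a minimal sketch of the relationship; prune_interval is the parameter named above, while the sample count and batch size are assumed placeholders:)

    # Sketch only: prune_interval counts optimizer iterations, not epochs.
    # All numbers below are assumed placeholders for illustration.
    num_train_samples = 1_000_000          # e.g. a large speech dataset
    batch_size = 32
    steps_per_epoch = num_train_samples // batch_size
    prune_interval = 200                    # value from the provided resrep example
    prunes_per_epoch = steps_per_epoch / prune_interval
    print(steps_per_epoch, prunes_per_epoch)  # 31250 steps/epoch -> ~156 prunings/epoch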

@Annmixiu
Author

(1) Got it, thank you. In the resrep example you provide, compression is performed once every 200 iterations. My dataset is much larger, so I modified part of your training code as follows:

    # begin epoch loop

    print("training...")
    for epoch in range(0, self.config["epoch"]):

        # setting lr
        if epoch <= self.config["warmup_epoch"]:
            lr = 0.005
        else:
            lr = 0.005 * (0.995 ** ((epoch - 1) // 2))
        self.config["lr"] = lr

        self.variable_dict["epoch"] = epoch
        self.run_hook(self.epoch_begin_hook)
        self.variable_dict["avg_mentor"] = AvgMeter()  # operate and update average
        self.model.train()
        # max_step = len(self.trainloader)
        # print("This is the max_step", max_step)
        for step, data in enumerate(tqdm(self.trainloader)):
            # tqdm.write("Step: {}".format(step))
            self.variable_dict["step"] = step + 1
            self.variable_dict["iteration"] += 1
            self.run_hook(self.iteration_begin_hook)  # record prune_iteration
            data = self._sample_to_device(data, self.variable_dict["base_device"])
            c_sample_number = self._get_sample_number(data)  # get the num_sample
            predict = self.config["predict_function"](self.model, data)
            self.variable_dict["loss"] = self.config["calculate_loss_function"](
                predict, data
            )
            self.on_loss_backward()
            self.variable_dict["loss"].backward()
            self.after_loss_backward()
            # add the gradient of penalty with decay to the gradient of precision parameter for the compactor
            self.optimizer.step()  # update parameter
            self.optimizer.zero_grad()  # zero out gradient
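        # NOTE: everything from here down sits outside the step loop, so the
        # logged metrics use only the last batch and iteration_end_hook
        # (model prune, optimizer prune, FLOPs count) now fires once per epoch.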
        evaluate_result = self.config["evaluate_function"](predict, data)
        evaluate_result["loss"] = self.variable_dict["loss"].item()  # get high accuracy loss
        self.variable_dict["avg_mentor"].update(
            evaluate_result, c_sample_number
        )  # update average
        # del evaluate_result, predict
        # if self.variable_dict["iteration"] % self.config["log_interval"] == 0:
        self.write_log(
            self.variable_dict["epoch"],
            self.variable_dict["iteration"],
            self.variable_dict["avg_mentor"].get(),
            self.config["lr"],
        )
        self.write_tensorboard(
            "train_log",
            self.variable_dict["iteration"],
            self.variable_dict["avg_mentor"].get(),
        )
        self.run_hook(self.iteration_end_hook)  # model prune, optimizer prune, compute flops
        del self.variable_dict["loss"]
        self.scheduler.step()  # adjust lr

        # evaluation/testing step
        print("evaluating...")
        self.variable_dict["test_avg_mentor"] = AvgMeter()
        self.model.eval()
        # for step, data in enumerate(self.testloader):
        #     self.variable_dict["step"] = step + 1
        #     data = self._sample_to_device(data, self.variable_dict["base_device"])
        #     c_sample_number = self._get_sample_number(data)
        #     predict = self.config["predict_function"](self.model, data)
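        # NOTE: with the test loop above commented out, the lines below reuse
        # `predict` and `data` from the last training batch.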
        evaluate_result = self.config["evaluate_function"](predict, data)
        evaluate_result["loss"] = self.config["calculate_loss_function_test"](
            predict, data
        ).item()
        self.variable_dict["test_avg_mentor"].update(
            evaluate_result, c_sample_number
        )  # update precision result and num_sample
        if self.variable_dict["step"] % self.config["log_interval"] == 0:  # (useless at present) decide log_out_period, log_interval=20
            self.write_log(
                self.variable_dict["epoch"],
                self.variable_dict["step"],
                self.variable_dict["test_avg_mentor"].get(),
            )
        # del evaluate_result, predict
        self.write_log(
            self.variable_dict["epoch"],
            "final",
            self.variable_dict["test_avg_mentor"].get(),
            self.config["lr"],
        )
        self.write_tensorboard(
            "test_log",
            self.variable_dict["epoch"],
            self.variable_dict["test_avg_mentor"].get(),
        )
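
(For comparison, a minimal sketch of keeping the prune hook inside the step loop and gating it by an iteration count, which matches the interval-based behaviour described above; storing prune_interval in self.config is an assumption for illustration:)

    # Sketch only: inside the step loop, run the prune hook every
    # prune_interval iterations instead of once per epoch.
    if self.variable_dict["iteration"] % self.config["prune_interval"] == 0:
        self.run_hook(self.iteration_end_hook)  # prune model/optimizer, recount FLOPs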

To summarize: because the regression model for speech uses a very large dataset, I did not prune at the iteration count from your original setting but instead prune once per epoch. I will rewrite it to try a pruning frequency equivalent to 1.5 or 2-3 epochs' worth of iterations (see the sketch just below).
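
(For concreteness, a minimal sketch of converting that plan into an iteration-based interval; the trainloader name follows the training loop above, and the multipliers are just the values mentioned:)

    # Sketch only: express "prune every 1.5 / 2 / 3 epochs" as an iteration interval.
    steps_per_epoch = len(self.trainloader)                  # iterations in one epoch
    candidate_intervals = [int(k * steps_per_epoch) for k in (1.5, 2, 3)]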
(2) One more question: if the compression is working properly, should the training and validation loss decrease along a fairly smooth curve, just like normal training?
Thank you again for your repeated replies and help. I wish you all the best in work and life!

@gdh1995
Collaborator

gdh1995 commented Aug 5, 2023

All I remember is that for ResNet on CIFAR-10, the validation accuracy kept rising during training, dropped a little whenever pruning happened, and then kept rising again; I don't recall what the loss looked like.

@Annmixiu
Author

Annmixiu commented Aug 9, 2023

All I remember is that for ResNet on CIFAR-10, the validation accuracy kept rising during training, dropped a little whenever pruning happened, and then kept rising again; I don't recall what the loss looked like.

Got it, thank you for the explanation.
