Problem running Test_only mode #24

Open
stasj145 opened this issue Oct 20, 2022 · 4 comments

Comments

@stasj145

Hi George, I really like the project! I have been trying it out for a couple of weeks now, training multiple models, including some on my own datasets. However, while training works without any problems, I have not been able to get the test_only mode running. I keep getting this error:

per_batch['predictions'].append(predictions.cpu().numpy())
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.

I have used the following commands:
Training:
python src/main.py --output_dir .\experiments --comment "regression from Scratch" --name custom_regression --records_file Regression_records.xls --data_dir ..\Datasets\CUSTOM --data_class tsra --pattern TRAIN --val_pattern TEST --epochs 100 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task regression

Testing (not working):
python src/main.py --output_dir .\experiments --comment "regression from Scratch" --name Custom_regression --records_file Regression_records.xls --data_dir ..\Datasets\CUSTOM --data_class tsra --pattern TRAIN --val_pattern TEST --epochs 100 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task regression --test_pattern TEST --test_only testset --load_model ./experiments/custom_regression_2022-10-20_17-05-04_MjH/checkpoints/model_best.pth

I have also tried the exact commands mentioned in this issue, which seem to work for the user who opened that issue, yet I still get the same error.

I have tested with both Python 3.7 and 3.8, with the normal requirements.txt as well as the failsafe_requirements.txt (using Anaconda).

At this point I am unsure what I am doing wrong and what else to try to get the test_only mode working.

@gzerveas (Owner) commented Oct 20, 2022

Hi,

Thanks for discovering this bug! I am not sure why this was working before and not now (perhaps a combination of the specific configuration you tried and differences in how torch versions handle things), but the solution is thankfully very simple. The problem with the existing code is that the output nodes are still part of the computational graph used for backpropagating loss gradients, even though backpropagation is not actually needed here: we don't want to update parameters, we only use the predictions for evaluation.
There are two ways of fixing this. The best way is to use the with torch.no_grad(): context manager to wrap the whole for loop of the model evaluation above line 331 and line 445, like this:

with torch.no_grad():
    for i, batch in enumerate(self.dataloader):
        ...
        epoch_loss += batch_loss  # add total loss of batch
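
For context, here is a minimal sketch of what the fully wrapped evaluation loop could look like; the batch structure, model call signature, and loss module below are assumptions for illustration, not the exact code in running.py:

with torch.no_grad():  # disable gradient tracking for the whole evaluation
    for i, batch in enumerate(self.dataloader):
        X, targets, padding_masks = batch  # assumed batch structure
        predictions = self.model(X, padding_masks)
        batch_loss = self.loss_module(predictions, targets)
        # tensors created under no_grad() have requires_grad == False,
        # so .cpu().numpy() works without an explicit .detach()
        per_batch['predictions'].append(predictions.cpu().numpy())
        epoch_loss += batch_loss.item()  # add total loss of batch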

To keep it consistent with how validation is done, instead of changing the evaluate functions internally, you can even more simply wrap the call in main.py at line 196, like this:

with torch.no_grad():
    aggr_metrics_test, per_batch_test = test_evaluator.evaluate(keep_all=True)

This should be enough, but if for whatever reason it doesn't work, you can use the second way: call .detach() to forcefully detach the output nodes from the computational graph before converting, like this:

per_batch['predictions'].append(predictions.detach().cpu().numpy())
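
As a quick standalone illustration of why both approaches avoid the error (a generic torch snippet, not code from this repository):

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2              # y is still attached to the autograd graph
# y.numpy()            # raises: Can't call numpy() on Tensor that requires grad
y.detach().numpy()     # second way: detach first, then convert

with torch.no_grad():  # first way: build the tensor without gradient tracking
    z = x * 2
z.numpy()              # works, since z.requires_grad is False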

I will push a fix sometime soon, but try it and let me know how it worked for you.

@stasj145 (Author)

Thanks for the quick reply! I have now tried out your recommended fixes. For whatever reason, your first idea of adding with torch.no_grad(): inside the evaluate function didn't end up fixing the problem. This didn't surprise me much, as I had already tried something very similar on my own, but I don't really know why it didn't work, because adding with torch.no_grad(): around the call in main.py at line 196 fixed the problem.

I did run into another small issue after that fix, though. In line 199, print_str += '{}: {:8f} | '.format(k, v), v was None for the key 'epoch', which led to the usual error from formatting None. I saw that you sometimes check for this with if v is not None:, as in line 177 of running.py, so I just added that check.
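
In case it helps, this is a minimal sketch of the guard I mean (the exact loop around line 199 of main.py may differ):

for k, v in aggr_metrics_test.items():
    if v is not None:  # 'epoch' can map to None in test_only mode
        print_str += '{}: {:8f} | '.format(k, v)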

With those changes the test_only mode now works flawlessly for me!

@jingzbu commented Jan 16, 2023

@stasj145 Thanks. I encountered the same issues with my own data and solved them with exactly the same fixes.

richarddli added a commit to richarddli/mvts_transformer that referenced this issue Aug 27, 2023
@richarddli

I can confirm as well that this fixes the issue. I've pushed the recommended changes to my fork here: https://github.com/richarddli/mvts_transformer/tree/sktime0.22, which also has some minor patches to run on modern sktime, etc. (see this draft: #56).
