Problem running Test_only mode #24

Open
stasj145 opened this issue Oct 20, 2022 · 4 comments

Comments

@stasj145

Hi George, I really like the project! I have been trying it out for a couple of weeks now, training multiple models, including some on my own datasets. However, while training works without any problems, I have not been able to get the test_only mode running. I keep getting this error:

per_batch['predictions'].append(predictions.cpu().numpy())
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.

I have used the following commands:
Training:
python src/main.py --output_dir .\experiments --comment "regression from Scratch" --name custom_regression --records_file Regression_records.xls --data_dir ..\Datasets\CUSTOM --data_class tsra --pattern TRAIN --val_pattern TEST --epochs 100 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task regression

Testing (not working):
python src/main.py --output_dir .\experiments --comment "regression from Scratch" --name Custom_regression --records_file Regression_records.xls --data_dir ..\Datasets\CUSTOM --data_class tsra --pattern TRAIN --val_pattern TEST --epochs 100 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task regression --test_pattern TEST --test_only testset --load_model ./experiments/custom_regression_2022-10-20_17-05-04_MjH/checkpoints/model_best.pth

I have also tried the exact commands mentioned in this issue, which seem to work for the user who opened that issue, yet I still get the same error.

I have tested with both Python 3.7 and 3.8, with the normal requirements.txt as well as the failsafe_requirements.txt (using Anaconda).

At this point I am unsure what I am doing wrong and what else to try to get the test_only mode working.

@gzerveas (Owner) commented Oct 20, 2022

Hi,

Thanks for discovering this bug! I am not sure why this was working before and not now (perhaps a combination of the specific configuration you tried and differences in how torch versions handle things), but the solution is thankfully very simple. The problem with the existing code is that the output nodes are still part of the computational graph used for backpropagating loss gradients, even though backpropagation is not actually needed here: we don't want to update parameters, we only use the predictions for evaluation.
There are two ways of fixing this. The best way is to use the with torch.no_grad(): context manager to wrap the whole for loop of the model evaluation above line 331 and line 445, like this:

with torch.no_grad():
    for i, batch in enumerate(self.dataloader):
        ...
        epoch_loss += batch_loss  # add total loss of batch
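
For context, here is a minimal sketch of what the fully wrapped evaluation loop could look like; the batch structure, model call signature, and loss module below are assumptions for illustration, not the exact code in running.py:

with torch.no_grad():  # disable gradient tracking for the whole evaluation
    for i, batch in enumerate(self.dataloader):
        X, targets, padding_masks = batch  # assumed batch structure
        predictions = self.model(X, padding_masks)
        batch_loss = self.loss_module(predictions, targets)
        # tensors created under no_grad() have requires_grad == False,
        # so .cpu().numpy() works without an explicit .detach()
        per_batch['predictions'].append(predictions.cpu().numpy())
        epoch_loss += batch_loss.item()  # add total loss of batch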

To keep it consistent with how validation is done, instead of changing the evaluate functions internally, you can even more simply wrap the call in main.py at line 196, like this:

with torch.no_grad():
    aggr_metrics_test, per_batch_test = test_evaluator.evaluate(keep_all=True)

This should be enough, but if for whatever reason it doesn't work, you can use the second way: call .detach() to forcefully detach the output nodes from the computational graph before converting, like this:

per_batch['predictions'].append(predictions.detach().cpu().numpy())
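
As a quick standalone illustration of why both approaches avoid the error (a generic torch snippet, not code from this repository):

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2              # y is still attached to the autograd graph
# y.numpy()            # raises: Can't call numpy() on Tensor that requires grad
y.detach().numpy()     # second way: detach first, then convert

with torch.no_grad():  # first way: build the tensor without gradient tracking
    z = x * 2
z.numpy()              # works, since z.requires_grad is False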

I will push a fix sometime soon, but try it and let me know how it worked for you.

@stasj145 (Author)

Thanks for the quick reply! I have now tried out your recommended fixes. For whatever reason, your first idea of adding with torch.no_grad(): inside the evaluate function didn't end up fixing the problem. This didn't surprise me much, as I had already tried something very similar on my own, but I don't really know why it didn't work, because adding with torch.no_grad(): around the call in main.py at line 196 fixed the problem.

I did run into another small issue after that fix, though. In line 199, print_str += '{}: {:8f} | '.format(k, v), v was None for the key 'epoch', which led to the usual error from formatting None. I saw that you sometimes check for this with if v is not None:, as in line 177 of running.py, so I just added that check.
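
In case it helps, this is a minimal sketch of the guard I mean (the exact loop around line 199 of main.py may differ):

for k, v in aggr_metrics_test.items():
    if v is not None:  # 'epoch' can map to None in test_only mode
        print_str += '{}: {:8f} | '.format(k, v)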

With those changes the test_only mode now works flawlessly for me!

@jingzbu commented Jan 16, 2023

@stasj145 Thanks. I encountered the same issues with my own data and solved them with exactly the same fixes.

richarddli added a commit to richarddli/mvts_transformer that referenced this issue Aug 27, 2023
@richarddli

I can confirm as well that this fixes the issue. I've pushed the recommended changes to my fork here: https://github.com/richarddli/mvts_transformer/tree/sktime0.22, which also has some minor patches to run on modern sktime, etc. (see this draft: #56).
