Cannot reproduce Virchow2 segmentation results #733

Open
afilt opened this issue Dec 10, 2024 · 5 comments


afilt commented Dec 10, 2024

Hello!
I'm currently benchmarking some in-house models with eva, which is working very smoothly 🥇
However, I get quite low Dice scores on the segmentation tasks (even below Lunit) on Consep and MoNuSAC.
To check whether my models are actually under-performing, I tried to reproduce the Virchow2 results from the screenshot shown here on your website.

For Virchow2 on Consep I get a G-Dice of 0.693 (0.001) instead of the 0.723 in the screenshot.
For Virchow2 on MoNuSAC I get a G-Dice of 0.594 (0.006) instead of the 0.713 in the screenshot.
Which eva version / main commit were the results generated with?

I'm using the offline/segmentation configurations; should I switch to online/segmentation? FYI, I didn't change the configurations at all and forked the repo last Saturday (commit d0f5a03). Thanks for your answer!

cc @ioangatop @roman807 Could you maybe run it on your side?

To reproduce (directly taken from here):

```sh
TASK="consep"
# Model, normalization and feature-dimension settings are passed
# to eva as environment variables:
MODEL_NAME="pathology/paige_virchow2" \
NORMALIZE_MEAN="[0.485,0.456,0.406]" \
NORMALIZE_STD="[0.229,0.224,0.225]" \
IN_FEATURES=1280 \
eva predict_fit --config configs/vision/pathology/offline/segmentation/${TASK}.yaml
```
@nkaenzig (Collaborator)

Hi @afilt,

There are two reasons for the different results, both linked to recent changes:

  1. In a recent PR we updated the dice metric, which leads to lower metric values in general because the metric implementation/definition is slightly different; see Replace GeneralizedDiceScore by DiceScore & fix class-wise metrics #719. There is a pending PR to update the leaderboard in the docs, which hasn't been merged yet.

  2. For the segmentation leaderboard we used the online configs, i.e. eva fit --config configs/vision/pathology/online/segmentation/${TASK}.yaml. For segmentation tasks we decided to also feed the original image, in addition to the last ViT feature map, into the decoder, to make the evaluation less sensitive to the patch size of the chosen ViT architecture.

In the meantime, you can either:
a. Continue using the version from main, run eva fit with the online config as mentioned in 2. above, and compare against the board in this PR: #734
b. Install version 0.1.6 of eva and use it in conjunction with the .yaml configs from before this PR was merged.
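
For example (assuming you install eva from PyPI, where the package is published as kaiko-eva; treat the name as an assumption if your setup differs):

```sh
# Option (a): stay on current main and evaluate with the online config
# (set MODEL_NAME / NORMALIZE_* / IN_FEATURES as in your reproduce command above)
TASK="consep"
eva fit --config configs/vision/pathology/online/segmentation/${TASK}.yaml

# Option (b): install the 0.1.6 release, which predates the metric change (#719)
# PyPI package name assumed to be kaiko-eva
pip install kaiko-eva==0.1.6
```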

I'll open an issue to update the documentation & instructions to reproduce the segmentation metrics in the leaderboard; sorry for the confusion.


afilt commented Dec 23, 2024

Hello @nkaenzig,

Thank you for taking the time to provide those details; it's very clear.
Considering your PR #734 will be merged in the coming days or weeks and will update the general main leaderboard, I will run the segmentation task with the current main and use the online configurations (option "a"). Does that mean the command to reproduce the leaderboard results from PR #734 is simply (for instance for Virchow2):

```sh
TASK="consep"
MODEL_NAME="pathology/paige_virchow2" \
NORMALIZE_MEAN="[0.485,0.456,0.406]" \
NORMALIZE_STD="[0.229,0.224,0.225]" \
IN_FEATURES=1280 \
eva predict_fit --config configs/vision/pathology/online/segmentation/${TASK}.yaml
```

? Thanks a lot!

Last question: how long do you expect the online configuration to run (e.g. for a ViT-Base)?

@nkaenzig (Collaborator)

Hi @afilt,

> Does that mean the command to reproduce the leaderboard results from PR #734 is simply (for instance for Virchow2):

Yes, almost: you just need to replace eva predict_fit with eva fit. For the online configs the predict step isn't necessary, because the embeddings are generated on the fly ("online") during fit.
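
Concretely, that is your command from above with predict_fit swapped for fit:

```sh
TASK="consep"
MODEL_NAME="pathology/paige_virchow2" \
NORMALIZE_MEAN="[0.485,0.456,0.406]" \
NORMALIZE_STD="[0.229,0.224,0.225]" \
IN_FEATURES=1280 \
eva fit --config configs/vision/pathology/online/segmentation/${TASK}.yaml
```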

The runtimes depend a lot on the size of the ViT architecture and the hardware being used. I'd guess that an online evaluation on consep, for instance with a ViT-B/16, shouldn't take more than 30 min on an A100.


afilt commented Dec 23, 2024

Thank you @nkaenzig!
Regarding slide/tile classification, were there any major changes since main commit d0f5a03 (it seems not)? I just want to know whether I should run the benchmarks again. Thanks!

@nkaenzig (Collaborator)

No major changes for slide/tile classification, only segmentation :)
