
OkVQA Evaluation #40

Open
piyushkhanna00705 opened this issue Nov 28, 2023 · 1 comment
piyushkhanna00705 commented Nov 28, 2023

Thanks for the great work! I love how interpretable ViperGPT is! I am trying to evaluate the results on the OkVQA dataset, but I am facing a similar issue to Issue #24, where the model generates a full sentence instead of the short (one-word) answer required for it to count as correct under exact-match accuracy. I also tried being a bit "lenient" when calculating the accuracy, marking a prediction as correct if the answer word appears anywhere in the model's full-sentence prediction, but I still got an accuracy lower than the one reported in the paper.

Here are the evaluation metrics from my experiments:
Exact-match accuracy (a prediction is wrong unless it exactly matches the answer): 9.435%
"Lenient" accuracy (a prediction is correct if the answer word appears anywhere in the model's full-length prediction): 21.62%

I am using GPT-3.5 for code generation and blip2-flan-t5-xl for visual queries. Could using blip2-flan-t5-xl instead of blip2-flan-t5-xxl have caused such a large drop in accuracy? I would have expected the "lenient" accuracy to be at least as high as the number reported in the paper, since it may even count some answers as correct when they are not.
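
For reference, here is a minimal sketch of how the two metrics above can be computed; the function and variable names (`prediction`, `gt_answers`) are placeholders for illustration and are not taken from the ViperGPT codebase:

```python
def exact_match(prediction: str, gt_answers: list[str]) -> bool:
    """Correct only if the normalized prediction equals one of the ground-truth answers."""
    pred = prediction.strip().lower()
    return any(pred == ans.strip().lower() for ans in gt_answers)


def lenient_match(prediction: str, gt_answers: list[str]) -> bool:
    """Correct if any ground-truth answer appears as a substring of the full-sentence prediction."""
    pred = prediction.strip().lower()
    return any(ans.strip().lower() in pred for ans in gt_answers)


def accuracy(predictions: list[str], answers: list[list[str]], match_fn) -> float:
    """Percentage of examples where match_fn considers the prediction correct."""
    correct = sum(match_fn(p, a) for p, a in zip(predictions, answers))
    return 100.0 * correct / len(predictions)
```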

surisdi (Contributor)
surisdi commented Dec 22, 2023

Hi, we have updated the repository with the evaluation code. Additionally, a drop in performance is expected if BLIP-2 Flan-T5 XL is used instead of XXL, and also if GPT-3.5 is used instead of Codex (which we used in our experiments). We did not run the experiments with GPT-3.5, so we do not have numbers for how much not using Codex affects the results, but qualitatively GPT-3.5 is not as good (it may just be a matter of prompt engineering, since GPT-3.5 is not code-specific).

That said, I would suggest using our evaluation code so that there are fewer differences with respect to our experiments, which makes it easier to narrow down where the gap comes from.
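
For context, OK-VQA is usually scored with the standard VQA-style soft accuracy rather than strict exact match. The sketch below shows that metric in simplified form; it is not the repository's actual evaluation script, and the official implementation additionally averages over leave-one-out subsets of the ten human answers and normalizes punctuation and articles:

```python
def vqa_soft_accuracy(prediction: str, gt_answers: list[str]) -> float:
    """Simplified VQA-style soft accuracy: a predicted answer earns full credit
    if at least 3 of the (typically 10) annotators gave that answer."""
    pred = prediction.strip().lower()
    matches = sum(pred == ans.strip().lower() for ans in gt_answers)
    return min(matches / 3.0, 1.0)
```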
