
PoT Evaluation code #1

Open
Coobiw opened this issue Jul 22, 2024 · 5 comments
Coobiw commented Jul 22, 2024

Hi, thanks for your great work! Will you release your PoT evaluation code, or share some details about it for the ChartQA test split? I would like to reproduce this result. Thanks for your advice and reply!

Coobiw (Author) commented Jul 22, 2024

With my own implementation below, ChartGemma reaches 74.64 on ChartQA (64.0 human + 85.28 aug). Is there any issue with my implementation? Thanks for your advice and reply!

```python
import io
import sys

def execute_python_code(code):
    # Redirect stdout so we can capture whatever the generated program prints.
    old_stdout = sys.stdout
    new_stdout = io.StringIO()
    sys.stdout = new_stdout

    status = True
    try:
        exec(code)
    except Exception:
        status = False
    finally:
        # Always restore the original stdout, even if exec() raised.
        sys.stdout = old_stdout

    output = new_stdout.getvalue() if status else None
    return output, status

response, status = execute_python_code(response)
if status:
    answer = response
    print(answer)
else:
    answer = ""
    print("error running...")
```

AhmedMasryKU (Collaborator) commented

Hi @Coobiw

I am cleaning up the remaining codebase and will try to release it when I get some time. In the meantime, here are some ideas we used to optimize performance on the validation set before running the model on the test set:

  1. Use the following prompt: "program of thought:" + question.
  2. After executing the code and getting the output text, change "True" and "False" to "Yes" and "No".
  3. Clean the output string from unnecessary characters (\n, ')
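Taken together, the post-processing steps above might look like the following sketch (`clean_output` is a hypothetical helper name, not from the released code, and the exact character set cleaned may differ):

```python
def clean_output(raw: str) -> str:
    """Normalize text produced by executing the generated program.

    Hypothetical helper illustrating steps 2-3 above; the official
    implementation may clean a different set of characters.
    """
    text = raw.strip()                  # drop surrounding whitespace/newlines
    text = text.replace("True", "Yes")  # step 2: map booleans to Yes/No
    text = text.replace("False", "No")
    text = text.strip("\n'")            # step 3: stray newlines and quotes
    return text

print(clean_output("'True'\n"))  # -> Yes
```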

Also, we used the following implementation of the evaluation metric: https://github.com/vis-nlp/UniChart/blob/bd6004bc8fe9ef8ce9a6cdfd88712f845d78b918/model/chartqa_model.py#L36
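For context, the linked UniChart metric implements ChartQA-style relaxed accuracy: numeric answers count as correct within a 5% relative tolerance, and all other answers require an exact (case-insensitive) string match. A minimal sketch under that assumption — the official code may differ in details such as percent-sign or comma handling:

```python
def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed match (sketch, not the official code).

    Numeric answers are correct within `tolerance` relative error;
    everything else must match exactly, case-insensitively.
    """
    try:
        pred_num = float(prediction.strip().rstrip("%"))
        tgt_num = float(target.strip().rstrip("%"))
        if tgt_num == 0.0:
            return pred_num == tgt_num
        return abs(pred_num - tgt_num) / abs(tgt_num) <= tolerance
    except ValueError:
        # Non-numeric answers: exact case-insensitive match.
        return prediction.strip().lower() == target.strip().lower()
```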

Coobiw (Author) commented Jul 23, 2024

Using the following code, it reaches 76.56 (67.84 human + 85.28 aug):

```python
answer = answer.replace("True", "Yes").replace("False", "No")
answer = answer.strip()
```

> Clean the output string from unnecessary characters (\n, ')

Is `strip()` enough to do this?
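For reference, `str.strip()` with no argument removes only surrounding whitespace (spaces, tabs, newlines); it does not remove quote characters. Passing a character set to `strip()` handles both, as this small demo shows:

```python
raw = "\n'Yes'\n"
# No argument: only surrounding whitespace is removed.
print(repr(raw.strip()))       # "'Yes'" -- the quotes survive
# With a character set: newlines and quotes are both removed from the ends.
print(repr(raw.strip("\n'")))  # 'Yes'
```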

AhmedMasryKU (Collaborator) commented

I will clean up and share the code that you can use to reproduce the results by this weekend. Sorry, I am a bit busy today and tomorrow.

Coobiw (Author) commented Jul 24, 2024

OK! Thanks for your help!!
