Comparing Essays using GPT3.5 #23
Cormanz
started this conversation in
Experiments
Replies: 1 comment 1 reply
-
Did you try having GPT4 grade the essays?
-
The Problem
Something I've been thinking about is a "good enough way" to quickly compare the outputs of two versions of SmartGPT. My plan is to create a set of tasks for the AI, of five types:
Then, we would evaluate each output on explicit correctness where applicable, and otherwise on what another LLM (GPT3.5) thinks when comparing the two responses.
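As a rough sketch of that evaluation plan (not SmartGPT's actual code), the two paths could look like the following. The names `evaluate`, `Judge`, and `always_second` are hypothetical, and the judge is a stub standing in for a real GPT3.5 call. The sketch also asks the judge twice with the essay order swapped and only counts a win when both orderings agree, which is an assumption about how to handle position bias rather than something the plan specifies.

```python
from typing import Callable, Optional

# Hypothetical judge signature: takes (task, essay_one, essay_two) and
# returns "one" or "two". A real version would wrap a call to GPT3.5.
Judge = Callable[[str, str, str], str]


def evaluate(task: str, old: str, new: str,
             expected: Optional[str] = None,
             judge: Optional[Judge] = None) -> str:
    """Return "old", "new", or "tie" for two SmartGPT outputs."""
    if expected is not None:
        # Explicit correctness: compare each output to the known answer.
        old_ok = old.strip() == expected.strip()
        new_ok = new.strip() == expected.strip()
        if old_ok == new_ok:
            return "tie"
        return "new" if new_ok else "old"

    if judge is None:
        raise ValueError("a judge is required when no expected answer is given")

    # Ask the judge twice, swapping which output is shown first.
    first = judge(task, old, new)    # old is Essay One
    second = judge(task, new, old)   # new is Essay One
    first_winner = "old" if first == "one" else "new"
    second_winner = "new" if second == "one" else "old"
    # Only count a win when both orderings agree.
    return first_winner if first_winner == second_winner else "tie"


# Stub judge that always prefers whichever essay comes second.
def always_second(task: str, one: str, two: str) -> str:
    return "two"


print(evaluate("2+2?", "4", "5", expected="4"))           # old
print(evaluate("essay", "A", "B", judge=always_second))   # tie
```

A judge that only ever favors a position, like the stub here, ends up as a tie instead of a false win for either version.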
To quickly compare this, I gave both GPT3.5 and GPT4 the question of
Write a 150-word MLA-format 4-source essay on M&M health detriments. Ensure high quality.
in Assistant Mode, using SmartGPT with just one employee, no managers. Then, I used a prompt to have ChatGPT (GPT3.5, not using SmartGPT) compare the two essays. The results are surprising, and honestly disappointing.
Essays
GPT3.5 Essay:
GPT4 Essay:
Results
For the first three rounds, GPT3.5 is Essay One and GPT4 is Essay Two in the prompt. The order is swapped for the last three rounds, because I wanted to minimize the bias of having one essay be first or last.
Round One:
GPT3.5 Scores: 5, 4, 4, 3 (17)
GPT4 Scores: 5, 6, 4, 4 (21)
Round Two:
GPT3.5 Scores: 7, 6, 7, 6 (26)
GPT4 Scores: 9, 8, 9, 9 (35)
Round Three:
GPT3.5 Scores: 6, 8, 5, 5 (24)
GPT4 Scores: 6, 6, 5, 5 (22)
For the next three rounds, GPT4 is Essay One, GPT3.5 is Essay Two in the prompt.
Round Four:
GPT4 Scores: 5, 8, 4, 3 (20)
GPT3.5 Scores: 5, 7, 5, 3 (20)
Round Five:
GPT4 Scores: 5, 8, 6, 7 (24)
GPT3.5 Scores: 8, 8, 9, 8 (33)
Round Six:
GPT4 Scores: 6, 8, 6, 5 (25)
GPT3.5 Scores: 6, 8, 8, 7 (29)
Averaged Scores across All Rounds:
GPT3.5: 24.8
GPT4: 24.5
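For anyone who wants to re-run the arithmetic, here is a small sketch that recomputes the averages from the totals recorded above and also groups them by prompt position. The `ROUNDS` layout and the helper name `average_total` are just illustrative choices. The last check lists the rounds where the recorded total differs from the sum of the four scores shown; the post doesn't say which figure is right there, so the recorded totals are used as-is.

```python
from statistics import mean

# (round, model, prompt position, per-criterion scores, total as recorded)
ROUNDS = [
    (1, "GPT3.5", "one", (5, 4, 4, 3), 17),
    (1, "GPT4",   "two", (5, 6, 4, 4), 21),
    (2, "GPT3.5", "one", (7, 6, 7, 6), 26),
    (2, "GPT4",   "two", (9, 8, 9, 9), 35),
    (3, "GPT3.5", "one", (6, 8, 5, 5), 24),
    (3, "GPT4",   "two", (6, 6, 5, 5), 22),
    (4, "GPT4",   "one", (5, 8, 4, 3), 20),
    (4, "GPT3.5", "two", (5, 7, 5, 3), 20),
    (5, "GPT4",   "one", (5, 8, 6, 7), 24),
    (5, "GPT3.5", "two", (8, 8, 9, 8), 33),
    (6, "GPT4",   "one", (6, 8, 6, 5), 25),
    (6, "GPT3.5", "two", (6, 8, 8, 7), 29),
]


def average_total(key_index: int, key: str) -> float:
    """Average the recorded totals of all rows whose field matches key."""
    return round(mean(r[4] for r in ROUNDS if r[key_index] == key), 1)


by_model = {m: average_total(1, m) for m in ("GPT3.5", "GPT4")}
by_position = {p: average_total(2, p) for p in ("one", "two")}

# Rounds where the recorded total is not the sum of the listed scores.
mismatched = [(rnd, model) for rnd, model, _, scores, total in ROUNDS
              if sum(scores) != total]

print(by_model)      # {'GPT3.5': 24.8, 'GPT4': 24.5}
print(by_position)   # {'one': 22.7, 'two': 26.7}
print(mismatched)    # [(1, 'GPT3.5'), (1, 'GPT4'), (5, 'GPT4')]
```

On this small sample, whichever essay was shown second averaged about four points higher than the one shown first, while the per-model averages are nearly identical.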
Reflection
This experiment was eye-opening for me: GPT3.5 isn't capable of comparing essays by itself in a consistent way. It constantly favors one or the other, and when reading some of its responses, a lot of the reasoning feels misjudged or made up, as if it were reasoning about a non-existent essay.
I hope a better (automated) way exists to use LLMs to compare the quality of two essays.
Prompt