Comparing Essays using GPT3.5 #23

Cormanz · 2023-05-10T21:27:20Z

Cormanz
May 10, 2023
Maintainer

The Problem

Something I've been thinking about is a "good enough" way to quickly compare the outputs of two versions of SmartGPT. My plan is to create a set of tasks for the AI, of five types:

1. Informative Writing | Write an essay about [TOPIC] in MLA citation, at least four sources.
2. Creative Writing | Plan and write a poem about [TOPIC].
3. Simple Informative Answer | [QUESTION]?
4. Use of Symbolic Systems | [MATH EQUATION] 
5. Output-Based Reflection | Can you fix [PROGRAM] so it runs?

Then, we would evaluate each response on explicit correctness where applicable, and otherwise on what another LLM (GPT3.5) thinks when comparing the two responses.
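A minimal sketch of how this task set and the choice of evaluation method could be represented. The `TaskType` class, the `checkable` flag, and the `evaluation_method` helper are hypothetical names, not part of SmartGPT. Which task types count as explicitly checkable is also an assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskType:
    name: str
    template: str
    # Assumption for this sketch: True means the answer can be checked
    # directly, False means an LLM judge compares the two responses.
    checkable: bool


# The five task types from the list above, with their prompt templates.
TASK_TYPES = [
    TaskType("Informative Writing",
             "Write an essay about {topic} in MLA citation, at least four sources.",
             False),
    TaskType("Creative Writing", "Plan and write a poem about {topic}.", False),
    TaskType("Simple Informative Answer", "{question}?", False),
    TaskType("Use of Symbolic Systems", "{equation}", True),
    TaskType("Output-Based Reflection", "Can you fix {program} so it runs?", True),
]


def evaluation_method(task: TaskType) -> str:
    """Pick explicit checking where applicable, otherwise LLM comparison."""
    return "explicit_check" if task.checkable else "llm_comparison"


for task in TASK_TYPES:
    print(f"{task.name}: {evaluation_method(task)}")
```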

To quickly test this, I gave both GPT3.5 and GPT4 the prompt "Write a 150-word MLA-format 4-source essay on M&M health detriments. Ensure high quality." in Assistant Mode, using SmartGPT with just one employee and no managers. Then, I used a prompt to have ChatGPT (GPT3.5, not using SmartGPT) compare the two essays. The results are surprising, and honestly disappointing.

Essays

GPT3.5 Essay:

M&M’s contain artificial colorings and ingredients that can cause health problems such as hyperactivity and allergies. They can also harm dental health with regular consumption. Scientists have linked chemical dyes to hyperactivity and learning impairment in children and several colorings used in M&M’s are classified as possibly carcinogenic (Inner Strength Bodywork). Mars, the company that produces M&M’s, has come under scrutiny for using genetically modified ingredients in their products, which can pose health risks to humans (Westchester Magazine). Additionally, frequent consumption of M&M’s can cause tooth decay (Is It Bad For You?). Finally, some dyes used in M&M’s are derived from petroleum and can lead to hyperactivity and allergic reactions (Food Nerd Inc.). Therefore, it is important for consumers to be aware of the potential health detriments of M&M’s when consuming them (PubMed).

GPT4 Essay:

Health Detriments of M&M’s

M&M’s, the popular chocolate candies, contain several detrimental ingredients (Wolfe). They possess artificial dyes (Yellow #6 and Blue #1), which have been linked to health issues, such as hyperactivity, cancer, migraines, and allergic reactions (Wolfe; Isitbadforyou.com). Furthermore, these candies contain sugar that may result in chronic health problems (Isitbadforyou.com).

Artificial food coloring, made from petroleum, has raised health concerns like cancer, immune disruption, and hypersensitivity (Foodnerdinc.com). A historical example of the negative impact of food coloring was the FDA’s ban of FD&C Red No. 2 in the 1970s after its discovery of links to cancer (LiveScience).

In conclusion, M&M’s health risks result from harmful ingredients like artificial dyes, and high sugar content. Consumers should consider alternatives with natural coloring and less sugar to maintain better health.

Works Cited

Foodnerdinc.com. ‘Four Incredibly Harmful Effects Artificial Dyes Have On Our Health.’

Isitbadforyou.com. ‘Are M&M’s Bad For You? - Here Is Your Answer.’

LiveScience. ‘Why Were Red M&M’s Discontinued for a Decade?’

Wolfe, David. ‘2 Disturbing Reasons Why You Should Never Eat M&M’s Ever Again.’

Results

For the first three rounds, GPT3.5 is Essay One and GPT4 is Essay Two in the prompt. The order is swapped afterwards to minimize the bias of one essay always being first or last.

Round One:

GPT3.5 Scores: 5, 4, 4, 3 (17)
GPT4 Scores: 5, 6, 4, 4 (21)

Round Two:

GPT3.5 Scores: 7, 6, 7, 6 (26)
GPT4 Scores: 9, 8, 9, 9 (35)

Round Three:

GPT3.5 Scores: 6, 8, 5, 5 (24)
GPT4 Scores: 6, 6, 5, 5 (22)

For the next three rounds, GPT4 is Essay One, GPT3.5 is Essay Two in the prompt.

Round Four:

GPT4 Scores: 5, 8, 4, 3 (20)
GPT3.5 Scores: 5, 7, 5, 3 (20)

Round Five:

GPT4 Scores: 5, 8, 6, 7 (24)
GPT3.5 Scores: 8, 8, 9, 8 (33)

Round Six:

GPT4 Scores: 6, 8, 6, 5 (25)
GPT3.5 Scores: 6, 8, 8, 7 (29)

Averaged Scores across All Rounds:

GPT3.5: 24.8
GPT4: 24.5
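For reference, the averages above can be reproduced from the round totals with a short sketch. It uses the totals in parentheses exactly as posted, not values recomputed from the per-factor scores, and the variable names are only illustrative.

```python
from statistics import mean

# Round totals exactly as posted above (the numbers in parentheses),
# in order from Round One to Round Six.
gpt35_totals = [17, 26, 24, 20, 33, 29]
gpt4_totals = [21, 35, 22, 20, 24, 25]

gpt35_avg = mean(gpt35_totals)
gpt4_avg = mean(gpt4_totals)

print(f"GPT3.5: {gpt35_avg:.1f}")
print(f"GPT4: {gpt4_avg:.1f}")
```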

Reflection

This experiment was eye-opening to me: GPT3.5 isn't capable of comparing essays by itself in a consistent way. It swings between favoring one essay or the other from round to round, and in reading some of its responses, a lot of the reasoning feels misjudged or made-up, as if it's reasoning about a non-existent essay.
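One way to look at this inconsistency is to regroup the posted round totals by prompt position (Essay One vs. Essay Two) instead of by model. The sketch below does that. The grouping and variable names are illustrative, and the totals are taken as posted in the Results section.

```python
from statistics import mean

# (essay_one_total, essay_two_total) per round, using the totals as posted.
# Rounds 1-3: Essay One = GPT3.5, Essay Two = GPT4.
# Rounds 4-6: Essay One = GPT4, Essay Two = GPT3.5.
rounds = [(17, 21), (26, 35), (24, 22), (20, 20), (24, 33), (25, 29)]

essay_one_avg = mean([first for first, _ in rounds])
essay_two_avg = mean([second for _, second in rounds])

# Count which prompt position scored higher in each round.
first_wins = sum(first > second for first, second in rounds)
second_wins = sum(second > first for first, second in rounds)
ties = len(rounds) - first_wins - second_wins

print(f"Essay One average: {essay_one_avg:.1f}")
print(f"Essay Two average: {essay_two_avg:.1f}")
print(f"Essay One wins: {first_wins}, Essay Two wins: {second_wins}, ties: {ties}")
```

With these totals, the essay in the second position scores higher in four of the six rounds regardless of which model wrote it. That may point to a position effect, though six rounds is a very small sample.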

I hope a better (automated) way exists to use LLMs to compare the quality of two essays.

Prompt

Essay One:
[...]

Essay Two:
[...]

-- Scoring --

Factors: 
Accuracy & Reliability (Sources)
Organization & Clarity
Depth
Engagement

Each factor will be given a score out of 10.

You are a very critical grader. Focus on negatives, not positives.

For each factor, briefly discuss each essay individually, then focus deeply on the comparison.

Provide reasoning first, then your score.
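A minimal sketch of how the prompt above could be filled in programmatically for both essay orderings, as in the rounds described in Results. `PROMPT_TEMPLATE` and `build_prompts` are hypothetical names, and sending the prompt to a model is left out.

```python
# Grading prompt as posted above, with placeholders for the two essays.
PROMPT_TEMPLATE = """Essay One:
{essay_one}

Essay Two:
{essay_two}

-- Scoring --

Factors:
Accuracy & Reliability (Sources)
Organization & Clarity
Depth
Engagement

Each factor will be given a score out of 10.

You are a very critical grader. Focus on negatives, not positives.

For each factor, briefly discuss each essay individually, then focus deeply on the comparison.

Provide reasoning first, then your score."""


def build_prompts(essay_a: str, essay_b: str) -> list[str]:
    """Return the grading prompt in both orderings (A first, then B first)."""
    return [
        PROMPT_TEMPLATE.format(essay_one=essay_a, essay_two=essay_b),
        PROMPT_TEMPLATE.format(essay_one=essay_b, essay_two=essay_a),
    ]


# Placeholder strings stand in for the real essay texts.
prompts = build_prompts("<GPT3.5 essay text>", "<GPT4 essay text>")
print(prompts[0][:60])
```

Running each ordering the same number of times keeps either essay from always being graded in the same position.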

abrambailey · 2023-06-02T20:25:26Z

Did you try having GPT4 grade the essays?

Cormanz Jun 3, 2023
Maintainer Author

I didn't, since I didn't have reliable access to GPT4 at the time. It might be worth trying, though.
