-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
python[patch]: evaluators can return primitives #1203
Conversation
baskaryan
commented
Nov 11, 2024
•
edited
Loading
edited
The list return type seems unexpected to me - I wouldn't expect it to make separate keys for each value |
are we able to support list values as feedback? |
Only dict or string it seems. Maybe doing the _ix suffix is preferable. I'm not sure honestly. Probalby less surprising than just logging duplicates to the same key. It just messes with the experiment averages (we'd be averaging over index 1 and over index 2 ) |
feel less strongly about list behavior either way, think main use case is supporting int/float/bool/str |
Maybe let's land with the numeric and string value support but hold off on list behavior? Maybe add support for list[evaluationresultlike] to make the "results" key not necessary |
source_run_id: uuid.UUID, | ||
) -> Union[EvaluationResult, EvaluationResults]: | ||
if isinstance(result, EvaluationResult): | ||
if isinstance(result, (bool, float, int)): | ||
result = {"score": result} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I have four categories that are numbers for some reason (I'm classifying college class levels and they're 100 level, 200 level, 300 level, etc), then I should do str(value)
to explicitly use categorical scores?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or return as {"value": 200}
(something for us to clearly document)
ordering_of_stuff.append("evaluate") | ||
return "good" | ||
|
||
async def eval_list(run, example): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe good to confirm that a list of ints, for example, doesn't work?
@@ -260,32 +259,46 @@ def _coerce_evaluation_results( | |||
cp = results.copy() | |||
cp["results"] = [ | |||
self._coerce_evaluation_result(r, source_run_id=source_run_id) | |||
for r in results["results"] | |||
for i, r in enumerate(results["results"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is i
needed anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth holding off on the categorical dict case for now and just interpreting strings as categories. We need to think about the UX for allowing users to specify configuration for the feedback users are sending with their evaluators