python[patch]: evaluators can return primitives #1203

Merged
merged 6 commits into from
Nov 14, 2024

Conversation

baskaryan
Contributor

@baskaryan baskaryan commented Nov 11, 2024

from langsmith import evaluate

def foo(run, example):
    return 0

def bar(run, example):
    return "long"

# removed, needs to be list of dict
# def baz(run, example):
#     return [0, 0.2, "how are ya"]

def app(inputs):
    return inputs

evaluate(app, data="Sample Dataset 3", evaluators=[foo, bar])

@baskaryan
Contributor Author

how ^ example gets logged

Screenshot 2024-11-11 at 10 38 17 AM

@hinthornw
Collaborator

The list return type seems unexpected to me - I wouldn't expect it to make separate keys for each value

@baskaryan
Contributor Author

> The list return type seems unexpected to me - I wouldn't expect it to make separate keys for each value

are we able to support list values as feedback?

@hinthornw
Collaborator

hinthornw commented Nov 11, 2024

> > The list return type seems unexpected to me - I wouldn't expect it to make separate keys for each value
>
> are we able to support list values as feedback?

Only dict or string, it seems.

Maybe doing the _ix suffix is preferable. I'm not sure, honestly. Probably less surprising than just logging duplicates to the same key.

It just messes with the experiment averages (we'd be averaging over index 1 and over index 2).

@baskaryan
Contributor Author

> > > The list return type seems unexpected to me - I wouldn't expect it to make separate keys for each value
> >
> > are we able to support list values as feedback?
>
> Only dict or string, it seems.
>
> Maybe doing the _ix suffix is preferable. I'm not sure, honestly. Probably less surprising than just logging duplicates to the same key.
>
> It just messes with the experiment averages (we'd be averaging over index 1 and over index 2).

feel less strongly about list behavior either way, think main use case is supporting int/float/bool/str

@hinthornw
Collaborator

Maybe let's land with the numeric and string value support but hold off on list behavior?

Maybe add support for list[evaluationresultlike] to make the "results" key not necessary
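
For illustration, the list[evaluationresultlike] idea could look roughly like the sketch below. `normalize_evaluator_output` and `multi_metric` are hypothetical names made up for this thread, not functions in langsmith. The sketch assumes a bare list of dicts gets wrapped under "results" and that a list of primitives is rejected, which is what the comments above lean toward.

```python
from typing import Any, Dict, List, Union

ResultDict = Dict[str, Any]


def normalize_evaluator_output(
    result: Union[ResultDict, List[ResultDict]],
) -> ResultDict:
    """Hypothetical helper (not langsmith API): accept a dict or a list of dicts."""
    if isinstance(result, dict):
        # Already a single result or an explicit {"results": [...]} wrapper.
        return result
    if isinstance(result, list):
        # Assumption: only non-empty lists of dicts are allowed.
        if not result or not all(isinstance(r, dict) for r in result):
            raise ValueError(
                "Expected a non-empty list of dicts, got: " + repr(result)
            )
        return {"results": result}
    raise TypeError("Unsupported evaluator output type: " + type(result).__name__)


def multi_metric(run, example):
    # Returns several feedback entries without the "results" wrapper.
    return [
        {"key": "precision", "score": 0.8},
        {"key": "tone", "value": "friendly"},
    ]


wrapped = normalize_evaluator_output(multi_metric(None, None))
```

With this shape, something like `[0, 0.2, "how are ya"]` raises a `ValueError` instead of being split into separate keys, so the experiment averages are not affected by index-suffixed feedback.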

@baskaryan baskaryan marked this pull request as ready for review November 11, 2024 19:55
@baskaryan baskaryan changed the title from "rfc: evaluators can return primitives" to "python[patch: evaluators can return primitives" Nov 11, 2024
@baskaryan baskaryan changed the title from "python[patch: evaluators can return primitives" to "python[patch]: evaluators can return primitives" Nov 11, 2024
source_run_id: uuid.UUID,
) -> Union[EvaluationResult, EvaluationResults]:
if isinstance(result, EvaluationResult):
if isinstance(result, (bool, float, int)):
result = {"score": result}
Contributor

So if I have four categories that are numbers for some reason (I'm classifying college class levels and they're 100 level, 200 level, 300 level, etc), then I should do str(value) to explicitly use categorical scores?

Contributor Author

@baskaryan baskaryan Nov 12, 2024

or return as {"value": 200} (something for us to clearly document)
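
A rough sketch of the three options for the class-level example, assuming the behavior discussed in this PR: bare numbers become a "score", bare strings become a categorical "value", and dicts are passed through. `coerce_primitive` is a hypothetical stand-in written for this comment, not the actual langsmith code path.

```python
from typing import Any, Dict, Union

Primitive = Union[bool, int, float, str]


def coerce_primitive(result: Union[Primitive, Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: map a bare evaluator return value to a feedback dict."""
    if isinstance(result, dict):
        # Explicit dicts are left untouched in this sketch.
        return result
    if isinstance(result, (bool, int, float)):
        # Numbers (and bools) are treated as numeric scores.
        return {"score": result}
    if isinstance(result, str):
        # Strings are treated as categorical values.
        return {"value": result}
    raise TypeError("Unsupported evaluator return type: " + type(result).__name__)


def class_level_as_number(run, example):
    return 200  # would be logged as a numeric score and averaged


def class_level_as_str(run, example):
    return str(200)  # explicit categorical value


def class_level_as_dict(run, example):
    return {"value": 200}  # categorical, keeps the int type


as_number = coerce_primitive(class_level_as_number(None, None))
as_str = coerce_primitive(class_level_as_str(None, None))
as_dict = coerce_primitive(class_level_as_dict(None, None))
```

Under these assumptions only the first evaluator ends up as a score, so 100/200/300 would be averaged across the experiment; the other two keep the level as a category.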

ordering_of_stuff.append("evaluate")
return "good"

async def eval_list(run, example):
Contributor

maybe good to confirm that a list of ints, for example, doesn't work?

@@ -260,32 +259,46 @@ def _coerce_evaluation_results(
cp = results.copy()
cp["results"] = [
self._coerce_evaluation_result(r, source_run_id=source_run_id)
for r in results["results"]
for i, r in enumerate(results["results"])
Collaborator

is i needed anymore?

Contributor

@agola11 agola11 left a comment

I think it's worth holding off on the categorical dict case for now and just interpreting strings as categories. We need to think about the UX for allowing users to specify configuration for the feedback users are sending with their evaluators

@baskaryan baskaryan merged commit d8adcde into main Nov 14, 2024
9 checks passed
@baskaryan baskaryan deleted the bagatur/rfc_eval_simple_returns branch November 14, 2024 19:14
4 participants