python[patch]: evaluators can return primitives #1203

Merged (6 commits) on Nov 14, 2024
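
The tests added in this PR exercise the new behavior roughly as in the sketch below (illustrative only: the import path is inferred from the `python/langsmith/evaluation/` package touched here, and `predict`, `client`, and `dev_split` are the fixtures defined in `python/tests/unit_tests/evaluation/test_runner.py`):

```python
from langsmith.evaluation import evaluate  # import path inferred from this PR's module layout

# Evaluators may now return bare primitives or a list of dicts instead of an
# EvaluationResult / dict with an explicit "key".
def eval_float(run, example):
    return 0.2  # coerced to {"score": 0.2} under the evaluator's name

def eval_str(run, example):
    return "good"  # coerced to a categorical result, {"value": "good"}

def eval_list(run, example):
    # A list must contain dicts; it is wrapped as {"results": [...]}.
    return [
        {"score": True, "key": "list_eval_bool"},
        {"score": 1, "key": "list_eval_int"},
    ]

results = evaluate(
    predict,  # target function fixture from test_runner.py
    client=client,
    data=dev_split,
    evaluators=[eval_float, eval_str, eval_list],
)
```

Returning an empty value (`None`, `{}`, `[]`) or a list containing non-dicts raises a `ValueError`, which the new tests below also cover.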
2 changes: 1 addition & 1 deletion python/langsmith/evaluation/_runner.py
@@ -1,4 +1,4 @@
"""V2 Evaluation Interface."""

Check notice on line 1 in python/langsmith/evaluation/_runner.py (GitHub Actions / benchmark)

Benchmark results

create_5_000_run_trees:                        Mean +- std dev: 619 ms +- 51 ms
create_10_000_run_trees:                       Mean +- std dev: 1.21 sec +- 0.06 sec
create_20_000_run_trees:                       Mean +- std dev: 1.20 sec +- 0.06 sec
dumps_class_nested_py_branch_and_leaf_200x400: Mean +- std dev: 704 us +- 11 us
dumps_class_nested_py_leaf_50x100:             Mean +- std dev: 25.7 ms +- 0.2 ms
dumps_class_nested_py_leaf_100x200:            Mean +- std dev: 105 ms +- 3 ms
dumps_dataclass_nested_50x100:                 Mean +- std dev: 25.9 ms +- 0.6 ms
dumps_pydantic_nested_50x100:                  Mean +- std dev: 70.2 ms +- 17.5 ms (warning: std dev is 25% of the mean; result may be unstable)
dumps_pydanticv1_nested_50x100:                Mean +- std dev: 227 ms +- 34 ms (warning: std dev is 15% of the mean; result may be unstable)

Check notice on line 1 in python/langsmith/evaluation/_runner.py (GitHub Actions / benchmark)

Comparison against main

+-----------------------------------+---------+-----------------------+
| Benchmark                         | main    | changes               |
+===================================+=========+=======================+
| dumps_dataclass_nested_50x100     | 25.4 ms | 25.9 ms: 1.02x slower |
+-----------------------------------+---------+-----------------------+
| dumps_class_nested_py_leaf_50x100 | 25.1 ms | 25.7 ms: 1.02x slower |
+-----------------------------------+---------+-----------------------+
| Geometric mean                    | (ref)   | 1.01x slower          |
+-----------------------------------+---------+-----------------------+

Benchmark hidden because not significant (7): create_5_000_run_trees, create_10_000_run_trees, create_20_000_run_trees, dumps_class_nested_py_branch_and_leaf_200x400, dumps_class_nested_py_leaf_100x200, dumps_pydanticv1_nested_50x100, dumps_pydantic_nested_50x100

from __future__ import annotations

@@ -1848,7 +1848,7 @@
try:
tree = ast.parse(python_code)
function_def = tree.body[0]
if not isinstance(function_def, ast.FunctionDef):
if not isinstance(function_def, (ast.FunctionDef, ast.AsyncFunctionDef)):
return []

variables = {}
41 changes: 27 additions & 14 deletions python/langsmith/evaluation/evaluator.py
@@ -236,9 +236,8 @@ def _coerce_evaluation_result(
"Expected an EvaluationResult object, or dict with a metric"
f" 'key' and optional 'score'; got empty result: {result}"
)
if "key" not in result:
if allow_no_key:
result["key"] = self._name
if "key" not in result and allow_no_key:
result["key"] = self._name
if all(k not in result for k in ("score", "value", "comment")):
raise ValueError(
"Expected an EvaluationResult object, or dict with a metric"
@@ -265,27 +264,41 @@ def _coerce_evaluation_results(
return EvaluationResults(**cp)

return self._coerce_evaluation_result(
cast(dict, results), allow_no_key=True, source_run_id=source_run_id
cast(dict, results), source_run_id=source_run_id, allow_no_key=True
)

def _format_result(
self,
result: Union[EvaluationResult, EvaluationResults, dict],
result: Union[
EvaluationResult, EvaluationResults, dict, str, int, bool, float, list
],
source_run_id: uuid.UUID,
) -> Union[EvaluationResult, EvaluationResults]:
if isinstance(result, EvaluationResult):
if isinstance(result, (bool, float, int)):
result = {"score": result}
Contributor:
So if I have four categories that are numbers for some reason (I'm classifying college class levels and they're 100 level, 200 level, 300 level, etc), then I should do str(value) to explicitly use categorical scores?

baskaryan (Contributor Author), Nov 12, 2024:
or return as {"value": 200} (something for us to clearly document)
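
To make the two options above concrete (an illustrative sketch, not part of the diff; the `class_level_*` evaluator names are hypothetical):

```python
def class_level_as_str(run, example):
    # Option 1: stringify the numeric label so the bare return is treated as categorical.
    return str(200)  # coerced to {"value": "200"}

def class_level_as_value(run, example):
    # Option 2: return a dict with "value" so the number is not coerced into a score;
    # the metric "key" defaults to the evaluator's name when omitted.
    return {"value": 200}
```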

elif not result:
raise ValueError(
f"Expected a non-empty dict, str, bool, int, float, list, "
f"EvaluationResult, or EvaluationResults. Got {result}"
)
elif isinstance(result, EvaluationResult):
if not result.source_run_id:
result.source_run_id = source_run_id
return result
if not result:
raise ValueError(
"Expected an EvaluationResult or EvaluationResults object, or a"
" dict with key and one of score or value, EvaluationResults,"
f" got {result}"
)
if not isinstance(result, dict):
elif isinstance(result, list):
if not all(isinstance(x, dict) for x in result):
raise ValueError(
f"Expected a list of dicts or EvaluationResult. Received {result}."
)
result = {"results": result} # type: ignore[misc]
elif isinstance(result, str):
result = {"value": result}
elif isinstance(result, dict):
pass
else:
raise ValueError(
f"Expected a dict, EvaluationResult, or EvaluationResults, got {result}"
f"Expected a dict, str, bool, int, float, list, EvaluationResult, or "
f"EvaluationResults. Got {result}"
)

return self._coerce_evaluation_results(result, source_run_id)
33 changes: 28 additions & 5 deletions python/tests/unit_tests/evaluation/test_evaluator.py
@@ -321,8 +321,8 @@ async def sample_evaluator(
assert result["results"][1].score == 2.0


@pytest.mark.parametrize("response", [None, {}, {"accuracy": 5}])
async def test_evaluator_raises_for_null_ouput(response: Any):
@pytest.mark.parametrize("response", [None, {}, []])
async def test_evaluator_raises_for_null_output(response: Any):
@run_evaluator # type: ignore
def bad_evaluator(run: schemas.Run, example: schemas.Example):
return response
@@ -334,13 +334,36 @@ async def abad_evaluator(run: schemas.Run, example: schemas.Example):
fake_run = MagicMock()
fake_example = MagicMock()

with pytest.raises(ValueError, match="Expected an EvaluationResult "):
with pytest.raises(ValueError, match="Expected a non-empty "):
bad_evaluator.evaluate_run(fake_run, fake_example)

with pytest.raises(ValueError, match="Expected an EvaluationResult "):
with pytest.raises(ValueError, match="Expected a non-empty "):
await bad_evaluator.aevaluate_run(fake_run, fake_example)

with pytest.raises(ValueError, match="Expected an EvaluationResult "):
with pytest.raises(ValueError, match="Expected a non-empty "):
await abad_evaluator.aevaluate_run(fake_run, fake_example)


@pytest.mark.parametrize("response", [[5], {"accuracy": 5}])
async def test_evaluator_raises_for_bad_output(response: Any):
@run_evaluator # type: ignore
def bad_evaluator(run: schemas.Run, example: schemas.Example):
return response

@run_evaluator # type: ignore
async def abad_evaluator(run: schemas.Run, example: schemas.Example):
return response

fake_run = MagicMock()
fake_example = MagicMock()

with pytest.raises(ValueError, match="Expected"):
bad_evaluator.evaluate_run(fake_run, fake_example)

with pytest.raises(ValueError, match="Expected"):
await bad_evaluator.aevaluate_run(fake_run, fake_example)

with pytest.raises(ValueError, match="Expected"):
await abad_evaluator.aevaluate_run(fake_run, fake_example)


78 changes: 70 additions & 8 deletions python/tests/unit_tests/evaluation/test_runner.py
@@ -196,11 +196,26 @@ def score_value_first(run, example):
ordering_of_stuff.append("evaluate")
return {"score": 0.3}

def eval_float(run, example):
ordering_of_stuff.append("evaluate")
return 0.2

def eval_str(run, example):
ordering_of_stuff.append("evaluate")
return "good"

def eval_list(run, example):
ordering_of_stuff.append("evaluate")
return [
{"score": True, "key": "list_eval_bool"},
{"score": 1, "key": "list_eval_int"},
]

results = evaluate(
predict,
client=client,
data=dev_split,
evaluators=[score_value_first],
evaluators=[score_value_first, eval_float, eval_str, eval_list],
num_repetitions=NUM_REPETITIONS,
blocking=blocking,
)
@@ -231,14 +246,14 @@ def score_value_first(run, example):
assert fake_request.created_session
_wait_until(lambda: fake_request.runs)
N_PREDS = SPLIT_SIZE * NUM_REPETITIONS
_wait_until(lambda: len(ordering_of_stuff) == N_PREDS * 2)
_wait_until(lambda: len(ordering_of_stuff) == N_PREDS * 5)
_wait_until(lambda: slow_index is not None)
# Want it to be interleaved
assert ordering_of_stuff != ["predict"] * N_PREDS + ["evaluate"] * N_PREDS
assert ordering_of_stuff[:N_PREDS] != ["predict"] * N_PREDS

# It's delayed, so it'll be the penultimate event
# Will run all other preds and evals, then this, then the last eval
assert slow_index == (N_PREDS * 2) - 2
assert slow_index == (N_PREDS - 1) * 5

def score_value(run, example):
return {"score": 0.7}
@@ -260,6 +275,22 @@ def score_value(run, example):
assert r["run"].reference_example_id in dev_xample_ids
assert not fake_request.should_fail

# Returning list of non-dicts not supported.
def bad_eval_list(run, example):
ordering_of_stuff.append("evaluate")
return ["foo", 1]

results = evaluate(
predict,
client=client,
data=dev_split,
evaluators=[bad_eval_list],
num_repetitions=NUM_REPETITIONS,
blocking=blocking,
)
for r in results:
assert r["evaluation_results"]["results"][0].extra == {"error": True}


def test_evaluate_raises_for_async():
async def my_func(inputs: dict):
@@ -366,11 +397,26 @@ async def score_value_first(run, example):
ordering_of_stuff.append("evaluate")
return {"score": 0.3}

async def eval_float(run, example):
ordering_of_stuff.append("evaluate")
return 0.2

async def eval_str(run, example):
ordering_of_stuff.append("evaluate")
return "good"

async def eval_list(run, example):
Contributor:
maybe good to confirm that a list of ints, for example, doesn't work?

ordering_of_stuff.append("evaluate")
return [
{"score": True, "key": "list_eval_bool"},
{"score": 1, "key": "list_eval_int"},
]

results = await aevaluate(
predict,
client=client,
data=dev_split,
evaluators=[score_value_first],
evaluators=[score_value_first, eval_float, eval_str, eval_list],
num_repetitions=NUM_REPETITIONS,
blocking=blocking,
)
@@ -406,14 +452,14 @@ async def score_value_first(run, example):
assert fake_request.created_session
_wait_until(lambda: fake_request.runs)
N_PREDS = SPLIT_SIZE * NUM_REPETITIONS
_wait_until(lambda: len(ordering_of_stuff) == N_PREDS * 2)
_wait_until(lambda: len(ordering_of_stuff) == N_PREDS * 5)
_wait_until(lambda: slow_index is not None)
# Want it to be interleaved
assert ordering_of_stuff != ["predict"] * N_PREDS + ["evaluate"] * N_PREDS
assert ordering_of_stuff[:N_PREDS] != ["predict"] * N_PREDS
assert slow_index is not None
# It's delayed, so it'll be the penultimate event
# Will run all other preds and evals, then this, then the last eval
assert slow_index == (N_PREDS * 2) - 2
assert slow_index == (N_PREDS - 1) * 5

assert fake_request.created_session["name"]

@@ -431,3 +477,19 @@ async def score_value(run, example):
assert r["evaluation_results"]["results"][0].score == 0.7
assert r["run"].reference_example_id in dev_xample_ids
assert not fake_request.should_fail
# Returning list of non-dicts not supported.

async def bad_eval_list(run, example):
ordering_of_stuff.append("evaluate")
return ["foo", 1]

results = await aevaluate(
predict,
client=client,
data=dev_split,
evaluators=[bad_eval_list],
num_repetitions=NUM_REPETITIONS,
blocking=blocking,
)
async for r in results:
assert r["evaluation_results"]["results"][0].extra == {"error": True}