python[patch]: accept simple evaluators #1200

Open · wants to merge 9 commits into base: main
Changes from 5 commits
1 change: 1 addition & 0 deletions python/langsmith/evaluation/_runner.py
@@ -1,4 +1,4 @@
"""V2 Evaluation Interface."""

GitHub Actions / benchmark (notice on line 1 in python/langsmith/evaluation/_runner.py)

Benchmark results:

    create_5_000_run_trees:                        613 ms +- 39 ms
    create_10_000_run_trees:                       1.19 sec +- 0.05 sec
    create_20_000_run_trees:                       1.18 sec +- 0.05 sec
    dumps_class_nested_py_branch_and_leaf_200x400: 706 us +- 8 us
    dumps_class_nested_py_leaf_50x100:             25.2 ms +- 0.3 ms
    dumps_class_nested_py_leaf_100x200:            105 ms +- 2 ms
    dumps_dataclass_nested_50x100:                 25.8 ms +- 0.3 ms
    dumps_pydantic_nested_50x100:                  66.3 ms +- 15.7 ms (warning: may be unstable, std dev is 24% of the mean)
    dumps_pydanticv1_nested_50x100:                220 ms +- 30 ms (warning: may be unstable, std dev is 13% of the mean)

Comparison against main:

    Benchmark                                      | main   | changes
    -----------------------------------------------+--------+----------------------
    dumps_class_nested_py_branch_and_leaf_200x400  | 703 us | 706 us: 1.00x slower
    Geometric mean                                 | (ref)  | 1.00x faster

    Benchmark hidden because not significant (8): dumps_pydantic_nested_50x100, dumps_class_nested_py_leaf_100x200, create_20_000_run_trees, dumps_dataclass_nested_50x100, dumps_class_nested_py_leaf_50x100, create_10_000_run_trees, create_5_000_run_trees, dumps_pydanticv1_nested_50x100

from __future__ import annotations

@@ -86,6 +86,7 @@
[schemas.Run, Optional[schemas.Example]],
Union[EvaluationResult, EvaluationResults],
],
Callable[..., Union[dict, EvaluationResults, EvaluationResult]],
Collaborator: Could we update the docstring for evaluate() and aevaluate() to have examples or link to a docs page that shows the valid arguments? (A sketch of the accepted signatures follows this hunk.)

]
AEVALUATOR_T = Union[
Callable[
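For reference, a minimal sketch of the two evaluator styles this union now covers, based on the signatures exercised in the tests further down (the function names and output fields here are illustrative, not documented API):

from typing import Optional

from langsmith.schemas import Example, Run


# Traditional evaluator: receives the Run and the optional reference Example.
def correctness(run: Run, example: Optional[Example]) -> dict:
    outputs = run.outputs or {}
    reference = (example.outputs or {}) if example else {}
    return {"score": outputs.get("answer") == reference.get("answer")}


# "Simple" evaluator: receives the example inputs, the run outputs, and the
# reference (example) outputs as plain dicts, in that order.
def correctness_simple(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    return {"score": outputs.get("answer") == reference_outputs.get("answer")}

Both styles would be passed the same way, e.g. evaluators=[correctness, correctness_simple].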
73 changes: 72 additions & 1 deletion python/langsmith/evaluation/evaluator.py
@@ -17,7 +17,7 @@
cast,
)

from typing_extensions import TypedDict
from typing_extensions import TypedDict, get_type_hints

try:
from pydantic.v1 import ( # type: ignore[import]
@@ -194,6 +194,10 @@ def __init__(
func (Callable): A function that takes a `Run` and an optional `Example` as
arguments, and returns a dict or `ComparisonEvaluationResult`.
"""
func = _normalize_evaluator_func(func)
if afunc:
afunc = _normalize_evaluator_func(afunc) # type: ignore[assignment]

wraps(func)(self)
from langsmith import run_helpers # type: ignore

@@ -632,3 +636,70 @@ def comparison_evaluator(
) -> DynamicComparisonRunEvaluator:
"""Create a comaprison evaluator from a function."""
return DynamicComparisonRunEvaluator(func)


def _normalize_evaluator_func(
Contributor: Might be nice to add a couple of unit tests on this to make it obvious it's working. (A sketch of such a test follows this hunk.)

func: Callable,
) -> Union[
Callable[[Run, Optional[Example]], _RUNNABLE_OUTPUT],
Callable[[Run, Optional[Example]], Awaitable[_RUNNABLE_OUTPUT]],
]:
# for backwards compatibility, if args are untyped we assume they correspond to
# Run and Example:
if not (type_hints := get_type_hints(func)):
Contributor: Do we want to add debug logs letting you know what function type is being used? Might be helpful, since we tell people to enable debug logs for debugging issues in the SDK.

Contributor: Shouldn't we check the number of args here? Traditional evaluators have run and example, whereas the simple evaluators take 3 args.

return func
elif {Run, Example, Optional[Example]}.intersection(type_hints.values()):
return func
else:
sig = inspect.signature(func)
num_positional = len(
[
p
for p in sig.parameters.values()
if p.kind in (p.POSITIONAL_OR_KEYWORD, p.POSITIONAL_ONLY)
]
)
has_positional_var = any(
p.kind == p.VAR_POSITIONAL for p in sig.parameters.values()
)
if not (
num_positional in (2, 3) or (num_positional <= 3 and has_positional_var)
):
msg = (
"Invalid evaluator function. Expected to take either 2 or 3 positional "
"arguments. Please see "
"https://docs.smith.langchain.com/evaluation/how_to_guides/evaluation/evaluate_llm_application#use-custom-evaluators" # noqa: E501
)
raise ValueError(msg)
Contributor: Seems like this check on arg length should be moved up.


if inspect.iscoroutinefunction(func):

async def awrapper(run: Run, example: Example) -> _RUNNABLE_OUTPUT:
args = (example.inputs, run.outputs or {}, example.outputs or {})
if has_positional_var:
return await func(*args)
else:
return await func(*args[:num_positional])

awrapper.__name__ = (
getattr(func, "__name__")
if hasattr(func, "__name__")
else awrapper.__name__
)
return awrapper # type: ignore[return-value]

else:

def wrapper(run: Run, example: Example) -> _RUNNABLE_OUTPUT:
args = (example.inputs, run.outputs or {}, example.outputs or {})
if has_positional_var:
return func(*args)
else:
return func(*args[:num_positional])

wrapper.__name__ = (
getattr(func, "__name__")
if hasattr(func, "__name__")
else wrapper.__name__
)
return wrapper # type: ignore[return-value]
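

A rough sketch of the unit tests suggested above; the SimpleNamespace stand-ins are illustrative shortcuts that only rely on the wrapper reading .inputs and .outputs, not real Run/Example objects:

from types import SimpleNamespace

from langsmith.evaluation.evaluator import _normalize_evaluator_func


def test_normalize_passes_through_untyped_evaluator():
    def traditional(run, example):
        return {"score": 1}

    # No type hints: assumed to be a (run, example) evaluator and returned unchanged.
    assert _normalize_evaluator_func(traditional) is traditional


def test_normalize_unpacks_simple_evaluator():
    def simple(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
        return {"score": outputs["answer"] == reference_outputs["answer"]}

    wrapped = _normalize_evaluator_func(simple)
    run = SimpleNamespace(outputs={"answer": "4"})
    example = SimpleNamespace(inputs={"question": "2 + 2"}, outputs={"answer": "4"})
    # The wrapper re-packs (example.inputs, run.outputs, example.outputs)
    # into the simple evaluator's positional arguments.
    assert wrapped(run, example) == {"score": True}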
47 changes: 39 additions & 8 deletions python/tests/unit_tests/evaluation/test_runner.py
@@ -184,11 +184,26 @@ def score_value_first(run, example):
ordering_of_stuff.append("evaluate")
return {"score": 0.3}

def score_unpacked_inputs_outputs(inputs: dict, outputs: dict):
ordering_of_stuff.append("evaluate")
return {"score": outputs["output"]}

def score_unpacked_inputs_outputs_reference(
inputs: dict, outputs: dict, reference_outputs: dict
):
ordering_of_stuff.append("evaluate")
return {"score": reference_outputs["answer"]}

evaluators = [
score_value_first,
score_unpacked_inputs_outputs,
score_unpacked_inputs_outputs_reference,
]
results = evaluate(
predict,
client=client,
data=dev_split,
evaluators=[score_value_first],
evaluators=evaluators,
num_repetitions=NUM_REPETITIONS,
blocking=blocking,
)
@@ -219,14 +234,14 @@ def score_value_first(run, example):
assert fake_request.created_session
_wait_until(lambda: fake_request.runs)
N_PREDS = SPLIT_SIZE * NUM_REPETITIONS
_wait_until(lambda: len(ordering_of_stuff) == N_PREDS * 2)
_wait_until(lambda: len(ordering_of_stuff) == (N_PREDS * (len(evaluators) + 1)))
_wait_until(lambda: slow_index is not None)
# Want it to be interleaved
assert ordering_of_stuff != ["predict"] * N_PREDS + ["evaluate"] * N_PREDS
assert ordering_of_stuff[:N_PREDS] != ["predict"] * N_PREDS

# It's delayed, so it'll be the penultimate event
# Will run all other preds and evals, then this, then the last eval
assert slow_index == (N_PREDS * 2) - 2
assert slow_index == (len(evaluators) + 1) * (N_PREDS - 1)

def score_value(run, example):
return {"score": 0.7}
@@ -347,11 +362,27 @@ async def score_value_first(run, example):
ordering_of_stuff.append("evaluate")
return {"score": 0.3}

async def score_unpacked_inputs_outputs(inputs: dict, outputs: dict):
ordering_of_stuff.append("evaluate")
return {"score": outputs["output"]}

async def score_unpacked_inputs_outputs_reference(
inputs: dict, outputs: dict, reference_outputs: dict
):
ordering_of_stuff.append("evaluate")
return {"score": reference_outputs["answer"]}

evaluators = [
score_value_first,
score_unpacked_inputs_outputs,
score_unpacked_inputs_outputs_reference,
]

results = await aevaluate(
predict,
client=client,
data=dev_split,
evaluators=[score_value_first],
evaluators=evaluators,
num_repetitions=NUM_REPETITIONS,
blocking=blocking,
)
@@ -387,14 +418,14 @@ async def score_value_first(run, example):
assert fake_request.created_session
_wait_until(lambda: fake_request.runs)
N_PREDS = SPLIT_SIZE * NUM_REPETITIONS
_wait_until(lambda: len(ordering_of_stuff) == N_PREDS * 2)
_wait_until(lambda: len(ordering_of_stuff) == N_PREDS * (len(evaluators) + 1))
_wait_until(lambda: slow_index is not None)
# Want it to be interleaved
assert ordering_of_stuff != ["predict"] * N_PREDS + ["evaluate"] * N_PREDS
assert ordering_of_stuff[:N_PREDS] != ["predict"] * N_PREDS
assert slow_index is not None
# It's delayed, so it'll be the penultimate event
# Will run all other preds and evals, then this, then the last eval
assert slow_index == (N_PREDS * 2) - 2
assert slow_index == (N_PREDS - 1) * (len(evaluators) + 1)

assert fake_request.created_session["name"]

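Putting it together, a sketch of how a user could pass a simple evaluator to evaluate() once this lands; the dataset name and target function below are placeholders, not part of this PR:

from langsmith import Client
from langsmith.evaluation import evaluate


def target(inputs: dict) -> dict:
    # Placeholder application under test.
    return {"answer": inputs["question"]}


def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    return {"score": outputs["answer"] == reference_outputs["answer"]}


results = evaluate(
    target,
    data="my-dataset",  # placeholder dataset name
    evaluators=[exact_match],
    client=Client(),
)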