feat!: version 13 - dataset evaluators #9642
Draft
mikeldking wants to merge 51 commits into main from version-13
+27,423 −1,890
RogerHYang (Contributor) requested changes on Sep 25, 2025 and left a comment:
blocking feature branch
001f109 to b65ea42
An error occurred while trying to automatically change base from feat/version-12 to main (September 29, 2025 18:13)
fc49ed1 to 2b19c56
f4ab1f0 to 2ef94fd
…s useful (#10187)
* feat(evaluators): provide a useful correctness pre-built evaluator
* feat(evaluators): provide a useful correctness pre-built evaluator
* simplify

* evaluator prompt validation
* cursor tests
* clean
* condense
* test
* clean
* clean
* test
* parse pydantic errors
* clean
* validate mutations
* fix tests
* validate choices
* test with form
* test
* type check
* clean

* include only dataset-specific evaluators in playground eval selector
* fix dataset page tab selection
* add aria label to dialog
* add annotation names to playground select
* handle long annotation names
* separate components for DatasetEvaluatorSelect and PlaygroundEvaluatorSelect
* remove extra opacity css var
* updates to Menu
* updates to evaluator menus
* fix menu item flicker
* wip: enable mapping evaluator from playground
* formatting

* add eval outputs to playground output cell
* add evaluation details popover & trace link
* include evals in output for non-streaming playground runs
* fix unnecessary truncation of eval name
* handle evaluations on error
* fix evaluation name
* rerun CI
* prevent losing example data when handling tool chunk
---------
Co-authored-by: Alexander Song

* feat: Create distinct slideovers for evaluator use cases
* fix: manually update updated_at when creating llm_evaluator
* fix: global change to combobox, opens submenu on enter
---------
Co-authored-by: Rick Steele

* Spike out builtin evaluator interfaces
* Get builtin evaluator if it exists
* Refine data model
* Simplify models
* Implement literal/path mapping logic
* Wire up builtin evaluators
* Persist single-run evaluations as SpanAnnotations
* Update gql schema and run relay compiler
* Fix evaluation over playground dataset run
* ruff
* Fix queries w.r.t BuiltInEvaluator
* Add built in evaluators to dataset evaluators query
* Add xfail to dataset evaluator test
* Ignore missing type stubs
* fix evaluators over single chat
* fix ts ci
---------
Co-authored-by: Tony Powell
Co-authored-by: Alexander Song

* wip: enable unassigning a dataset evaluator
* update cached evaluator data upon assignment/unassignment
* add confirmation dialog
* wire up evaluator unlink with optional delete
* remove row selectability
* add comment
* use alert banner instead of toast for errors
* explicitly close dialog on successful delete/unlink

* fix evaluator config dialog header overflow
* fix dataset select overflow
* styling
* dataset select styling

* feat: Add builtin evaluator support to crosswalk table
* Fix migration and update gql schema
* Fix relationship definition
* feat: Add prebuilt evaluators to template submenu
* Tweak language
* feat: Support input mapping code evaluators
* Improve dataset messaging in evaluator form
* update default evaluator template
* Add DatasetExampleSelect component; also makes combobox and dataset select more responsive
* Allow users to edit evaluator input preview
* Fix db constraints for input mapping
* Wire up input mapping end to end
* Fix ruff
* use fastapi instead of starlette import
* Remove xfail and clean up input_config handling
* Ruff
* Verify evaluator id existence for type checker
* Build gql schema and run relay compiler
* Pull output from input-mapped inputs
* Ensure input config is stored as JSON
* Add minWidth prop to Select
* Fix evaluator config dialog header truncation
* Use both unique constraint and partial index
* Add builtin evaluators to dataloader
* Call lower() after str conversion
* Rename evaluator for simplicity
* Remove explicit constraint name
* Update variable name
* Address PR feedback
* Change column name from input_config to input_mapping
* Update tests and other input_config references
* Make mypy happy
---------
Co-authored-by: Dustin Ngo
6da124a to 5f12fd4
* remove global evaluator create flow
* remove type dependency

* refactor table cell helpers to be shared between experiments table & playground
* remove unused code
* refactor annotations list to shared component
* add data test id

* add wrapper for GridList
* update playground eval select to use GridList
* add edit evaluator button to menu
* add menu section header
* fix selection & hover/focus states
* update stories
* remove unused code
* update GridList section styles
* pairing with rick
* restore dataset id arg
* remove comment
* address cursor PR comments
* update dataset eval menu to use MenuItem
* address feedback

* feat: prompt template apply query
* feat: apply prompt template
* refactor for re-use
* add more validation
* fix tests
* remove tool call parts
* remove tool result application
* Update src/phoenix/server/api/queries.py Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>
* cleanup
---------
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>

* add preview query
* cleanup
* cleanup
* cleanup
Labels
* feature branch: a feature branch that consolidates multiple features into a single commit on main
* size:XS: this PR changes 0-9 lines, ignoring generated files
This is the feature branch for the upcoming version 13.