feat!: version 13 - dataset evaluators #9642

mikeldking · 2025-09-25T20:54:32Z

This is the feature branch for the upcoming version 13.

pkg-pr-new · 2025-09-25T20:57:43Z

npm i https://pkg.pr.new/Arize-ai/phoenix/@arizeai/phoenix-client@9642

npm i https://pkg.pr.new/Arize-ai/phoenix/@arizeai/phoenix-mcp@9642

commit: 7a0c5f7

RogerHYang

blocking feature branch

* evaluator prompt validation
* cursor tests
* clean
* condense
* test
* clean
* clean
* test
* parse pydantic errors
* clean
* validate mutations
* fix tests
* validate choices
* test with form
* test
* type check
* clean

…10253)

* include only dataset-specific evaluators in playground eval selector
* fix dataset page tab selection
* add aria label to dialog
* add annotation names to playground select
* handle long annotation names
* separate components for DatasetEvaluatorSelect and PlaygroundEvaluatorSelect
* remove extra opacity css var
* updates to Menu
* updates to evaluator menus
* fix menu item flicker
* wip: enable mapping evaluator from playground
* formatting

…10292)

* add eval outputs to playground output cell
* add evaluation details popover & trace link
* include evals in output for non-streaming playground runs
* fix unnecessary truncation of eval name
* handle evaluations on error
* fix evaluation name
* rerun CI
* prevent losing example data when handling tool chunk
---------
Co-authored-by: Alexander Song <[email protected]>

* feat: Create distinct slideovers for evaluator use cases
* fix: manually update updated_at when creating llm_evaluator
* fix: global change to combobox, opens submenu on enter
---------
Co-authored-by: Rick Steele <[email protected]>

* Spike out builtin evaluator interfaces
* Get builtin evaluator if it exists
* Refine data model
* Simplify models
* Implement literal/path mapping logic
* Wire up builtin evaluators
* Persist single-run evaluations as SpanAnnotations
* Update gql schema and run relay compiler
* Fix evaluation over playground dataset run
* ruff
* Fix queries w.r.t BuiltInEvaluator
* Add built in evaluators to dataset evaluators query
* Add xfail to dataset evaluator test
* Ignore missing type stubs
* fix evaluators over single chat
* fix ts ci
---------
Co-authored-by: Tony Powell <[email protected]>
Co-authored-by: Alexander Song <[email protected]>

* wip: enable unassigning a dataset evaluator
* update cached evaluator data upon assignment/unassignment
* add confirmation dialog
* wire up evaluator unlink with optional delete
* remove row selectability
* add comment
* use alert banner instead of toast for errors
* explicitly close dialog on successful delete/unlink

* feat: Add builtin evaluator support to crosswalk table
* Fix migration and update gql schema
* Fix relationship definition
* feat: Add prebuilt evaluators to template submenu
* Tweak language
* feat: Support input mapping code evaluators
* Improve dataset messaging in evaluator form
* update default evaluator template
* Add DatasetExampleSelect component
  Also makes combobox and dataset select more responsive
* Allow users to edit evaluator input preview
* Fix db constraints for input mapping
* Wire up input mapping end to end
* Fix ruff
* use fastapi instead of starlette import
* Remove xfail and clean up input_config handling
* Ruff
* Verify evaluator id existence for type checker
* Build gql schema and run relay compiler
* Pull output from input-mapped inputs
* Ensure input config is stored as JSON
* Add minWidth prop to Select
* Fix evaluator config dialog header truncation
* Use both unique constraint and partial index
* Add builtin evaluators to dataloader
* Call lower() after str conversion
* Rename evaluator for simplicity
* Remove explicit constraint name
* Update variable name
* Address PR feedback
* Change column name from input_config to input_mapping
* Update tests and other input_config references
* Make mypy happy
---------
Co-authored-by: Dustin Ngo <[email protected]>

* add wrapper for GridList
* update playground eval select to use GridList
* add edit evaluator button to menu
* add menu section header
* fix selection & hover/focus states
* update stories
* remove unused code
* update GridList section styles
* pairing with rick
* restore dataset id arg
* remove comment
* address cursor PR comments
* update dataset eval menu to use MenuItem
* address feedback

* feat: prompt template apply query
* feat: apply prompt template
* refactor for re-use
* add more validation
* fix tests
* remove tool call parts
* remove tool result application
* Update src/phoenix/server/api/queries.py
  Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>
* cleanup
---------
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>

mikeldking requested review from a team as code owners September 25, 2025 20:54

github-project-automation bot added this to phoenix Sep 25, 2025

github-project-automation bot moved this to 📘 Todo in phoenix Sep 25, 2025

dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Sep 25, 2025

mikeldking changed the base branch from main to feat/version-12 September 25, 2025 20:56

dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Sep 25, 2025

mikeldking changed the title ~~version 13~~ feat!: version 13 - dataset evaluators Sep 25, 2025

mikeldking added the feature branch a feature branch that consolidates multiple features into a single commit on main label Sep 25, 2025

mikeldking marked this pull request as draft September 25, 2025 21:22

RogerHYang requested changes Sep 25, 2025

github-project-automation bot moved this from 📘 Todo to 🔍. Needs Review in phoenix Sep 25, 2025

axiomofjoy force-pushed the feat/version-12 branch from 001f109 to b65ea42 Compare September 29, 2025 17:18

Base automatically changed from feat/version-12 to main September 29, 2025 18:13

An error occurred while trying to automatically change base from feat/version-12 to main September 29, 2025 18:13

mikeldking force-pushed the version-13 branch from 3ed12e6 to 4ad9dc8 Compare October 2, 2025 08:01

RogerHYang force-pushed the version-13 branch from 4ad9dc8 to 4da307b Compare October 6, 2025 17:24

RogerHYang removed this from phoenix Oct 6, 2025

RogerHYang force-pushed the version-13 branch from 5afdccb to e0e6709 Compare October 8, 2025 06:19

RogerHYang force-pushed the version-13 branch 4 times, most recently from fc49ed1 to 2b19c56 Compare October 24, 2025 15:47

RogerHYang force-pushed the version-13 branch 3 times, most recently from f4ab1f0 to 2ef94fd Compare October 29, 2025 15:31

mikeldking closed this Nov 4, 2025

mikeldking reopened this Nov 4, 2025

cephalization and others added 17 commits November 30, 2025 14:37

feat: Add evaluators table to dataset evaluators page (#10157)

bb02afa

fix: Fix import error on evaluator page (#10185)

845bf42

feat(evaluators): load in a default template for the evaluator that i…

44f79ea

…s useful (#10187)
* feat(evaluators): provide a useful correctness pre-built evaluator
* feat(evaluators): provide a useful correctness pre-built evaluator
* simplify

fix: eslint errors

149b910

ci: add ci for 12 (#10196)

be4dedc

feat: persist tools with eval (#10220)

a11634f

feat: Refactor evaluator form for usage in create and edit workflows (#…

fc0b1cd

…10253)

only include dataset-specific evaluators in playground eval select (#…

bae1df4

…10292)

fix(evaluators): clean up evaluators rebase

468ca87

fix: fix evaluator config dialog layout (#10366)

1b5e9f8

* fix evaluator config dialog header overflow
* fix dataset select overflow
* styling
* dataset select styling

RogerHYang force-pushed the version-13 branch from 6da124a to 5f12fd4 Compare November 30, 2025 22:39

RogerHYang and others added 12 commits December 1, 2025 10:13

feat: data model for custom providers (#10319)

c176dc4

refactor: add preview toggle for evaluator template (#10432)

3cd5425

fix some react compiler issues (#10443)

ce990a6

fix: fix regex to match composite IDs (#10459)

98a1086

fix: remove the ability to create a global evaluator" (#10461)

4eeaf07

* remove global evaluator create flow
* remove type dependency

refactor: playground output components (#10458)

6252f29

* refactor table cell helpers to be shared between experiments table & playground
* remove unused code
* refactor annotations list to shared component
* add data test id

feat: is_latest flag on a version (#10464)

2daee4a

add optimization direction input to evaluator form (#10471)

947ff5a

feat: show latest instead of version (#10468)

ae09d9f

feat: prompt template preview (#10453)

ae9e1aa

* add preview query
* cleanup
* cleanup
* cleanup

7 participants
