Each case runs your flow independently with its defined inputs, then applies every evaluator to the outputs. Results are surfaced both as an at-a-glance summary at the test level and as detailed breakdowns per case.
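The per-case behavior above can be sketched as a small loop: run the flow once with the case inputs, then apply each evaluator to the output. This is a minimal illustration, not the product's real API; `run_flow` and the evaluator callables are assumptions.

```python
from typing import Any, Callable

def evaluate_case(run_flow: Callable[[dict], dict],
                  inputs: dict,
                  evaluators: dict[str, Callable[[dict], bool]]) -> dict[str, bool]:
    """Run the flow once with the case inputs, then apply every evaluator to the output."""
    output = run_flow(inputs)  # one independent flow run per case
    return {name: check(output) for name, check in evaluators.items()}

# Example with a trivial stand-in flow and one evaluator:
results = evaluate_case(
    lambda inp: {"answer": inp["x"] * 2},
    {"x": 21},
    {"is_42": lambda out: out["answer"] == 42},
)
print(results)  # {'is_42': True}
```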
By default, evaluations run against the current version of your flow. Use the version selector to target a specific saved version instead. Once a version is selected, you can:
Run all tests at once — click Run tests from the tests list for full regression testing.
Run all cases in a test — use the button on a test row or click Run all from the test detail view.
Run individual cases — use the button on a case row or the Run button inside a case.
You can cancel running evaluations at any time with the Cancel all button or the stop button on individual cases.
Evaluation results are not versioned. Switching to a different flow version will mark all existing results as outdated, and cases may fail to run if the version changes the flow’s inputs.
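The invalidation rule above can be sketched as follows: switching versions marks every stored result as outdated, and a case stays runnable only if its recorded inputs still match the new version's input schema. This is a hypothetical sketch of the described behavior; the field names and the keys-match check are assumptions.

```python
def switch_version(results: list[dict], new_inputs: set[str]) -> list[dict]:
    """Mark all existing results outdated and flag cases whose inputs no longer fit."""
    for r in results:
        r["outdated"] = True  # results never track the version they ran against
        # a case can only re-run if its saved inputs match the new version's schema
        r["runnable"] = set(r["inputs"]) == new_inputs
    return results

cases = [{"inputs": {"x": 1}}, {"inputs": {"x": 1, "y": 2}}]
updated = switch_version(cases, {"x", "y"})
print([c["runnable"] for c in updated])  # [False, True]
```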
The summary bar at the top of each test shows the pass rate, average latency, last run time, and token cost. Its background color reflects the overall state: gray (running or not all run), green (all passed), red (any failed), or yellow (any outdated).
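The color rules above amount to a simple precedence check over case states. The sketch below assumes gray takes priority over red, and red over yellow, when states are mixed; the state names and that ordering are assumptions for illustration, not the product's internals.

```python
def summary_color(cases: list[str]) -> str:
    """Map per-case states ('passed', 'failed', 'outdated', 'running', 'not_run')
    to the summary bar's background color."""
    if any(s in ("running", "not_run") for s in cases):
        return "gray"    # still running, or not every case has run
    if any(s == "failed" for s in cases):
        return "red"     # any failure
    if any(s == "outdated" for s in cases):
        return "yellow"  # any outdated result
    return "green"       # everything ran and passed

print(summary_color(["passed", "passed"]))    # green
print(summary_color(["passed", "outdated"]))  # yellow
```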
Click a case to open a detailed view. The left panel shows the case inputs and evaluator values, while the right panel displays the run output and each evaluator’s pass/fail status with feedback. You can compare previous runs using the run history dropdown. The right panel also includes full run details for quick debugging — admins have access to execution logs as well.
When a flow fails to execute, the case is marked with Error — distinct from Failed, which means the flow completed but evaluators didn’t pass. A warning banner appears at the top of the test when execution errors are detected.
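The Error/Failed distinction above resolves to a short status function: an execution failure short-circuits before evaluators are even considered. The status names and signature below are illustrative assumptions.

```python
def case_status(executed_ok: bool, evaluator_results: list[bool]) -> str:
    """Resolve a case's status: Error beats everything, then evaluators decide."""
    if not executed_ok:
        return "Error"   # the flow itself did not run to completion
    if all(evaluator_results):
        return "Passed"
    return "Failed"      # flow completed, but at least one evaluator did not pass

print(case_status(False, []))            # Error
print(case_status(True, [True, False]))  # Failed
print(case_status(True, [True, True]))   # Passed
```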