Once your test has evaluators and cases, you can run evaluations to verify your flow’s behavior.

Running Evaluations

Each case runs your flow independently with its defined inputs, then applies every evaluator against the outputs. Results are surfaced both as an at-a-glance summary on the test level and as detailed breakdowns per case.
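Conceptually, a run pairs every case with every evaluator. The sketch below illustrates that loop with toy names; it is not the Noxus implementation, just the shape of the process:

```python
# Illustrative sketch of an evaluation run: each case's inputs go
# through the flow, then every evaluator checks the output.
# Names here (run_evaluation, etc.) are invented for illustration.

def run_evaluation(flow, cases, evaluators):
    results = []
    for case in cases:
        output = flow(case["inputs"])  # each case runs the flow independently
        checks = {name: fn(output, case) for name, fn in evaluators.items()}
        results.append({"case": case["name"], "output": output, "checks": checks})
    return results

# Toy flow and a single deterministic evaluator
flow = lambda inputs: inputs["text"].upper()
cases = [{"name": "basic", "inputs": {"text": "hello"}}]
evaluators = {"contains": lambda out, case: "HELLO" in out}
results = run_evaluation(flow, cases, evaluators)
print(results[0]["checks"])  # {'contains': True}
```

The per-case results feed the detailed breakdowns, and aggregating them across cases produces the test-level summary.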

How to Run

By default, evaluations run against the current version of your flow. Use the version selector to target a specific saved version instead. Once the version is selected, you can:
  • Run all tests at once — click Run tests from the tests list for full regression testing.
  • Run all cases in a test — use the run button on a test row, or click Run all from the test detail view.
  • Run individual cases — use the run button on a case row, or the Run button inside a case.
You can cancel running evaluations at any time with the Cancel all button or the stop button on individual cases.
Evaluation results are not versioned. Switching to a different flow version marks all existing results as outdated, and cases may fail to run if the new version changes the flow’s inputs.

Understanding Results

Test Summary

The summary bar at the top of each test shows the pass rate, average latency, last run time, and token cost. Its background color reflects the overall state: gray (still running, or not all cases have run), green (all passed), red (any failed), or yellow (any outdated).
[Image: Test summary]

Per-Case Results

Click on a case to open a detailed view. The left panel shows the case inputs and evaluator values, while the right panel displays the run output and each evaluator’s pass/fail status with feedback. You can compare previous runs using the run history dropdown. The right panel also includes full run details for quick debugging — admins have access to execution logs as well.
[Image: Case results]

Execution Errors

When a flow fails to execute, the case is marked with Error — distinct from Failed, which means the flow completed but evaluators didn’t pass. A warning banner appears at the top of the test when execution errors are detected.
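The distinction boils down to a simple precedence rule, sketched here with invented names purely for illustration:

```python
# Illustrative sketch of the Error vs. Failed distinction.
# An execution failure takes precedence; only a completed run
# can be judged by its evaluators. Not Noxus internals.

def case_status(execution_error, evaluator_results):
    if execution_error:
        return "Error"    # the flow itself did not complete
    if not all(evaluator_results):
        return "Failed"   # flow completed, but an evaluator did not pass
    return "Passed"

print(case_status(True, []))              # Error
print(case_status(False, [True, False]))  # Failed
print(case_status(False, [True, True]))   # Passed
```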

Outdated Results

Noxus tracks changes using content hashes. Results are flagged as outdated when any of the following change after a run:
  • Flow definition — the workflow logic was modified.
  • Evaluator config — an evaluator’s settings were changed.
  • Case data — the test case inputs or expected outputs were modified.
  • Evaluator set — evaluators were added to or removed from the test.
Outdated results display a yellow warning triangle and a banner. Click Run outdated tests to re-evaluate only the affected tests.
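The idea behind hash-based change tracking can be shown in a few lines. This is a generic sketch of the technique, assuming JSON-serializable definitions; the field names are invented and the code is not Noxus internals:

```python
# Generic sketch of content-hash change detection: store a stable
# hash of each definition at run time, and flag the result as
# outdated if any hash differs later. Illustrative only.
import hashlib
import json

def content_hash(definition):
    """Stable hash of a JSON-serializable definition."""
    payload = json.dumps(definition, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def is_outdated(stored_hashes, current_definitions):
    """A result is outdated if any tracked definition changed."""
    return any(
        stored_hashes[key] != content_hash(current_definitions[key])
        for key in stored_hashes
    )

# Hashes captured when the result was produced
stored = {
    "flow": content_hash({"nodes": ["a", "b"]}),
    "evaluators": content_hash([{"type": "contains", "value": "ok"}]),
    "case": content_hash({"inputs": {"text": "hi"}}),
}

# The flow was edited since the run; everything else is unchanged
current = {
    "flow": {"nodes": ["a", "b", "c"]},
    "evaluators": [{"type": "contains", "value": "ok"}],
    "case": {"inputs": {"text": "hi"}},
}
print(is_outdated(stored, current))  # True
```

Sorting keys before hashing makes the hash independent of dictionary ordering, so only genuine content changes flip a result to outdated.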

Best Practices

  • Start with deterministic evaluators — fast, free, and predictable. Add AI evaluators only for subjective quality assessment.
  • Use specific output fields — target individual output connectors for precise assertions.
  • Name cases descriptively — “Long input with special characters” is easier to debug than “Test 1”.
  • Run before deploying a version — treat evaluations like a CI pipeline.
  • Combine evaluator types — layer structural checks (Is JSON, Contains) with quality checks (Rule-Based LLM).
  • Keep cases focused — one behavior per case makes failures easier to pinpoint.
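To make the layering idea concrete, here is a sketch of two deterministic checks in the spirit of Is JSON and Contains, with a cheap structural check gating a content check. The function names are illustrative, not a Noxus API:

```python
# Illustrative deterministic evaluators: a structural check (valid
# JSON?) gating a content check (contains a substring?). Fast,
# free, and predictable -- no model call involved.
import json

def is_json(output):
    try:
        json.loads(output)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False

def contains(output, needle):
    return needle in output

output = '{"status": "ok", "answer": 42}'
structural_pass = is_json(output)
quality_pass = structural_pass and contains(output, "answer")
print(structural_pass, quality_pass)  # True True
```

Running the cheap structural check first means the content check only runs on well-formed output, which keeps failure reports easy to interpret.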