All available evaluator types for testing flow outputs
Evaluators are the rules that score your test case outputs. Each evaluator targets a specific output field from your flow and returns a pass/fail result with optional feedback. You can attach multiple evaluators to a single test. When a case runs, every evaluator is applied independently, and the case passes only if all evaluators pass.
Some evaluator settings act as defaults that can be overridden directly on individual test cases. This lets you reuse a single evaluator across many cases while customizing the expected value for each one. When you open a case, overridable properties appear in the Evaluator values section of the case editor. For example, an Equals evaluator might have a default expected value of "Hello", but for a specific case you can override it to "Goodbye" without creating a separate evaluator.
Properties that support per-case overrides are marked in the tables below.
Some properties also support dynamic references to your flow’s inputs and outputs. These can be inserted into a compatible field by typing / or using the Insert variable option, and they are represented as chips.
Input — corresponds to the value of a flow input, as mapped in the case. Useful when the expected output should match or contain the original input.
Output — resolves to the actual value produced by the flow when the case is run. Useful when comparing one output against another.
This makes it possible to write evaluators like “the summary output should contain the customer name from the input” without hardcoding values.
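To make the behavior concrete, here is a minimal sketch of how such references could be resolved before an evaluator runs its comparison. The chip syntax ({input:name}, {output:name}) and the function name are illustrative assumptions, not the product's actual API:

```python
# Hypothetical sketch of resolving Input/Output chips in an expected
# value. The {input:...}/{output:...} syntax is illustrative only.
def resolve_references(template: str, inputs: dict, outputs: dict) -> str:
    """Replace input/output chips with the values from the running case."""
    for name, value in inputs.items():
        template = template.replace(f"{{input:{name}}}", str(value))
    for name, value in outputs.items():
        template = template.replace(f"{{output:{name}}}", str(value))
    return template

expected = resolve_references(
    "Summary for {input:customer_name}",
    inputs={"customer_name": "Ada"},
    outputs={},
)
print(expected)  # Summary for Ada
```

Resolving references at run time is what lets one evaluator definition adapt to each case's mapped inputs and actual outputs.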
Compares the output to an expected string using sequence-based similarity scoring.
| Setting | Description | Default |
| --- | --- | --- |
| Expected value | The reference string to compare against. | — |
| Threshold | Minimum score required for this evaluation to pass (0.0 = no match, 1.0 = identical). | 0.8 |
| Case sensitive | Whether uppercase and lowercase letters are treated as different (A ≠ a). | Off |
Sequence-based similarity — A method that compares the longest contiguous matching subsequences between the output and the expected value. A score of 1.0 means the strings are identical, while 0.0 means they share no common sequences.
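Python's standard-library difflib implements exactly this kind of longest-contiguous-subsequence scoring, so the evaluator's behavior can be approximated as follows. This is a sketch of the general technique, not the product's internal implementation; function and parameter names mirror the settings table above:

```python
from difflib import SequenceMatcher

def close_match(output: str, expected: str,
                threshold: float = 0.8, case_sensitive: bool = False) -> bool:
    """Approximate sequence-based similarity: pass if ratio >= threshold."""
    if not case_sensitive:
        output, expected = output.lower(), expected.lower()
    # ratio() is 2*M/T, where M is the total length of matched
    # contiguous blocks and T is the combined length of both strings.
    score = SequenceMatcher(None, output, expected).ratio()
    return score >= threshold

print(close_match("Hello there", "hello their"))  # small typo, still passes
print(close_match("Hello", "Goodbye"))            # almost no overlap, fails
```

Identical strings score 1.0, and strings with no common sequences score 0.0, matching the threshold semantics described above.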
AI evaluators use an LLM to assess outputs against natural language criteria. They are more flexible than deterministic evaluators but consume model tokens and may produce slightly different results across runs.
Evaluates the output against free-text rules using an LLM judge. The LLM receives the test case input, the flow output, and your rules. It then assigns a score from 1 to 10 based on how well the output aligns with the rules. The score is compared against your pass threshold to determine if the evaluation passes.
| Setting | Description | Default |
| --- | --- | --- |
| Rules | Natural language instructions describing what makes a good output (e.g., “The response should be polite and concise”). | — |
| Pass threshold | Minimum score required for this evaluation to pass. | |
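The pass decision reduces to a single comparison of the judge's 1–10 score against the threshold. A minimal sketch, with a stub standing in for the actual LLM call (names are illustrative, not the product's API):

```python
# Hedged sketch of the single-rules AI evaluator's pass logic.
# `judge` stands in for the LLM call, which scores the output 1-10.
def evaluate_rules(output: str, rules: str, pass_threshold: int,
                   judge) -> tuple[bool, int]:
    score = judge(output, rules)
    return score >= pass_threshold, score

passed, score = evaluate_rules(
    "Thanks for reaching out! Your refund is on its way.",
    "The response should be polite and concise",
    pass_threshold=7,
    judge=lambda out, rules: 9,  # stub in place of a real model call
)
print(passed, score)  # True 9
```

Because the score comes from a model, the same case can land on either side of the threshold across runs, which is the variability the note above warns about.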
Evaluates the output against multiple named criteria using an LLM judge. The LLM scores each criterion independently, assigning it a score from 1 to 10 based on how well the output aligns with it. The average of the final scores is compared against the pass threshold. This is useful when you want to assess different quality dimensions separately — for example, accuracy, tone, and completeness.
| Setting | Description | Default |
| --- | --- | --- |
| Criteria | A list of named criteria, each with its own instructions (e.g., name: “Tone”, instructions: “The tone in the response should be formal but witty.”). | — |
| Pass threshold | Minimum average score required for this evaluation to pass. | |
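The multi-criteria variant scores each criterion independently and compares the average against the threshold. A minimal sketch under the same assumptions as above (the judge is stubbed; names are illustrative, not the product's API):

```python
# Hedged sketch of the multi-criteria AI evaluator's scoring logic.
# `judge` stands in for one LLM call per criterion (scores 1-10).
def evaluate_criteria(output: str, criteria: dict[str, str],
                      pass_threshold: float, judge) -> tuple[bool, float]:
    # Score each named criterion independently, then average.
    scores = {name: judge(output, instructions)
              for name, instructions in criteria.items()}
    average = sum(scores.values()) / len(scores)
    return average >= pass_threshold, average

# Stub judge that returns a fixed score per criterion's instructions.
stub_scores = {"formal but witty": 8, "factually correct": 6}
passed, average = evaluate_criteria(
    "Indeed, the widget ships Tuesday.",
    {"Tone": "formal but witty", "Accuracy": "factually correct"},
    pass_threshold=7,
    judge=lambda out, instructions: stub_scores[instructions],
)
print(passed, average)  # True 7.0
```

Averaging means one weak criterion can be offset by a strong one; if every dimension must clear the bar on its own, a higher threshold (or separate evaluators) is the safer setup.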