
What are datasets for?

The improvement loop has three parts: observe what's happening, identify what's wrong, and test whether your fix works. Datasets serve that last step.

A dataset is a collection of test cases that you run your application against each time you make a change. You get a repeatable, consistent check across everything you care about.

TODO: loop visual

Because LLM outputs aren't deterministic, running a dataset isn't always a pass/fail check like a traditional test suite. You can also evaluate outputs with scoring functions that measure qualities like correctness, tone, or completeness. The experiments section covers in more detail how this works in practice.
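
To make this concrete, here is a minimal sketch of that loop in Python. Everything in it is illustrative: run_app() stands in for your actual application call, and score_tone() is a toy scoring function, not a real evaluator.

```python
# Illustrative only: run_app() stands in for your application,
# and score_tone() is a toy scoring function, not a real evaluator.
dataset = [
    {"input": "How do I get a refund?"},
    {"input": "Cancel my subscription."},
]

def run_app(user_input: str) -> str:
    # Placeholder for your actual application call (LLM, agent, pipeline, ...).
    return f"Echo: {user_input}"

def score_tone(output: str) -> float:
    # Toy scorer: a real one might use an LLM judge to rate professionalism.
    return 0.0 if "whatever" in output.lower() else 1.0

for item in dataset:
    output = run_app(item["input"])
    print(f"{item['input']!r} -> tone score {score_tone(output)}")
```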

The dataset item

A dataset is made up of items; each item represents one test case: a situation your application should be able to handle. An item generally has three fields:

  • Input (required)
  • Expected output (optional)
  • Metadata (optional)
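
As a rough illustration, an item might look like the dictionary below. The field names and values here are hypothetical; the exact shape depends on the tool you use.

```python
# Hypothetical item; field names and values vary by tool.
item = {
    # Input (required): what the application receives.
    "input": "I was charged twice this month, can you help?",
    # Expected output (optional): what a correct response looks like.
    "expected_output": "billing_inquiry",
    # Metadata (optional): extra context that helps when scoring the result.
    "metadata": {"source": "production-trace", "difficulty": "easy"},
}
```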

What should go where?

[Diagram: dataset fields and when they are used]

A good mental model is:

Field            Purpose
Input            The input needed for the task you're testing
Metadata         Any additional context that's helpful when scoring the result
Expected output  Defines what a correct or good response looks like

Different ways to use the expected output field

Some evaluators check the output against a predefined expected output (reference-based). Others assess the output without needing a ground truth to compare against (reference-free). Whether you need an expected output, and what it looks like, depends on which type of evaluator you use.

Exact match

The expected output is the literal correct answer. For example:

  • A classification task where the correct label is "billing_inquiry"
  • An extraction task where the expected entities are ["Paris", "Thursday"]
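
A minimal sketch of an exact-match check, assuming the output has already been parsed into a label or a list of entities:

```python
def exact_match(output: str, expected: str) -> bool:
    # Pass/fail: the output must equal the expected label exactly.
    return output.strip() == expected.strip()

def entities_match(output: list[str], expected: list[str]) -> bool:
    # Compare as sets so the order of extracted entities doesn't matter.
    return set(output) == set(expected)

assert exact_match("billing_inquiry", "billing_inquiry")
assert entities_match(["Thursday", "Paris"], ["Paris", "Thursday"])
```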

Reference answer

The expected output is a gold-standard response that shows what a good output looks like. The evaluator can compare the test's output against this example, for instance by checking semantic similarity or whether the key points match.
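
Below is a rough sketch of a reference-based comparison. It uses Python's standard-library difflib purely as a stand-in; in practice you would more likely use embedding similarity or an LLM judge. The reference text is made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(output: str, reference: str) -> float:
    # Crude lexical similarity in [0, 1]; swap in semantic similarity
    # (embeddings) or an LLM judge for real evaluations.
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio()

reference = "You can request a refund within 30 days via the billing page."
output = "Refunds can be requested within 30 days from the billing page."
print(round(similarity(output, reference), 2))  # high, but not 1.0
```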

Evaluation criteria

The expected output is a list of checks or requirements the output should satisfy. For example:

  • "must mention the refund policy"
  • "must include a link to the help center"

The evaluator checks whether the output meets these criteria.
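
A naive sketch of such a check, using substring matching as a placeholder. Real evaluators often ask an LLM judge whether each requirement is actually satisfied; the criteria phrasing here is simplified for illustration.

```python
def meets_criteria(output: str, criteria: list[str]) -> dict[str, bool]:
    # Naive substring check per criterion; real evaluators often ask an
    # LLM judge whether each requirement is actually satisfied.
    return {c: c.lower() in output.lower() for c in criteria}

criteria = ["refund policy", "help center"]  # simplified criteria phrasing
output = "Per our refund policy, please see the help center for next steps."
print(meets_criteria(output, criteria))  # both criteria satisfied
```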

Not needed

Sometimes no expected output is required at all. If you're just checking whether:

  • the tone is professional
  • the response is safe
  • the output follows a required format

then your dataset items don't need anything other than an input.
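
For example, a reference-free format check needs nothing but the output itself. This sketch assumes the required format is a single JSON object; the rule is illustrative.

```python
import re

def is_valid_format(output: str) -> bool:
    # Reference-free check: the output must be a single JSON-like object.
    # The same rule applies to every item, so no expected output is needed.
    return bool(re.fullmatch(r"\s*\{.*\}\s*", output, flags=re.DOTALL))

print(is_valid_format('{"answer": "yes"}'))   # True
print(is_valid_format("Sure, here you go!"))  # False
```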

When should you split into separate datasets?

When should you keep adding items to one dataset, and at what point should you split it into several?

Different tasks = different datasets

Different agents always get separate datasets, because a run executes one task or agent over the complete dataset; items for different tasks can't share one.

You need different evaluators for different items

When different items require fundamentally different evaluation logic, it becomes easier to separate them into datasets that each have a clear evaluation task.

If you don't want to split up your dataset, you can work around this by encoding evaluator configuration in each item's expected output. This works at small scale, but as the dataset grows, expected outputs become a mix of different shapes and it gets hard to tell what each item is actually testing.
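
A sketch of that workaround, with made-up shapes, shows why it degrades: the runner has to infer the evaluation logic from each item's expected output, and every new shape adds another branch.

```python
def evaluate(output: str, expected) -> bool:
    # Dispatch on the shape of expected_output: a string means exact match,
    # a list means criteria checks, None means a reference-free check.
    # Every new shape adds a branch, and items stop being self-explanatory.
    if expected is None:
        return len(output.strip()) > 0
    if isinstance(expected, str):
        return output.strip() == expected.strip()
    if isinstance(expected, list):
        return all(c.lower() in output.lower() for c in expected)
    raise ValueError(f"Unrecognized expected_output shape: {type(expected)}")
```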

You need different run cadences

A common pattern is having a smaller set of test cases that runs on every small change before it can be deployed to production, and a larger, comprehensive set that runs before major releases.

If all of these live in one dataset, you need custom filtering logic to select the right subset for each run. Separate datasets make the boundaries explicit.
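
For instance, in CI the boundary can be a simple lookup rather than per-item filter logic. The dataset names and environment variable here are hypothetical:

```python
import os

# Hypothetical dataset names: the boundary between the quick pre-merge
# check and the comprehensive pre-release run is explicit, not encoded
# in per-item filter logic.
DATASETS = {
    "pre-merge": "support-agent-smoke",   # small, runs on every change
    "pre-release": "support-agent-full",  # comprehensive, runs before releases
}

stage = os.environ.get("CI_STAGE", "pre-merge")
print(f"Running experiment against dataset: {DATASETS[stage]}")
```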

Your dataset is getting hard to navigate

A small dataset is easy to manage. Once it grows to cover many different scenarios and failure modes, it becomes harder to tell what each part is testing and where the gaps are. That's a clear sign that splitting into focused datasets would make sense: each one is easier to maintain, and coverage is easier to reason about.

TODO: add some specific examples

Where to start

TODO: How-to guide on creating datasets from traces?

