Eval Set
Overview
An Eval Set is the execution unit of evaluation in Datumo Eval: each one represents a single complete run in which a specific Evaluation Task is executed against a selected Target Model. In other words, an Eval Set is a snapshot capturing “the result of evaluating this model with this Task configuration and Dataset at a particular point in time.”
Because the Eval Set stores snapshots of the Task configuration, Dataset, and Target Model as they were at execution time, each run becomes an independent and reproducible evaluation record. This makes Eval Sets the foundational units for model comparison, A/B testing, and version tracking.
Structure of an Eval Set
An Eval Set consists of both execution metadata and evaluation results. The metadata includes the Target Model, execution status, creation time, and snapshots of the applied Task and Dataset.
Once the evaluation is complete, the Eval Set stores the model’s raw responses, scoring results from the Judge Model or quantitative metrics, reasoning logs, and any error information.
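A minimal sketch of how such a record might be laid out is shown below. The class and field names are illustrative assumptions, not Datumo Eval's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional

@dataclass
class EvalItemResult:
    """Result for a single Query; illustrative fields, not the real schema."""
    query: str
    response: str                          # raw output from the Target Model
    score: Optional[float] = None          # Judge score or quantitative metric
    judge_reasoning: Optional[str] = None  # reasoning log from the Judge Model
    error: Optional[str] = None            # error information, if the item failed

@dataclass
class EvalSet:
    """One evaluation run; illustrative fields, not the real schema."""
    name: str
    target_model: str                       # model identifier at execution time
    status: str                             # Pending / Running / Completed / Failed / Paused
    created_at: datetime
    task_snapshot: dict[str, Any]           # frozen copy of the Task configuration
    dataset_snapshot: list[dict[str, Any]]  # frozen copy of the Dataset items
    results: list[EvalItemResult] = field(default_factory=list)
```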
With this structure, users can reliably compare different model versions on the same Task, or re-run evaluations on the same model over time to observe stability and performance drift.
How Eval Sets Are Created and Executed
Eval Sets are created whenever a user runs an Evaluation Task. Since each Task already contains its Dataset and evaluation configuration (scoring rules, Judge settings, etc.), the user only needs to select the Target Model at execution time.
When execution begins, the Datumo Eval pipeline processes each Query in the Dataset by sending it to the Target Model, collecting its Response, and then evaluating it through a Judge Model or quantitative metric. After all items are processed, the results are stored as a complete Eval Set—an immutable record of that specific evaluation run.
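Conceptually, that loop can be sketched as below. The call_model and judge callables stand in for the Target Model and the Judge Model (or metric); they are placeholders, not Datumo Eval APIs.

```python
from typing import Callable

def run_eval_items(
    dataset: list[dict],
    call_model: Callable[[str], str],    # placeholder for the Target Model call
    judge: Callable[[str, str], float],  # placeholder for the Judge Model or metric
) -> list[dict]:
    """Sketch of the per-Query pipeline: Query -> Response -> score."""
    results = []
    for item in dataset:
        query = item["query"]
        try:
            response = call_model(query)    # collect the model's Response
            score = judge(query, response)  # evaluate via Judge or quantitative metric
            results.append({"query": query, "response": response, "score": score})
        except Exception as exc:            # record the error instead of aborting the run
            results.append({"query": query, "error": str(exc)})
    return results
```

Once every item has been processed, these per-Query results, together with the Task and Dataset snapshots, form the stored Eval Set.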
Status Management
Eval Sets move through several statuses during execution, making it easy to track progress.
A newly created Eval Set begins in the Pending state and transitions to Running once execution starts. Successful runs finish in the Completed state, while errors lead to the Failed state.
Depending on the Task type, execution may be paused (Paused) and resumed later. These states improve visibility and manageability, especially when running multiple evaluations simultaneously.
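Viewed as a small state machine, the lifecycle can be sketched as follows; the allowed transitions are an assumption inferred from the states described above, not Datumo Eval's exact rules.

```python
from enum import Enum

class EvalSetStatus(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    PAUSED = "Paused"
    COMPLETED = "Completed"
    FAILED = "Failed"

# Assumed transition rules based on the description above.
ALLOWED_TRANSITIONS = {
    EvalSetStatus.PENDING: {EvalSetStatus.RUNNING},
    EvalSetStatus.RUNNING: {EvalSetStatus.PAUSED, EvalSetStatus.COMPLETED, EvalSetStatus.FAILED},
    EvalSetStatus.PAUSED: {EvalSetStatus.RUNNING},
    EvalSetStatus.COMPLETED: set(),  # terminal
    EvalSetStatus.FAILED: set(),     # terminal
}

def can_transition(current: EvalSetStatus, target: EvalSetStatus) -> bool:
    return target in ALLOWED_TRANSITIONS[current]
```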
Versioning and Use Cases
Eval Sets are central to version management in Datumo Eval.
Running the same Task on multiple versions of a model allows users to track improvements or regressions. Repeatedly evaluating the same model also reveals performance stability and variance over time.
A/B testing—comparing different model settings or architectures—is also performed at the Eval Set level. The Datumo Dashboard provides side-by-side comparisons across Eval Sets, enabling analysis of metric differences, areas of improvement or regression, and Query-level performance patterns.
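As a rough illustration, a side-by-side comparison can be reduced to mean scores and per-Query deltas, as in the sketch below. The result layout is the illustrative one used earlier, not the Dashboard's internal format, and both runs are assumed to share the same Dataset order.

```python
def compare_eval_sets(results_a: list[dict], results_b: list[dict]) -> dict:
    """Compare two runs item by item: overall means plus per-Query improvements and regressions."""
    scores_a = [r["score"] for r in results_a if r.get("score") is not None]
    scores_b = [r["score"] for r in results_b if r.get("score") is not None]
    deltas = [
        {"query": a["query"], "delta": b["score"] - a["score"]}
        for a, b in zip(results_a, results_b)
        if a.get("score") is not None and b.get("score") is not None
    ]
    return {
        "mean_a": sum(scores_a) / len(scores_a) if scores_a else None,
        "mean_b": sum(scores_b) / len(scores_b) if scores_b else None,
        "improvements": [d for d in deltas if d["delta"] > 0],
        "regressions": [d for d in deltas if d["delta"] < 0],
    }
```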
Reviewing and Re-running Eval Sets
Once an Eval Set is completed, its full results can be explored on the Dashboard. This includes model outputs for each Query, Judge reasoning traces, score distributions, and more.
Depending on the Task configuration, users may re-run only the failed items, re-run Judge evaluation without regenerating model outputs, or perform a full re-run to produce a new Eval Set.
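A failed-items re-run amounts to selecting the Queries that errored in the previous run and feeding only those back through the pipeline. The helper below is a minimal sketch using the illustrative result layout from earlier.

```python
def select_failed_items(previous_results: list[dict]) -> list[dict]:
    """Pick out the Queries that produced an error in a previous Eval Set."""
    return [{"query": r["query"]} for r in previous_results if "error" in r]

# Example: one item failed in the previous run and would be re-sent to the model.
previous = [
    {"query": "Q1", "response": "...", "score": 0.9},
    {"query": "Q2", "error": "timeout"},
]
print(select_failed_items(previous))  # [{'query': 'Q2'}]
```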
All evaluation results can be exported in CSV or JSON format for further analysis or reporting.
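For example, exporting results shaped like the earlier sketches with only the standard library could look like this; the file names and columns are illustrative.

```python
import csv
import json

results = [
    {"query": "Q1", "response": "Paris", "score": 1.0},
    {"query": "Q2", "response": "Berlin", "score": 0.0},
]

# JSON export: preserves nested fields such as reasoning traces.
with open("eval_set_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

# CSV export: flat rows, convenient for spreadsheets and reporting.
with open("eval_set_results.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "response", "score"])
    writer.writeheader()
    writer.writerows(results)
```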
Best Practices
To manage evaluations efficiently, it is helpful to adopt a consistent naming convention for Eval Sets—such as combining model name, version, date, and purpose. This ensures clarity even in large-scale projects.
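One lightweight way to enforce such a convention is a small helper that assembles the name from its parts; the pattern below is only an example, not a Datumo Eval requirement.

```python
from datetime import date
from typing import Optional

def eval_set_name(model: str, version: str, purpose: str, run_date: Optional[date] = None) -> str:
    """Build a consistent Eval Set name: <model>_<version>_<YYYY-MM-DD>_<purpose>."""
    run_date = run_date or date.today()
    return f"{model}_{version}_{run_date.isoformat()}_{purpose}"

print(eval_set_name("chat-model", "v2.1", "regression-check"))
# e.g. chat-model_v2.1_2025-06-01_regression-check
```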
For reliable comparison, Eval Sets should be generated with the same Task configuration, and a sufficiently large Dataset improves statistical validity.
If anomalies appear in the results, investigating those specific Queries often reveals underlying model weaknesses and guides quality improvements.
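A quick pass over the per-Query scores is often enough to surface candidates for that kind of review; the threshold and field names in this sketch are illustrative.

```python
def find_outlier_queries(results: list[dict], threshold: float = 0.5) -> list[dict]:
    """Return Queries whose score falls below a chosen threshold, for manual inspection."""
    return [
        {"query": r["query"], "score": r["score"], "response": r.get("response")}
        for r in results
        if r.get("score") is not None and r["score"] < threshold
    ]
```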