Eval Dataset

Overview

A Dataset is the starting point of the Datumo Eval pipeline and defines what inputs a model receives and the criteria by which its outputs are evaluated. The scope, difficulty, and interpretability of any evaluation depend entirely on how the Dataset is constructed.

Datumo Dataset Concept & Structure

In Datumo Eval, a Dataset is a structured collection of Query items and their corresponding baseline Response entries, optionally accompanied by Context when document-based grounding is required. All Datasets follow a strict 1:1 mapping between Query and Response, ensuring consistent and deterministic evaluation behavior.

When document-grounded validation is needed—such as in RAG evaluations—the Dataset may include Context derived from the original reference documents, or store retrieved_context as metadata after evaluation. Here, Context refers to the source material used when constructing the Dataset and serves as the grounding reference against which the model’s reasoning and factual alignment can be assessed.

This structure is applied consistently across Task configuration, model execution, and result analysis. A Dataset is not merely a list of questions; it defines the entire evaluation boundary and determines how model performance can be compared and interpreted.

Structure of a Dataset

A Dataset in Datumo Eval is composed of three optional elements that are selected and combined depending on the evaluation type:

Context

Context represents the underlying source material—documents, manuals, knowledge articles, policies, reports, and similar content—used to construct the Dataset. Users rely on Context when creating Query–Response pairs, ensuring that each question and its corresponding answer are grounded in verifiable source content.

Context serves as a provenance layer that clarifies where evaluation items originate and why a given Response is correct. In document-based or fact-checking evaluations, it provides the basis for validating correctness. In RAG evaluations, Context can be compared against retrieved_context to determine whether the model relied on appropriate supporting evidence. Thus, Context is a foundational reference layer that shapes the scope and nature of the Dataset.

Query

A Query is the input prompt, question, or scenario presented to the model during evaluation. The overall quality and expressiveness of a Dataset depend largely on how Queries are designed—their difficulty, phrasing, category, and evaluation intent. Every Dataset record must contain exactly one Query.

Model output is generated in response to the Query, and this output is scored against either the baseline Response or the Metrics configured in the evaluation. For this reason, Query design is the starting point of the Dataset and has the greatest influence on evaluation quality.

Response

The Response is the baseline answer—often the ground truth or reference answer—expected for a given Query. Most evaluation types in Datumo Eval (Judgment-based scoring, quantitative Metrics, and RAG evaluation) rely on this Response to determine the correctness or quality of the model’s output.

A Response can be plain text or structured output and should be constructed carefully, as inaccuracies in the baseline Response will directly undermine evaluation reliability. Because the Response serves as the primary comparison target, it is one of the most critical components when building a high-quality Dataset.

Metadata (Optional Extended Attributes)

Metadata provides additional attributes beyond the Query–Response structure and can include any supplementary information required for specific evaluation Metrics or analytical purposes. Metadata may include document alignment indicators, difficulty labels, domain identifiers, or system-generated fields such as retrieved_context, retrieval rank, or evidence-level judgments in RAG analysis.

When a Metric is configured to reference a Metadata field, Datumo Eval uses it directly within the evaluation engine. This enables advanced evaluation techniques such as context-faithfulness scoring, category-level breakdowns, difficulty-based performance variance analysis, and segment-based quality assessment. Metadata significantly enhances the depth and precision of model evaluation.

Difference Between Dataset and Red Teaming Seed

A Dataset is the formal evaluation dataset used in Datumo Eval. It contains structured Query–Response pairs (with optional Context) and serves as the basis for quantitative performance measurement and version-over-version comparison. It functions as the authoritative benchmark for model evaluation within Tasks and Eval Sets.

In contrast, Red Teaming Seeds are an entirely separate data type. Seeds are sets of adversarial prompts designed to probe model vulnerabilities, observe model behavior under risk scenarios, and identify safety weaknesses. Seeds contain no expected Response and are not part of any Evaluation Dataset. They are managed as Red Teaming Benchmarks and are never merged or reused within Evaluation workflows.

Thus, Dataset and Seed serve completely different purposes—one for structured evaluation, the other for vulnerability exploration—and should not be considered substitutes or hierarchical variants of one another.

Dataset in the Evaluation Workflow

The Dataset is the anchor of the entire evaluation flow in Datumo Eval. It determines what input the model receives and defines the standards against which outputs are compared. All model-generated responses are scored relative to the Query–Response structure, and the fidelity of the evaluation depends directly on Dataset quality.

Metadata further enriches this structure by enabling advanced Metrics such as context-faithfulness, reasoning-path validation, difficulty-segment analysis, and domain-specific diagnostics.

While Tasks and Eval Sets provide higher-level organization and execution management, the Dataset supplies the evaluation logic’s core reference point. When an Eval Set is created, Datumo Eval processes each Dataset entry sequentially, runs the Target Model, and computes scores using Metrics and Response comparisons.

All downstream analysis—category-specific patterns, difficulty-driven error trends, model regression tracking—is performed at the Dataset level. This is why the Dataset is not simply a collection of prompts; it is the structural basis that defines evaluation scope, scoring criteria, and analytical granularity within Datumo Eval.

Datumo Dataset Concept & Structure​

Structure of a Dataset​

Context​

Query​

Response​

Metadata (Optional Extended Attributes)​

Difference Between Dataset and Red Teaming Seed​

Dataset in the Evaluation Workflow​