Datumo Eval Terminology
A glossary of core terms used throughout the Datumo Eval platform and documentation.
🟦 Generation
Context Set
A collection of reference documents used as the basis for dataset generation.
Query Set
A collection of evaluation questions.
Response Set
A collection of responses generated by a model for a query set.
Expected Response
A reference answer that represents an ideal or intended response.
Ground Truth (GT)
The authoritative correct answer used as an evaluation baseline.
Metadata
Additional attributes such as difficulty level, domain, and category.
Chunk
A segmented unit of a document.
Context
Background information provided to a model, typically a set of multiple chunks.
Reference Context
The original context used as a source during dataset generation.
Special Query Columns
Reserved query fields used for evaluation, such as `expected_response`, `ground_truth`, `gold_answer`, and `reference_context`.
Special Response Columns
Reserved response fields used for evaluation, such as `retrieved_context` and `retrieved_chunk`.
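For illustration, a minimal sketch of how these reserved columns might appear in dataset rows (all other field names and values here are assumptions, not the platform's actual schema):

```python
# Illustrative query and response rows. Only the reserved column names
# listed above come from the glossary; everything else is assumed.
query_row = {
    "query": "What year was the transformer architecture introduced?",
    "expected_response": "The transformer was introduced in 2017.",
    "ground_truth": "2017",
    "reference_context": ["'Attention Is All You Need' (2017) ..."],
}

response_row = {
    "response": "The transformer architecture was introduced in 2017.",
    "retrieved_context": ["'Attention Is All You Need' (2017) ..."],
    "retrieved_chunk": "'Attention Is All You Need' (2017)",
}
```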
🟧 Metric Management
Core Concepts
Metric
A defined evaluation criterion used to measure model output quality; a metric may include rubrics, judge prompts, and output schemas.
Rubric
Detailed scoring rules that define how a metric should be assessed.
Judge Prompt
A structured prompt guiding the AI judge in automated evaluation.
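A minimal sketch of what a judge prompt template might look like (illustrative only; the placeholder names and output format are assumptions, not the platform's built-in prompt):

```python
# Hypothetical judge prompt template; placeholders are filled in
# per evaluation record before being sent to the judge model.
JUDGE_PROMPT = """You are an impartial evaluator.

Metric: {metric_name}
Rubric:
{rubric}

Question: {query}
Model response: {response}

Score the response from 1 to 5 according to the rubric, then briefly
justify your score. Return JSON: {{"score": <int>, "reason": "<str>"}}
"""
```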
Required Fields
Dataset fields that must be present for a metric to run successfully.
Metric Types
Likert Scale
A human-evaluation scoring method using an ordinal scale (e.g., 1 to 5).
Traditional Metrics
String-matching or semantic similarity metrics such as BLEU, METEOR, TER, ROUGE, and BERTScore.
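For example, BLEU and ROUGE-L can be computed with common open-source libraries; a minimal sketch using `nltk` and `rouge-score` (illustrative, not the platform's evaluator):

```python
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
hypothesis = "the cat is on the mat"

# BLEU: n-gram precision against the reference (smoothed for short texts)
bleu = sentence_bleu(
    [reference.split()], hypothesis.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence overlap with the reference
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, hypothesis)["rougeL"].fmeasure

print(f"BLEU={bleu:.3f}  ROUGE-L={rouge_l:.3f}")
```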
RAGAS Metrics
Metrics for evaluating RAG systems, including faithfulness and answer relevance.
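Faithfulness, for instance, is the fraction of claims in a response that are entailed by the retrieved context. A minimal sketch of that idea (in practice RAGAS uses an LLM to extract claims and check entailment; `is_entailed` here is a hypothetical stand-in):

```python
# Faithfulness = supported claims / total claims.
# is_entailed(claim, context) -> bool is a hypothetical entailment check.
def faithfulness(claims: list[str], context: str, is_entailed) -> float:
    supported = sum(1 for claim in claims if is_entailed(claim, context))
    return supported / len(claims) if claims else 0.0
```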
BEIR Metrics
Retrieval performance metrics such as nDCG, Recall, and MRR.
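A minimal sketch of Recall@k and MRR over ranked result lists (the data shapes are assumptions for illustration):

```python
# Recall@k: fraction of relevant documents that appear in the top-k results.
def recall_at_k(ranking: list[str], relevant: set[str], k: int) -> float:
    return len(set(ranking[:k]) & relevant) / len(relevant) if relevant else 0.0

# MRR: mean reciprocal rank of the first relevant hit per query.
def mrr(rankings: list[list[str]], relevants: list[set[str]]) -> float:
    total = 0.0
    for ranking, relevant in zip(rankings, relevants):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)
```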
Taxonomy
A classification framework that organizes metrics by evaluation domain or type.
🟩 Evaluation
Evaluation Units
Evaluation Task
A project-level unit designed for a specific evaluation purpose.
Evaluation Set
An executable evaluation unit consisting of metrics, models, and responses.
Evaluation Types
Judge Evaluation
Automated quality assessment using an AI judge.
Human Evaluation
Subjective evaluation performed by human annotators.
Quantitative Evaluation
Numerical performance measurement using objective metrics.
Model Roles
Target Model
The model being evaluated.
Agent
An AI system that interacts with environments to achieve specific goals.
Judge Model
A model responsible for performing automated evaluation.
Generation Model
A model used to generate datasets.
Embedding Model
A model used to compute embeddings for RAGAS evaluation.
Evaluators
LLM Judge
An evaluator that uses LLMs to automatically score metrics.
Manual Evaluator
A human evaluator assessing responses using predefined rubrics.
Algorithmic Evaluator
An evaluator that computes algorithmic scores such as BLEU or ROUGE.
RAGAS Evaluator
An evaluator dedicated to computing RAGAS metrics.
Harness Task
A standardized benchmark evaluation using public datasets.
Results
Dashboard
A visualization interface for evaluation results.
Table View
A tabular display showing individual evaluation records.
Leaderboard
A ranked comparison of model performance across tasks.
RAG Concepts
RAG (Retrieval-Augmented Generation)
A technique that augments LLMs with external knowledge retrieved from a corpus.
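A minimal sketch of the retrieve-augment-generate loop (naive word overlap stands in for embedding similarity, and `generate` is a hypothetical stand-in for a real LLM call):

```python
# Retrieve: rank chunks by overlap with the query (toy stand-in for
# embedding similarity) and keep the top k.
def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))[:k]

# Augment and generate: prepend retrieved context, then call the model.
def rag_answer(query: str, chunks: list[str], generate) -> str:
    context = "\n".join(retrieve(query, chunks))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)  # hypothetical LLM call
```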
Claim
A verifiable unit of information derived from an LLM response.
Entailment
A logical relationship indicating that a claim is supported by the provided context.
Hallucination
A phenomenon where the model generates information that is factually incorrect or unsupported by the provided context.
🔴 Auto Red-Teaming
Strategy
A high-level blueprint defining an adversarial testing approach.
Seed Data
Initial prompts or scenarios used to generate attack variations.
Attack Prompt
A constructed prompt combining seed data and strategy to probe model vulnerabilities.
Strategy Library
A repository of reusable adversarial strategies.
ASR (Attack Success Rate)
The proportion of attack prompts whose responses are classified as unsafe.
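ASR is a simple proportion over judged attack responses; for example:

```python
# Hypothetical safety verdicts for five attack prompts.
verdicts = ["unsafe", "safe", "safe", "unsafe", "safe"]
asr = verdicts.count("unsafe") / len(verdicts)
print(f"ASR = {asr:.0%}")  # ASR = 40%
```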
🟪 User & Access Management
Workspace
A collaborative space for sharing datasets, projects, and evaluation results.
Admin
A workspace manager with permissions for user, dataset, and project administration.
User
A general workspace member operating within assigned permissions.