Datumo Eval Terminology

Overview

A glossary of core terms used throughout the Datumo Eval platform and documentation.

🟦 Generation

Context Set

A collection of reference documents used as the basis for dataset generation.

Query Set

A collection of evaluation questions.

Response Set

A collection of model-generated responses.

Expected Response

A reference answer that represents an ideal or intended response.

Ground Truth (GT)

The authoritative correct answer used as an evaluation baseline.

Metadata

Additional attributes such as difficulty level, domain, and categories.

Chunk

A segmented unit of a document.

Context

Background information provided to a model, typically a set of multiple chunks.

Reference Context

The original context used as a source during dataset generation.

Special Query Columns

Reserved query fields used for evaluation, such as `expected_response`, `ground_truth`, `gold_answer`, and `reference_context`.

Special Response Columns

Reserved response fields used for evaluation, such as `retrieved_context` and `retrieved_chunk`.
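
For illustration, a single dataset record that uses these reserved columns might look like the sketch below. Only the reserved column names come from this glossary; the `query` and `response` keys and all of the values are assumptions made for the example.

```python
# Illustrative record combining the reserved query and response columns listed above.
# The "query"/"response" keys and all values are assumptions, not platform requirements.
record = {
    "query": "What is the capital of France?",
    "expected_response": "Paris is the capital of France.",
    "ground_truth": "Paris",
    "gold_answer": "Paris",
    "reference_context": "France's capital and largest city is Paris.",
    "response": "The capital of France is Paris.",
    "retrieved_context": "France's capital and largest city is Paris.",
    "retrieved_chunk": "France's capital and largest city is Paris.",
}
```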


🟧 Metric Management

Core Concepts

Metric

A defined evaluation criterion used to measure model output quality. May include rubrics, judge prompts, and output schemas.

Rubric

Detailed scoring rules that define how a metric should be assessed.

Judge Prompt

A structured prompt guiding the AI judge in automated evaluation.

Required Fields

Dataset fields that must be present for a metric to run successfully.
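
Putting the core concepts above together, a metric definition can be pictured as a bundle of a rubric, a judge prompt, an output schema, and the dataset fields it requires. The sketch below is illustrative only; the field names and structure are assumptions, not Datumo Eval's actual schema.

```python
# Illustrative sketch of a metric definition; names and structure are assumptions,
# not the platform's actual schema.
faithfulness_metric = {
    "name": "faithfulness",
    "required_fields": ["response", "retrieved_context"],  # dataset fields the metric needs
    "rubric": {
        1: "Most claims in the response are not supported by the retrieved context.",
        3: "Some claims are supported; others cannot be verified from the context.",
        5: "Every claim in the response is directly supported by the retrieved context.",
    },
    "judge_prompt": (
        "You are an evaluation judge. Using the rubric, rate how faithful the "
        "response is to the retrieved context. Return a score from 1 to 5 and a short rationale."
    ),
    "output_schema": {"score": "integer (1-5)", "rationale": "string"},
}
```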

Metric Types

Likert Scale

A human-evaluation scoring method using an ordinal scale (for example, a 1-5 rating).

Traditional Metrics

String-matching or semantic similarity metrics such as BLEU, METEOR, TER, ROUGE, and BERTScore.
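
As a concrete illustration, scores of this kind can be computed with common open-source packages; the snippet below uses the `nltk` and `rouge-score` libraries as an example and is not part of the platform itself.

```python
# Example only: computing two traditional metrics with open-source libraries.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Paris is the capital of France."
candidate = "The capital of France is Paris."

# BLEU: n-gram precision of the candidate against the reference.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence overlap with the reference.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}, ROUGE-L: {rouge_l:.3f}")
```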

RAGAS Metrics

Metrics for evaluating RAG systems, including faithfulness and answer relevance.
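
For intuition, faithfulness is typically computed as the fraction of claims in the response that are supported by the retrieved context. The sketch below shows that ratio in plain Python; in practice the per-claim verdicts come from an LLM judge rather than hand labels.

```python
# Minimal sketch of the faithfulness ratio: supported claims / total claims.
claim_verdicts = {
    "Paris is the capital of France.": True,          # supported by the retrieved context
    "Paris has a population of 10 million.": False,   # not supported
}

faithfulness = sum(claim_verdicts.values()) / len(claim_verdicts)
print(f"faithfulness = {faithfulness:.2f}")  # 0.50
```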

BEIR Metrics

Retrieval performance metrics such as nDCG, Recall, and MRR.
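
For reference, Recall@k and MRR can each be expressed in a few lines; the functions below are a plain-Python sketch of those two measures (nDCG additionally discounts each relevant hit by its rank).

```python
# Plain-Python sketch of two retrieval metrics used in BEIR-style evaluation.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d9"}
print(recall_at_k(retrieved, relevant, k=3), mrr(retrieved, relevant))  # 0.5 0.5
```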

Taxonomy

A classification framework that organizes metrics by evaluation domain or type.


🟩 Evaluation

Evaluation Units

Evaluation Task

A project-level unit designed for a specific evaluation purpose.

Evaluation Set

An executable evaluation unit consisting of metrics, models, and responses.

Evaluation Types

Judge Evaluation

Automated quality assessment using an AI judge.

Human Evaluation

Subjective evaluation performed by human annotators.

Quantitative Evaluation

Numerical performance measurement using objective metrics.

Model Roles

Target Model

The model being evaluated.

Agent

An AI system that interacts with its environment to achieve specific goals.

Judge Model

A model responsible for performing automated evaluation.

Generation Model

A model used to generate datasets.

Embedding Model

A model used to compute embeddings for RAGAS evaluation.

Evaluators

LLM Judge

An evaluator that uses LLMs to automatically score metrics.

Manual Evaluator

A human evaluator assessing responses using predefined rubrics.

Algorithmic Evaluator

An evaluator that computes algorithmic scores such as BLEU or ROUGE.

RAGAS Evaluator

An evaluator dedicated to computing RAGAS metrics.

Harness Task

A standardized benchmark evaluation using public datasets.

Results

Dashboard

A visualization interface for evaluation results.

Table View

A tabular display showing individual evaluation records.

Leaderboard

A ranked comparison of model performance across tasks.

RAG Concepts

RAG (Retrieval-Augmented Generation)

A technique that augments LLMs with external knowledge retrieved from a corpus.

Claim

A verifiable unit of information derived from an LLM response.

Entailment

A logical relationship indicating that a claim is supported by the provided context.

Hallucination

A phenomenon in which the model generates information that is factually incorrect or unsupported by the provided context.


🔴 Auto Red-Teaming

Strategy

A high-level blueprint defining an adversarial testing approach.

Seed Data

Initial prompts or scenarios used to generate attack variations.

Attack Prompt

A constructed prompt combining seed data and strategy to probe model vulnerabilities.

Strategy Library

A repository of reusable adversarial strategies.

ASR (Attack Success Rate)

The proportion of responses classified as unsafe.
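
As a worked example, ASR is the number of unsafe responses divided by the number of attack prompts sent; the snippet below is purely illustrative.

```python
# Illustrative calculation: ASR = unsafe responses / total attack prompts.
verdicts = ["safe", "unsafe", "safe", "unsafe", "unsafe"]  # one judge verdict per attack prompt
asr = verdicts.count("unsafe") / len(verdicts)
print(f"ASR = {asr:.0%}")  # 60%
```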


User & Access Management

Workspace

A collaborative space for sharing datasets, projects, and evaluation results.

Admin

A workspace manager with permissions for user, dataset, and project administration.

User

A general workspace member operating within assigned permissions.