Datumo-Concepts

DATUMO Eval Overview

DATUMO Eval is a comprehensive evaluation platform for systematically verifying the quality and safety of generative AI services. Through various evaluation methods and specialized features, you can measure and improve AI model performance from multiple perspectives.

DATUMO Eval is a comprehensive evaluation platform designed to systematically verify the quality and safety of generative AI services.
It provides multiple evaluation methods and specialized capabilities to assess and improve AI model performance from various perspectives.

Introduces key evaluation modes and features including Judge, Human, and Quantitative evaluations.

The starting point and core input unit of the evaluation pipeline.

Explains the roles and configurations of the Target Model and Judge Model.

The core unit that defines the goal, criteria, and methodology of an evaluation.

Quantitative criteria for evaluating model responses.

The actual execution unit of an evaluation, used to manage results.

The central hub for visualizing and analyzing evaluation results.

PART 1: Evaluation Methods

DATUMO Eval consists of three main evaluation methods:

Judge Evaluation

Automated quality evaluation using AI Judge → Achieve both human-level judgment and large-scale automation → Currently supports single-turn based evaluation, multi-turn evaluation coming soon

Human Evaluation

Subjective quality verification using human judgment → Evaluate elements that are difficult to measure numerically, such as nuance, creativity, and appropriateness

Quantitative Evaluation

Numerical performance measurement based on objective metrics → Objective comparison between models through reproducible and standardized benchmarks

Each method can be used independently or combined complementarily for highly reliable evaluation.

PART 2: Evaluation Features

Below are the specific evaluation features provided by DATUMO Eval. Each feature is optimized for specific evaluation methods and purposes.

1. Judge Evaluation

A method that automatically evaluates the response quality and stability of generative AI using AI Judge models. It is characterized by mimicking human judgment while enabling large-scale automation.

1-1) Evaluation Task

The most versatile evaluation framework that performs evaluation based on service responses to evaluation queries. Applicable to most service types, it consists of the following main evaluation items:

Main Evaluation Areas:

Safety Evaluation: Evaluates how safely generative AI services respond to adversarial prompts that induce "unauthorized outputs (biased statements, etc.)".
RAG Quality Evaluation: Evaluates the appropriateness of information utilization for responses generated by RAG (Retrieval-Augmented Generation) systems based on document retrieval.

Detailed Features:

Evaluation Task Management: Create and execute evaluation processes in Task units
Evaluation Set Management: Systematically manage evaluation datasets, with ability to stop/restart evaluation per Set
Evaluation Dashboard: Visually check overall evaluation results and comparative analysis between Tasks
Detailed Result Analysis: Check detailed performance results by question and item, and derive improvements

1-2) RAGAs Task

An automatic evaluation feature using RAGAs (Retrieval-Augmented Generation Assessment) metrics.

Main Features:

Measure retrieval quality and generation quality of RAG systems using standardized metrics
Multi-dimensional evaluation of the relationship between retrieved context and generated responses
Quick performance assessment through automated numerical evaluation

Evaluation Metrics:

Answer Correctness: Evaluates how accurately the generated response matches the Ground Truth. Considers both Factuality and Semantic Similarity.
Response Relevancy: Evaluates relevance to the question. Scores decrease if the response is incomplete or contains unnecessary information. Scores range from 0-1, closer to 1 is better.
Semantic Similarity: Measures how semantically similar the generated response is to the Ground Truth. A metric that numerically expresses semantic alignment.
Context Entity Recall: Calculates recall based on how well entities in the Ground Truth are included in the retrieved Context.
LLM Context Precision With GT: Evaluates how highly relevant items rank among the contexts selected by the model. Measures average precision of relevant information.
LLM Context Recall: Evaluates how sufficiently the retrieved context includes relevant information based on the given question and Ground Truth.
Factual Correctness: A metric that evaluates the factual accuracy of generated responses. Decomposes claims in the response and compares them with reference documents to determine the truth of each claim.
Faithfulness: Measures how factually consistent the response is with the retrieved Context. Values closer to 1 indicate more faithful responses to the context.
Noise Sensitivity: Measures the frequency at which the system generates incorrect responses when using relevant or irrelevant documents. Lower values indicate more robust (stable) performance against noise.
Answer Accuracy: Evaluates how well the model response matches the Ground Truth for a question. Uses "LLM-as-a-judge" evaluation method for scoring.
Context Relevance: Evaluates how closely the retrieved Context relates to the user's question. Higher relevance receives better scores.
Response Groundedness: Evaluates how well the model's response is grounded in the provided context. Higher scores when each claim in the response can find evidence in the context.

1-3) RAG Checker

Evaluates RAG system responses based on ER (Expected Response) for evaluation queries. Measures whether claims in the ER (which serves as a model answer) are included in the retrieved documents (Context) and RAG system responses to evaluate the performance of both the Generator module and Retriever module of the RAG system separately.

Evaluation Mechanism:

Extract and decompose claims in the Expected Response
Verify if claims are included in retrieved documents (Context) → Retriever Performance Evaluation
Verify if claims are reflected in RAG system responses → Generator Performance Evaluation

1-4) Auto Red-Teaming

A system that automatically generates adversarial prompts using an attack scenario library and verifies the safety and vulnerabilities of AI models.

Main Features:

Automatically apply various attack strategies (Jailbreak, Prompt Injection, etc.)
Detect model vulnerabilities through repeated adversarial testing
Identify subtle risks that are difficult to discover through Safety Evaluation alone

2. Human Evaluation

A method where humans directly review and evaluate AI responses, judging elements that are difficult to measure numerically.

2-1) Manual Evaluation

A feature where evaluators systematically evaluate AI responses based on predefined evaluation criteria (Rubric).

Main Features:

Ensure consistency through clear evaluation rubrics
Compare results between multiple evaluators and analyze reliability
Score by detailed items and write comments
Optimized for complex response quality evaluation requiring qualitative judgment

2-2) Interactive Evaluation

An interactive system that evaluates response quality immediately while conversing with AI models in real-time.

Main Features:

Check and evaluate responses immediately after query input
Simple immediate feedback such as Good/Bad
Write Ground Truth (GT) and suggest improvements
Useful for rapid prototype testing and exploratory evaluation

3. Quantitative Evaluation

A method that measures model performance through objective and reproducible numerical metrics. Enables objective comparison and benchmarking between models using standardized metrics.

3-1) Harness Task

A system that measures AI model performance using standardized benchmark datasets and compares them through leaderboards.

Supported Datasets:

HRM8K: Mathematical reasoning and problem-solving ability evaluation
KMMLU: Korean multi-domain knowledge comprehension evaluation (Korean Multi-domain Multi-task Language Understanding)
KOBEST: Korean natural language understanding benchmark (Korean Benchmark Suite for Natural Language Understanding)
Selected subsets of other global standard benchmarks

Main Features:

Support for standard benchmarks widely recognized in academia and industry
Large-scale testing through automated evaluation
Objective ranking comparison between models through leaderboards
Continuous benchmark updates and expansion

3-2) Reference-based Evaluation (NLP Metrics)

An automatic evaluation system that measures similarity by comparing model responses with answers or reference answers (Ground Truth).

Supported Metrics:

BLEU: N-gram based similarity measurement, mainly used for machine translation quality evaluation
ROUGE: Text overlap measurement, standard metric for summarization quality evaluation
METEOR: Machine translation evaluation metric considering semantic similarity
TER (Translation Edit Rate): Translation error rate measurement based on edit distance (*closer to 0 is better)
BERTScore: Semantic similarity measurement using BERT embeddings

Overall Evaluation Features Summary

Evaluation Feature	Category	Main Purpose	Evaluation Target
Evaluation Task	Judge Evaluation	General AI response quality evaluation	Safety, RAG Quality
RAGAs Task	Judge Evaluation	RAG system automatic evaluation	Retrieval quality, Generation quality
RAG Checker	Judge Evaluation	Claim-level RAG precision evaluation	Factuality, Information accuracy
Auto Red-Teaming	Judge Evaluation	Automated security vulnerability verification	Safety, Robustness
Manual Evaluation	Human Evaluation	Rubric-based systematic human evaluation	Quality, Appropriateness, Creativity
Interactive Evaluation	Human Evaluation	Real-time conversational immediate evaluation	Prototype, Exploratory testing
Harness Task	Quantitative Evaluation	Standard benchmark performance measurement	Knowledge, Reasoning, Language understanding
Reference-based	Quantitative Evaluation	NLP metric-based similarity evaluation	Translation, Summarization, Generation quality

Datumo-Concepts

DATUMO Eval Overview

Evaluation Features

Dataset

Model & Agent

Evaluation Task

Metrics

Eval Set

Dashboard

PART 1: Evaluation Methods

Judge Evaluation

Human Evaluation

Quantitative Evaluation

PART 2: Evaluation Features

1. Judge Evaluation

1-1) Evaluation Task

1-2) RAGAs Task

1-3) RAG Checker

1-4) Auto Red-Teaming

2. Human Evaluation

2-1) Manual Evaluation

2-2) Interactive Evaluation

3. Quantitative Evaluation

3-1) Harness Task

3-2) Reference-based Evaluation (NLP Metrics)

Overall Evaluation Features Summary

DATUMO Eval Overview​

Evaluation Features

Dataset

Model & Agent

Evaluation Task

Metrics

Eval Set

Dashboard

PART 1: Evaluation Methods​

Judge Evaluation​

Human Evaluation​

Quantitative Evaluation​

PART 2: Evaluation Features​

1. Judge Evaluation​

1-1) Evaluation Task​

1-2) RAGAs Task​

1-3) RAG Checker​

1-4) Auto Red-Teaming​

2. Human Evaluation​

2-1) Manual Evaluation​

2-2) Interactive Evaluation​

3. Quantitative Evaluation​

3-1) Harness Task​

3-2) Reference-based Evaluation (NLP Metrics)​

Overall Evaluation Features Summary​

DATUMO Eval Overview

PART 1: Evaluation Methods

Judge Evaluation

Human Evaluation

Quantitative Evaluation

PART 2: Evaluation Features

1. Judge Evaluation

1-1) Evaluation Task

1-2) RAGAs Task

1-3) RAG Checker

1-4) Auto Red-Teaming

2. Human Evaluation

2-1) Manual Evaluation

2-2) Interactive Evaluation

3. Quantitative Evaluation

3-1) Harness Task

3-2) Reference-based Evaluation (NLP Metrics)

Overall Evaluation Features Summary