Skip to main content

Datumo-Concepts


DATUMO Eval Overview

DATUMO Eval is a comprehensive evaluation platform for systematically verifying the quality and safety of generative AI services. Through various evaluation methods and specialized features, you can measure and improve AI model performance from multiple perspectives.


DATUMO Eval is a comprehensive evaluation platform designed to systematically verify the quality and safety of generative AI services.
It provides multiple evaluation methods and specialized capabilities to assess and improve AI model performance from various perspectives.

PART 1: Evaluation Methods

DATUMO Eval consists of three main evaluation methods:

Judge Evaluation

Automated quality evaluation using AI Judge → Achieve both human-level judgment and large-scale automation → Currently supports single-turn based evaluation, multi-turn evaluation coming soon

Human Evaluation

Subjective quality verification using human judgment → Evaluate elements that are difficult to measure numerically, such as nuance, creativity, and appropriateness

Quantitative Evaluation

Numerical performance measurement based on objective metrics → Objective comparison between models through reproducible and standardized benchmarks

Each method can be used independently or combined complementarily for highly reliable evaluation.


PART 2: Evaluation Features

Below are the specific evaluation features provided by DATUMO Eval. Each feature is optimized for specific evaluation methods and purposes.


1. Judge Evaluation

A method that automatically evaluates the response quality and stability of generative AI using AI Judge models. It is characterized by mimicking human judgment while enabling large-scale automation.

1-1) Evaluation Task

The most versatile evaluation framework that performs evaluation based on service responses to evaluation queries. Applicable to most service types, it consists of the following main evaluation items:

Main Evaluation Areas:

  • Safety Evaluation: Evaluates how safely generative AI services respond to adversarial prompts that induce "unauthorized outputs (biased statements, etc.)".
  • RAG Quality Evaluation: Evaluates the appropriateness of information utilization for responses generated by RAG (Retrieval-Augmented Generation) systems based on document retrieval.

Detailed Features:

  • Evaluation Task Management: Create and execute evaluation processes in Task units
  • Evaluation Set Management: Systematically manage evaluation datasets, with ability to stop/restart evaluation per Set
  • Evaluation Dashboard: Visually check overall evaluation results and comparative analysis between Tasks
  • Detailed Result Analysis: Check detailed performance results by question and item, and derive improvements

1-2) RAGAs Task

An automatic evaluation feature using RAGAs (Retrieval-Augmented Generation Assessment) metrics.

Main Features:

  • Measure retrieval quality and generation quality of RAG systems using standardized metrics
  • Multi-dimensional evaluation of the relationship between retrieved context and generated responses
  • Quick performance assessment through automated numerical evaluation

Evaluation Metrics:

  • Answer Correctness: Evaluates how accurately the generated response matches the Ground Truth. Considers both Factuality and Semantic Similarity.

  • Response Relevancy: Evaluates relevance to the question. Scores decrease if the response is incomplete or contains unnecessary information. Scores range from 0-1, closer to 1 is better.

  • Semantic Similarity: Measures how semantically similar the generated response is to the Ground Truth. A metric that numerically expresses semantic alignment.

  • Context Entity Recall: Calculates recall based on how well entities in the Ground Truth are included in the retrieved Context.

  • LLM Context Precision With GT: Evaluates how highly relevant items rank among the contexts selected by the model. Measures average precision of relevant information.

  • LLM Context Recall: Evaluates how sufficiently the retrieved context includes relevant information based on the given question and Ground Truth.

  • Factual Correctness: A metric that evaluates the factual accuracy of generated responses. Decomposes claims in the response and compares them with reference documents to determine the truth of each claim.

  • Faithfulness: Measures how factually consistent the response is with the retrieved Context. Values closer to 1 indicate more faithful responses to the context.

  • Noise Sensitivity: Measures the frequency at which the system generates incorrect responses when using relevant or irrelevant documents. Lower values indicate more robust (stable) performance against noise.

  • Answer Accuracy: Evaluates how well the model response matches the Ground Truth for a question. Uses "LLM-as-a-judge" evaluation method for scoring.

  • Context Relevance: Evaluates how closely the retrieved Context relates to the user's question. Higher relevance receives better scores.

  • Response Groundedness: Evaluates how well the model's response is grounded in the provided context. Higher scores when each claim in the response can find evidence in the context.

1-3) RAG Checker

Evaluates RAG system responses based on ER (Expected Response) for evaluation queries. Measures whether claims in the ER (which serves as a model answer) are included in the retrieved documents (Context) and RAG system responses to evaluate the performance of both the Generator module and Retriever module of the RAG system separately.

Evaluation Mechanism:

  • Extract and decompose claims in the Expected Response
  • Verify if claims are included in retrieved documents (Context) → Retriever Performance Evaluation
  • Verify if claims are reflected in RAG system responses → Generator Performance Evaluation

1-4) Auto Red-Teaming

A system that automatically generates adversarial prompts using an attack scenario library and verifies the safety and vulnerabilities of AI models.

Main Features:

  • Automatically apply various attack strategies (Jailbreak, Prompt Injection, etc.)
  • Detect model vulnerabilities through repeated adversarial testing
  • Identify subtle risks that are difficult to discover through Safety Evaluation alone

2. Human Evaluation

A method where humans directly review and evaluate AI responses, judging elements that are difficult to measure numerically.

2-1) Manual Evaluation

A feature where evaluators systematically evaluate AI responses based on predefined evaluation criteria (Rubric).

Main Features:

  • Ensure consistency through clear evaluation rubrics
  • Compare results between multiple evaluators and analyze reliability
  • Score by detailed items and write comments
  • Optimized for complex response quality evaluation requiring qualitative judgment

2-2) Interactive Evaluation

An interactive system that evaluates response quality immediately while conversing with AI models in real-time.

Main Features:

  • Check and evaluate responses immediately after query input
  • Simple immediate feedback such as Good/Bad
  • Write Ground Truth (GT) and suggest improvements
  • Useful for rapid prototype testing and exploratory evaluation

3. Quantitative Evaluation

A method that measures model performance through objective and reproducible numerical metrics. Enables objective comparison and benchmarking between models using standardized metrics.

3-1) Harness Task

A system that measures AI model performance using standardized benchmark datasets and compares them through leaderboards.

Supported Datasets:

  • HRM8K: Mathematical reasoning and problem-solving ability evaluation
  • KMMLU: Korean multi-domain knowledge comprehension evaluation (Korean Multi-domain Multi-task Language Understanding)
  • KOBEST: Korean natural language understanding benchmark (Korean Benchmark Suite for Natural Language Understanding)
  • Selected subsets of other global standard benchmarks

Main Features:

  • Support for standard benchmarks widely recognized in academia and industry
  • Large-scale testing through automated evaluation
  • Objective ranking comparison between models through leaderboards
  • Continuous benchmark updates and expansion

3-2) Reference-based Evaluation (NLP Metrics)

An automatic evaluation system that measures similarity by comparing model responses with answers or reference answers (Ground Truth).

Supported Metrics:

  • BLEU: N-gram based similarity measurement, mainly used for machine translation quality evaluation
  • ROUGE: Text overlap measurement, standard metric for summarization quality evaluation
  • METEOR: Machine translation evaluation metric considering semantic similarity
  • TER (Translation Edit Rate): Translation error rate measurement based on edit distance (*closer to 0 is better)
  • BERTScore: Semantic similarity measurement using BERT embeddings

Overall Evaluation Features Summary

Evaluation FeatureCategoryMain PurposeEvaluation Target
Evaluation TaskJudge EvaluationGeneral AI response quality evaluationSafety, RAG Quality
RAGAs TaskJudge EvaluationRAG system automatic evaluationRetrieval quality, Generation quality
RAG CheckerJudge EvaluationClaim-level RAG precision evaluationFactuality, Information accuracy
Auto Red-TeamingJudge EvaluationAutomated security vulnerability verificationSafety, Robustness
Manual EvaluationHuman EvaluationRubric-based systematic human evaluationQuality, Appropriateness, Creativity
Interactive EvaluationHuman EvaluationReal-time conversational immediate evaluationPrototype, Exploratory testing
Harness TaskQuantitative EvaluationStandard benchmark performance measurementKnowledge, Reasoning, Language understanding
Reference-basedQuantitative EvaluationNLP metric-based similarity evaluationTranslation, Summarization, Generation quality