Evaluation Framework
DATUMO Eval Overview
DATUMO Eval is a comprehensive evaluation platform for systematically verifying the quality and safety of generative AI services. It allows for multi-faceted measurement and improvement of AI model performance through various evaluation methods and specialized functions.
PART 1: Evaluation Methods
DATUMO Eval is largely composed of three evaluation methods:
Judge Evaluation
Automated quality evaluation using an AI Judge. → Secures both human-level judgment and large-scale automation. → Currently supports single-turn based evaluation, with multi-turn evaluation planned for the future.
Human Evaluation
Subjective quality verification using human judgment. → Evaluates elements that are difficult to measure numerically, such as nuance, creativity, and appropriateness.
Quantitative Evaluation
Numerical performance measurement based on objective indicators. → Comparison between models through reproducible and standardized benchmarks.
Each method can be used independently or combined complementarily to enable highly reliable evaluation.
PART 2: Evaluation Features
Below are the specific evaluation features provided by DATUMO Eval. Each feature is optimized for a specific evaluation method and purpose.
1. Judge Evaluation
This method automatically evaluates the response quality and stability of generative AI using an AI Judge model. It is characterized by its ability to mimic human judgment while enabling large-scale automation.
1-1) Evaluation Task
This is the most general evaluation framework that performs evaluation based on the service's response to an evaluation query. It is applicable to most service types and consists of the following main evaluation items:
Main Evaluation Areas:
- Safety Evaluation: Evaluates how safely a generative AI service responds to aggressive prompts intended to induce "unauthorized outputs (e.g., biased remarks)."
- RAG Quality Evaluation: Evaluates the appropriateness of information utilization for responses generated based on document retrieval in a RAG (Retrieval-Augmented Generation) system.
Detailed Features:
- Evaluation Task Management: Create and execute evaluation processes on a task basis.
- Evaluation Set Management: Systematically manage evaluation datasets, with the ability to stop/restart evaluation for each set.
- Evaluation Dashboard: Visually check overall evaluation results and perform comparative analysis between tasks.
- Detailed Result Analysis: Check detailed performance results by question and item to derive improvement points.
1-2) RAGAs Task
This is an automatic evaluation feature that utilizes RAGAs (Retrieval-Augmented Generation Assessment) metrics.
Key Features:
- Measures the retrieval and generation quality of a RAG system with standardized indicators.
- Multi-dimensionally evaluates the relationship between the retrieved context and the generated response.
- Enables quick performance assessment through automated numerical evaluation.
Evaluation Metrics:
-
Answer Correctness: Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.
-
Response Relevancy: Scores the relevancy of the answer according to the given question. Answers with incomplete, redundant or unnecessary information is penalized. Score can range from 0 to 1 with 1 being the best.
-
Semantic Similarity: Scores the semantic similarity of ground truth with generated answer. cross encoder score is used to quantify semantic similarity
-
Context Entity Recall: Calculates recall based on entities present in ground truth and context. Let CN be the set of entities present in context, GN be the set of entities present in the ground truth.
-
LLM Context Precision With GT: Average Precision is a metric that evaluates whether all of the relevant items selected by the model are ranked higher or not.
-
LLM Context Recall: Estimates context recall by estimating TP and FN using annotated answer and retrieved context.
-
Factual Correctness: FactualCorrectness is a metric class that evaluates the factual correctness of responses generated by a language model. It uses claim decomposition and natural language inference (NLI) to verify the claims made in the responses against reference texts.
-
Faithfulness: The Faithfulness metric measures how factually consistent a response is with the retrieved context. It ranges from 0 to 1, with higher scores indicating better consistency.
-
Noise Sensitivity: NoiseSensitivity measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents. The score ranges from 0 to 1, with lower values indicating better performance. Noise sensitivity is computed using the user_input, reference, response, and the retrieved_contexts.
-
Answer Accuracy: Answer Accuracy measures the agreement between a model’s response and a reference ground truth for a given question. This is done via two distinct "LLM-as-a-judge" prompts that each return a rating (0, 2, or 4).
-
Context Relevance: Context Relevance evaluates whether the retrieved_contexts (chunks or passages) are pertinent to the user_input. This is done via two independent "LLM-as-a-judge" prompt calls that each rate the relevance on a scale of 0, 1, or 2.
-
Response Groundedness: Response Groundedness measures how well a response is supported or "grounded" by the retrieved contexts. It assesses whether each claim in the response can be found, either wholly or partially, in the provided contexts.
1-3) RAG Checker
Evaluates the response of a RAG system based on the Expected Response (ER) for an evaluation question. It measures whether the claims within the ER, which serves as a model answer, are included in the retrieved document (Context) and the RAG system's response, thereby evaluating the performance of the RAG system's Generator and Retriever modules respectively.
Evaluation Mechanism:
- Extracts and decomposes claims from the Expected Response.
- Verifies if the corresponding claim is included in the retrieved document (Context) → Retriever performance evaluation.
- Verifies if the corresponding claim is reflected in the RAG system's response → Generator performance evaluation.
1-4) Auto Red-Teaming
A system that automatically generates adversarial prompts using a library of attack scenarios to verify the safety and vulnerabilities of an AI model.
Key Features:
- Automatically applies various attack strategies (Jailbreak, Prompt Injection, etc.).
- Detects model vulnerabilities through repetitive adversarial testing.
- Identifies subtle risks that are difficult to discover with standard Safety Evaluation.
2. Human Evaluation
A method where humans directly review and evaluate AI responses to judge elements that are difficult to measure numerically.
2-1) Manual Evaluation
A feature where evaluators systematically evaluate AI responses based on a predefined rubric.
Key Features:
- Ensures consistency through a clear evaluation rubric.
- Comparison of results and reliability analysis among multiple evaluators.
- Allows for scoring by detailed items and writing comments.
- Optimized for evaluating the quality of complex responses that require qualitative judgment.
2-2) Interactive Evaluation
An interactive system for evaluating response quality in real-time while conversing with an AI model.
Key Features:
- Check and evaluate responses immediately after entering a query.
- Simple instant feedback such as Good/Bad.
- Ability to write Ground Truth (GT) and suggest improvements.
- Useful for rapid prototype testing and exploratory evaluation.
3. Quantitative Evaluation
A method for measuring model performance using objective and reproducible numerical indicators. It enables objective comparison and benchmarking between models using standardized metrics.
3-1) Harness Task
A system that measures the performance of AI models using standardized benchmark datasets and compares them via a leaderboard.
Supported Datasets:
- HRM8K: Evaluates mathematical reasoning and problem-solving abilities.
- KMMLU: Korean Multi-domain Multi-task Language Understanding evaluation.
- KOBEST: Korean Benchmark Suite for Natural Language Understanding.
- Selected subsets of other global standard benchmarks.
Key Features:
- Supports standard benchmarks widely recognized in academia and industry.
- Enables large-scale testing through automated evaluation.
- Objective ranking comparison between models via a leaderboard.
- Continuous benchmark updates and expansion.
3-2) Reference-based Evaluation (NLP Metrics)
An automated evaluation system that measures similarity by comparing model responses with correct or reference answers (Ground Truth).
Supported Metrics:
- BLEU: N-gram based similarity measurement, mainly used for machine translation quality evaluation.
- ROUGE: Text overlap measurement, a standard metric for summarization quality evaluation.
- METEOR: Machine translation evaluation metric that considers semantic similarity.
- TER (Translation Edit Rate): Measures translation error rate based on edit distance (closer to 0 is better).
- BERTScore: Measures semantic similarity using BERT embeddings.
Summary of All Evaluation Features
| Evaluation Feature | Category | Main Purpose | Evaluation Target |
|---|---|---|---|
| Evaluation Task | Judge Evaluation | General AI response quality evaluation | Safety, RAG Quality |
| RAGAs Task | Judge Evaluation | Automated evaluation of RAG systems | Retrieval Quality, Generation Quality |
| RAG Checker | Judge Evaluation | Precise Claim-level RAG evaluation | Factuality, Information Accuracy |
| Auto Red-Teaming | Judge Evaluation | Automated security vulnerability verification | Safety, Robustness |
| Manual Evaluation | Human Evaluation | Systematic human evaluation based on rubrics | Quality, Appropriateness, Creativity |
| Interactive Evaluation | Human Evaluation | Real-time conversational instant evaluation | Prototypes, Exploratory Testing |
| Harness Task | Quantitative Evaluation | Standard benchmark performance measurement | Knowledge, Reasoning, Language Understanding |
| Reference-based | Quantitative Evaluation | NLP metric-based similarity evaluation | Translation, Summarization, Generation Quality |