
Evaluation Features

Overview

This document is a conceptual guide to the Judge Evaluation, Human Evaluation, and Quantitative Evaluation features.


Evaluation Modes

Datumo Eval offers three main evaluation modes. Each mode can be used on its own or combined with the others to verify model quality from multiple perspectives.

1. Judge Evaluation

An AI Judge model automatically analyzes Query–Response pairs to assess quality. It enables large-scale automation and is designed to emulate human-level judgment as closely as possible.

  • Evaluation Task is Datumo Eval's primary judge evaluation feature. It adopts the LLM-as-a-Judge approach, in which one LLM evaluates the responses generated by another. Since LLM evaluation criteria are not yet standardized across the industry, users can directly select a Judge Model that aligns with their own evaluation criteria.

    Evaluating multiple Target Models with the same Judge Model allows fair comparison, while running the same evaluation with different Judge Models makes it possible to analyze bias introduced by differing criteria. Evaluation is conducted on a Task basis, and comprehensive model performance can be verified through Evaluation Set management and dashboard comparisons. A minimal sketch of this judge-scoring flow is given after this list.

  • RAGAs Task is an automated evaluation feature for RAG systems, measuring the relationship between retrieved context and generated responses across various metrics.

  • RAG Checker performs Claim-level fact verification based on Expected Responses (ground truth). It separately measures whether each Claim appears in the retrieved context and whether it is reproduced in the model response, enabling precise analysis of information omission and distortion in RAG systems (see the claim-checking sketch after this list).

  • Auto Red-Teaming automatically generates varied attack prompts to iteratively probe models for potential safety violations. It can surface vulnerabilities that are difficult to reveal through standard Safety evaluations alone; a loop of this kind is sketched after this list.
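
To make the LLM-as-a-Judge flow concrete, here is a minimal sketch in Python. It is not Datumo Eval's implementation: the `call_llm` helper, the prompt wording, and the 1-5 scale are assumptions chosen for illustration.

```python
# Minimal LLM-as-a-Judge sketch (illustrative only).
# `call_llm(model, prompt)` is a hypothetical helper that returns the model's
# text completion; swap in whatever client your stack provides.
import re
from typing import Callable

JUDGE_PROMPT = """You are an impartial judge.
Rate the response to the query on a 1-5 scale for the criterion: {criterion}.
Query: {query}
Response: {response}
Answer with only the integer score."""

def judge_pair(call_llm: Callable[[str, str], str],
               judge_model: str,
               query: str,
               response: str,
               criterion: str = "helpfulness") -> int:
    """Ask the Judge Model to score one Query-Response pair."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, query=query, response=response)
    raw = call_llm(judge_model, prompt)
    match = re.search(r"[1-5]", raw)  # tolerate extra words around the score
    if not match:
        raise ValueError(f"Unparseable judge output: {raw!r}")
    return int(match.group())

def compare_targets(call_llm, judge_model, queries, target_outputs):
    """Score every Target Model on the same queries with the same Judge Model,
    which is what makes the comparison fair."""
    scores = {}
    for target, responses in target_outputs.items():
        per_item = [judge_pair(call_llm, judge_model, q, r)
                    for q, r in zip(queries, responses)]
        scores[target] = sum(per_item) / len(per_item)
    return scores
```

Running `compare_targets` a second time with a different `judge_model` corresponds to the Judge-Model bias analysis described above.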
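
The Claim-level check in RAG Checker can be pictured as the following sketch. The sentence-based claim splitting and the token-overlap `entails` check are deliberately simplified stand-ins (a real checker would typically use an NLI or LLM-based entailment model); only the bookkeeping of "in context" versus "in response" reflects the idea described above.

```python
# Sketch of Claim-level verification against ground truth (illustrative only).
from dataclasses import dataclass

@dataclass
class ClaimResult:
    claim: str
    in_context: bool    # supported by the retrieved context?
    in_response: bool   # reproduced in the model response?

def entails(text: str, claim: str) -> bool:
    """Toy support check via token overlap; a real checker would use an
    NLI or LLM-based entailment model instead."""
    claim_tokens = set(claim.lower().split())
    return len(claim_tokens & set(text.lower().split())) / len(claim_tokens) >= 0.8

def check_claims(expected_response: str, context: str, model_response: str):
    # Naive claim extraction: one claim per sentence of the Expected Response.
    claims = [s.strip() for s in expected_response.split(".") if s.strip()]
    results = [ClaimResult(c, entails(context, c), entails(model_response, c))
               for c in claims]
    # Omission: the context contained the claim but the response dropped it.
    omitted = [r for r in results if r.in_context and not r.in_response]
    # Potential distortion: the response states the claim without context support.
    unsupported = [r for r in results if r.in_response and not r.in_context]
    return results, omitted, unsupported
```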
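
Auto Red-Teaming is, at its core, a generate-attack-and-check loop. The sketch below assumes two hypothetical components, an `attack_generator` and a `safety_classifier`; neither name comes from Datumo Eval.

```python
# Iterative red-teaming loop (illustrative only).
def red_team(target_model, attack_generator, safety_classifier,
             rounds: int = 5, prompts_per_round: int = 20):
    """Collect prompts that elicit unsafe responses from the target model."""
    violations = []
    seeds = []  # successful attacks seed the next round of generation
    for _ in range(rounds):
        attacks = attack_generator(seeds, n=prompts_per_round)
        for prompt in attacks:
            response = target_model(prompt)
            if not safety_classifier(prompt, response):  # unsafe output detected
                violations.append((prompt, response))
                seeds.append(prompt)
    return violations
```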

2. Human Evaluation

Human evaluation has people judge response quality directly. It is used to assess nuance, creativity, and expressive quality that are difficult to automate.

  • In Manual Evaluation, evaluators review responses item by item against predefined rubrics and assign a score for each detailed criterion (a simple record-and-aggregate sketch follows this list).

  • Interactive Evaluation lets evaluators converse directly with the model while immediately leaving Good/Bad feedback or drafting Ground Truth.
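
A rubric-based manual evaluation ultimately produces per-criterion scores that need to be aggregated. The structure below is only one way to record that data; the field names, the example criteria, and the 1-5 scale are assumptions, not Datumo Eval's schema.

```python
# Illustrative record structure for rubric-based manual evaluation.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class RubricScore:
    response_id: str
    evaluator: str
    criterion: str   # e.g. "accuracy", "fluency" (example criteria, not a fixed set)
    score: int       # assumed 1-5 scale

def aggregate_by_criterion(scores: list[RubricScore]) -> dict[str, float]:
    """Average each criterion across responses and evaluators."""
    totals, counts = defaultdict(float), defaultdict(int)
    for s in scores:
        totals[s.criterion] += s.score
        counts[s.criterion] += 1
    return {c: totals[c] / counts[c] for c in totals}
```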

3. Quantitative Evaluation

Quantitative evaluation objectively compares and analyzes model performance using standard benchmark datasets and NLP-based metrics.

  • Harness Task evaluates models' knowledge, reasoning, and problem-solving abilities based on standard benchmarks such as HRM8K, KMMLU, and KOBEST. Results are provided in leaderboard format for objective comparison.

  • Reference-based Evaluation quantitatively measures similarity by comparing model responses to reference answers (Ground Truth). It evaluates translation, summarization, and document generation quality using NLP metrics such as BLEU, ROUGE, METEOR, TER, and BERTScore; a small example using open-source metric libraries follows this list.
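
As a concrete example of reference-based scoring, the sketch below computes BLEU and ROUGE-L with common open-source packages (`nltk` and `rouge-score`). It illustrates the general technique rather than Datumo Eval's internal implementation, and it omits METEOR, TER, and BERTScore for brevity.

```python
# Reference-based similarity metrics with open-source libraries (illustrative).
# Requires: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def score_against_reference(reference: str, candidate: str) -> dict:
    """Compare one model response to its Ground Truth reference."""
    # BLEU expects tokenized text; smoothing avoids zero scores on short outputs.
    bleu = sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=SmoothingFunction().method1)
    # ROUGE-L measures longest-common-subsequence overlap.
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, candidate)["rougeL"].fmeasure
    return {"bleu": bleu, "rougeL_f1": rouge_l}

# Example:
# score_against_reference("The cat sat on the mat.", "A cat is sitting on the mat.")
```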


Evaluation Features Summary

All evaluation features in Datumo Eval fall into one of three modes: Judge Evaluation, Human Evaluation, and Quantitative Evaluation. Usage instructions for each feature can be found in the Tutorials.