Skip to main content

Evaluation Types

Overview

This document explains the main evaluation types for generative AI. Datumo Eval enables you to execute these evaluation types in various ways through Judgment Eval, Quantitative Eval, and Human Eval features.


Evaluation Framework

When evaluating generative AI models, it's essential to comprehensively examine what the model does well, what risks it poses, and how suitable it is for actual service purposes. Generally, generative AI evaluation is divided into three main perspectives: safety, factuality, and response quality. These three categories form a fundamental framework that covers most service scenarios.

First, Safety evaluation is necessary to ensure models don't generate harmful expressions or violate policies. Second, in situations where there is a "correct answer" or "evidence" such as document-based QA or RAG systems, Factuality evaluation is important to measure how accurately the model's answer aligns with actual information. Third, for generation tasks where there is no clear correct answer, such as summarization, rewriting, or conversational dialogue, Response Quality evaluation is required to assess clarity, logic, and tone & manner.

Datumo Eval enables you to perform these evaluation types in various ways through Judgment Eval, Quantitative Eval, and Human Eval features. Using these tools helps you identify individual models' strengths and vulnerabilities in a balanced manner and systematically establish verification standards needed for service operations.


Safety Evaluation

Safety is an evaluation domain that manages models to prevent generating risky responses that violate policies, contain harmful expressions, or include sensitive information. Public chatbots and customer service systems in particular are likely to receive diverse inputs. In such environments, manually designing attack strategies alone is insufficient to identify potential vulnerabilities. In these situations, Auto Red-Teaming can be used to automatically generate large volumes of attack strategies and scenarios, broadly detecting various patterns that would be difficult for humans to design.

Conversely, when you want to quickly verify specific prompts or individual situations, you can use Single Safety Evaluation to examine safety individually without generating large-scale attacks. This method instantly calculates Datumo Safety's basic metrics (harmfulness, policy violations, personal information exposure, etc.), and when needed, Safety Rubric-based Judgment Eval can be used to perform qualitative judgment according to policy standards.

Additionally, even if responses appear safe in automated attacks or single evaluations, unexpected risky responses may occur in actual conversational context with users. To check for such context-based risks, Human Eval allows people to directly converse with the model and perform final verification according to predefined safety criteria (validation rules). This enables comprehensive inspection of risks in actual service situations that automated evaluation might miss.

Key Use Cases:

  • Preventing inappropriate responses in customer service chatbots
  • Blocking harmful content in educational AI
  • Protecting personal information in financial advisory bots
  • Preventing incorrect advice in medical information systems

Factuality Evaluation

Factuality is an evaluation domain that verifies how accurately a model's response aligns with provided documents (Context) or Ground Truth (correct answer data). Factuality evaluation is particularly important for document-based QA services and RAG systems, conceptually checking how faithfully the model reflected provided evidence (Faithfulness), whether retrieved documents were appropriate (Context Relevancy), and whether answers are factually accurate (Correctness).

For such factuality evaluation in Datumo Eval, when you upload query·document·response datasets to an Evaluation Task, quantitative factuality metrics (Faithfulness, Correctness, Relevancy, etc.) are automatically calculated to quickly grasp overall performance. When Ground Truth is provided, RAG Checker can be used to compare expected answers with model responses at the sentence level, clearly identifying evidence mismatches or hallucination points.

When quantitative evaluation alone makes it difficult to identify error causes, Judgment Eval's RAG Rubric can be used for LLM Judge to qualitatively analyze how evidence was utilized in answers, error types, factuality defects, etc. This helps separately identify issues with Retriever and Generator, deriving directions for improving RAG pipelines.

Key Use Cases:

  • Verifying accuracy of enterprise internal document-based QA systems
  • Evaluating answer reliability in legal/regulatory document-based consulting services
  • Confirming information accuracy in product manual-based customer support chatbots
  • Measuring citation accuracy in academic resource retrieval systems

Response Quality Evaluation

While factuality evaluation measures alignment with correct answers, Response Quality evaluation assesses whether model-generated text is clear, logical, and suitable for its intended purpose and domain. This is particularly important for tasks where "there is no correct answer" (summarization, rewriting, counseling responses, etc.), conceptually checking naturalness, completeness, logical structure, tone & style of model-generated sentences.

When evaluating response quality, Judgment Eval can be used to define custom Rubrics suitable for the domain, with LLM Judge qualitatively evaluating response quality according to those criteria. When more rigorous quality verification is needed, Human Eval allows people to directly review model responses to judge final result appropriateness.

For tasks with comparison targets like summarization or transformation, Quantitative Eval can be used to quantify differences between two responses using text similarity-based metrics like BLEU and ROUGE. This combination of qualitative and quantitative evaluation is widely used for prompt optimization, model version comparison, service quality management, etc.

Key Use Cases:

  • Evaluating brand tone & manner consistency in marketing copy generation AI
  • Measuring preservation of core information in technical document summarization systems
  • Evaluating empathy and professionalism in customer response script generation AI
  • Verifying naturalness and meaning preservation in translation systems

Evaluation Type Summary

If you now understand the evaluation types in Datumo Eval (Safety, Factuality, and Response Quality),
proceed to the next section to explore how each evaluation feature is structured and executed.

If you need hands-on practice, refer to the step-by-step tutorials below: