Skip to main content

Evaluation Concepts & Types


Evaluation Concepts and Types Supported by Datumo Eval

In order to systematically assess the quality and reliability of the response of Generative AI services, Datumo Eval provides the following evaluation frames:

Basic Evaluation

This is the most fundamental response quality assessment framework, and assessments are made based on the service's response to the assessment question (Query). It is applicable to most types of service and consists of the following key assessment items.


Safety Evaluation

Evaluate how securely the Generative AI service responds to aggressive prompts that induce "unauthorized output (such as biased speech).


RAG Checker

It supports the evaluation of Retrieval-augmented Generation (RAG)-based models.
It measures accurate information utilization between retrieved documents (Contexts) and model responses, and is an essential frame for Factuality evaluation.

Provides Advanced information utilization evaluation, including Chunk-level analysis and Claim-level F1 Score.


2. Red Teaming

A frame exploring risks in AI systems from an attacker's perspective.
Unlike the usual Rubric-based evaluation, we validate the vulnerability of our model based on attack scenarios.

  • Manual Red Teaming (Strategic Scenario + Worker Engagement Base)
  • Auto Red Teaming (Automated Attack Prompt Generation and Repeated Validation)

Both are supported and are an essential evaluation method for AI service high risk response.


What is Red Teaming?

Red Teaming is a concept originally used in security domains and is a way of **exploring vulnerabilities in systems from the attacker's perspective.

At Datumo Eval
For AI systems (including LLM based services):

  • Create intentional attack prompts
  • Secret risk exploration
  • ** Reflect response processes after finding vulnerabilities**

It operates in the structure of .

Red Teaming can proactively detect challenges that are difficult to detect (complex biases, context-based attacks, etc.) in a real-world production environment of AI services.


What is Factuality Evaluation?

Factuality assessments measure the ability of a model to produce accurate and reliable responses to a given question.

In particular, for the Retrieval-based model (RAG), it is important to evaluate Retrieval performance, Retrieved Context utilization, and accurate Claim configuration within the response in multiple dimensions.

Datumo provides a fine-grained evaluation indicator** based on **Text Decomposition and validates the model's information accuracy on a Claim-level basis.


Evaluation FrameDescription
Basic EvaluationResponse Quality Assessment
RAG CheckerRetrieval-based Accuracy Assessment
Safety EvaluationRisk Factors (Bias, Toxicity, etc.)
Red TeamingRisk Validation Based on Attack Scenario

Evaluation Category

Datumo Eval supports the following categories based on service objectives and targets:

Each category can be viewed in detail on the Separate page for indicator definitions and application cases.
👉 Evaluation Categories