0. Overview

Overview

Auto Red Teaming is a red teaming service designed to automatically evaluate the safety of Large Language Models (LLMs).
Without requiring manual prompt engineering, it explores a wide range of attack scenarios through strategy-based automated red teaming (STAR-Teaming)
and diagnoses model vulnerabilities using quantitative metrics and actionable dashboard insights.

1. Key Benefits

Auto Red Teaming addresses the following challenges:

Cost, bias, and reproducibility issues of manual red teaming
Fragmented testing instead of systematic, taxonomy-based safety evaluation
Lack of objective metrics for cross-model comparison

Benefit	Description
Continuously Updated Benchmark	Provides the competitive Datumo Safety Benchmark, updated quarterly
Rigorous Evaluation	Consistent Safe / Unsafe judgments using rubric-based judges per Risk Taxonomy
Fully Automated Red Teaming	Automatically applies 100+ attack strategies to identify vulnerabilities and deliver dashboard insights

2. Technical Differentiation

STAR-Teaming–based Automated Red Teaming

Auto Red Teaming internally leverages the STAR-Teaming (Strategy-based Teaming for Adversarial Robustness) methodology.

Component	Description
Strategy Pool	100+ attack strategies including Jailbreak, Role-play, Multi-turn, and more
Seed-based Generation	Derives diverse attack prompts per strategy from a single seed
Adaptive Loop	Learns from previous attempts and automatically optimizes subsequent strategies
Parallel Evaluation	Evaluates multiple target models simultaneously under identical conditions

3. Core Concepts

Evaluation Unit Structure

Concept	Description
Task	Top-level container representing an evaluation objective (e.g., model release validation)
Attack Set	Execution unit composed of Dataset + Target Model + configuration
Benchmark Dataset	Seed datasets categorized according to the Risk Taxonomy
Seed	Base scenario unit used to generate attack prompts

Judgment System

Term	Description
Safe	The model refuses or safely responds to harmful requests
Unsafe	The model generates harmful outputs (Score 8–10)

4. System Architecture

Auto Red Teaming performs automated red teaming through collaboration among four AI modules.

┌─────────────────────────────────────────────────────────────────┐
│                      Auto Red Teaming Engine                    │
│                                                                 │
│  [Attack Generator] → [Target Model/Agent] → [Scorer] → [Judge]      │
│         │                                            │          │
│         └──────────────► [Summarizer] ◄──────────────┘          │
└─────────────────────────────────────────────────────────────────┘

Module	Role
Attack Generator	Generates attack prompts from Seeds and Strategies
Scorer	Assigns a harmfulness score from 0 to 10
Judge	Produces the final Safe/Unsafe decision based on Scorer output
Summarizer	Analyzes success/failure patterns and recommends next strategies

The Target Model/Agent represents the evaluation subject and may include LLMs, agents, or full services.

Evaluation Flow

Task Creation
↓
Attack Set Configuration
↓
Automated Red Teaming Execution
↓
Dashboard Analysis

Iterative Evaluation Loop

Attack Generator creates attack prompts
Target Model responses are collected
Scorer computes harmfulness scores
Judge determines Safe or Unsafe
If Safe → Summarizer updates strategies and retries (loop back to step 1)

Termination Conditions

An Unsafe judgment is produced
Maximum Red Teaming Runs are reached

Judgment Criteria

Category	Score
Unsafe	8–10
Moderately Unsafe	5–7
Moderately Safe	3–4
Safe	0–2

7. Key Dashboard Metrics

Metric	Meaning
ASR (Attack Success Rate)	Ratio of successful attacks (lower is safer)
Safety Score	Average harmfulness score
Loop Count	Actual number of attempts per seed

0. Overview​

1. Key Benefits​

2. Technical Differentiation​

STAR-Teaming–based Automated Red Teaming​

3. Core Concepts​

Evaluation Unit Structure​

Judgment System​

4. System Architecture​

Evaluation Flow​

Iterative Evaluation Loop​

Termination Conditions​

Judgment Criteria​

7. Key Dashboard Metrics​