0. Overview
0. Overview
Overview
Auto Red Teaming is a red teaming service designed to automatically evaluate the safety of Large Language Models (LLMs).
Without requiring manual prompt engineering, it explores a wide range of attack scenarios through strategy-based automated red teaming (STAR-Teaming)
and diagnoses model vulnerabilities using quantitative metrics and actionable dashboard insights.
1. Key Benefits
Auto Red Teaming addresses the following challenges:
- Cost, bias, and reproducibility issues of manual red teaming
- Fragmented testing instead of systematic, taxonomy-based safety evaluation
- Lack of objective metrics for cross-model comparison
| Benefit | Description |
|---|---|
| Continuously Updated Benchmark | Provides the competitive Datumo Safety Benchmark, updated quarterly |
| Rigorous Evaluation | Consistent Safe / Unsafe judgments using rubric-based judges per Risk Taxonomy |
| Fully Automated Red Teaming | Automatically applies 100+ attack strategies to identify vulnerabilities and deliver dashboard insights |
2. Technical Differentiation
STAR-Teaming–based Automated Red Teaming
Auto Red Teaming internally leverages the STAR-Teaming (Strategy-based Teaming for Adversarial Robustness) methodology.
| Component | Description |
|---|---|
| Strategy Pool | 100+ attack strategies including Jailbreak, Role-play, Multi-turn, and more |
| Seed-based Generation | Derives diverse attack prompts per strategy from a single seed |
| Adaptive Loop | Learns from previous attempts and automatically optimizes subsequent strategies |
| Parallel Evaluation | Evaluates multiple target models simultaneously under identical conditions |
3. Core Concepts
Evaluation Unit Structure
| Concept | Description |
|---|---|
| Task | Top-level container representing an evaluation objective (e.g., model release validation) |
| Attack Set | Execution unit composed of Dataset + Target Model + configuration |
| Benchmark Dataset | Seed datasets categorized according to the Risk Taxonomy |
| Seed | Base scenario unit used to generate attack prompts |
Judgment System
| Term | Description |
|---|---|
| Safe | The model refuses or safely responds to harmful requests |
| Unsafe | The model generates harmful outputs (Score 8–10) |
4. System Architecture
Auto Red Teaming performs automated red teaming through collaboration among four AI modules.
┌─────────────────────────────────────────────────────────────────┐
│ Auto Red Teaming Engine │
│ │
│ [Attack Generator] → [Target Model/Agent] → [Scorer] → [Judge] │
│ │ │ │
│ └──────────────► [Summarizer] ◄──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
| Module | Role |
|---|---|
| Attack Generator | Generates attack prompts from Seeds and Strategies |
| Scorer | Assigns a harmfulness score from 0 to 10 |
| Judge | Produces the final Safe/Unsafe decision based on Scorer output |
| Summarizer | Analyzes success/failure patterns and recommends next strategies |
- The Target Model/Agent represents the evaluation subject and may include LLMs, agents, or full services.
Evaluation Flow
Task Creation
↓
Attack Set Configuration
↓
Automated Red Teaming Execution
↓
Dashboard Analysis
Iterative Evaluation Loop
- Attack Generator creates attack prompts
- Target Model responses are collected
- Scorer computes harmfulness scores
- Judge determines Safe or Unsafe
- If Safe → Summarizer updates strategies and retries (loop back to step 1)
Termination Conditions
- An Unsafe judgment is produced
- Maximum Red Teaming Runs are reached
Judgment Criteria
| Category | Score |
|---|---|
| Unsafe | 8–10 |
| Moderately Unsafe | 5–7 |
| Moderately Safe | 3–4 |
| Safe | 0–2 |
7. Key Dashboard Metrics
| Metric | Meaning |
|---|---|
| ASR (Attack Success Rate) | Ratio of successful attacks (lower is safer) |
| Safety Score | Average harmfulness score |
| Loop Count | Actual number of attempts per seed |