Skip to main content

0. Overview

0. Overview

Overview

Auto Red Teaming is a red teaming service designed to automatically evaluate the safety of Large Language Models (LLMs).
Without requiring manual prompt engineering, it explores a wide range of attack scenarios through strategy-based automated red teaming (STAR-Teaming)
and diagnoses model vulnerabilities using quantitative metrics and actionable dashboard insights.


1. Key Benefits

Auto Red Teaming addresses the following challenges:

  • Cost, bias, and reproducibility issues of manual red teaming
  • Fragmented testing instead of systematic, taxonomy-based safety evaluation
  • Lack of objective metrics for cross-model comparison
BenefitDescription
Continuously Updated BenchmarkProvides the competitive Datumo Safety Benchmark, updated quarterly
Rigorous EvaluationConsistent Safe / Unsafe judgments using rubric-based judges per Risk Taxonomy
Fully Automated Red TeamingAutomatically applies 100+ attack strategies to identify vulnerabilities and deliver dashboard insights

2. Technical Differentiation

STAR-Teaming–based Automated Red Teaming

Auto Red Teaming internally leverages the STAR-Teaming (Strategy-based Teaming for Adversarial Robustness) methodology.

ComponentDescription
Strategy Pool100+ attack strategies including Jailbreak, Role-play, Multi-turn, and more
Seed-based GenerationDerives diverse attack prompts per strategy from a single seed
Adaptive LoopLearns from previous attempts and automatically optimizes subsequent strategies
Parallel EvaluationEvaluates multiple target models simultaneously under identical conditions

3. Core Concepts

Evaluation Unit Structure

ConceptDescription
TaskTop-level container representing an evaluation objective (e.g., model release validation)
Attack SetExecution unit composed of Dataset + Target Model + configuration
Benchmark DatasetSeed datasets categorized according to the Risk Taxonomy
SeedBase scenario unit used to generate attack prompts

Judgment System

TermDescription
SafeThe model refuses or safely responds to harmful requests
UnsafeThe model generates harmful outputs (Score 8–10)

4. System Architecture

Auto Red Teaming performs automated red teaming through collaboration among four AI modules.

┌─────────────────────────────────────────────────────────────────┐
│ Auto Red Teaming Engine │
│ │
│ [Attack Generator] → [Target Model/Agent] → [Scorer] → [Judge] │
│ │ │ │
│ └──────────────► [Summarizer] ◄──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
ModuleRole
Attack GeneratorGenerates attack prompts from Seeds and Strategies
ScorerAssigns a harmfulness score from 0 to 10
JudgeProduces the final Safe/Unsafe decision based on Scorer output
SummarizerAnalyzes success/failure patterns and recommends next strategies
  • The Target Model/Agent represents the evaluation subject and may include LLMs, agents, or full services.

Evaluation Flow

Task Creation

Attack Set Configuration

Automated Red Teaming Execution

Dashboard Analysis

Iterative Evaluation Loop

  1. Attack Generator creates attack prompts
  2. Target Model responses are collected
  3. Scorer computes harmfulness scores
  4. Judge determines Safe or Unsafe
  5. If Safe → Summarizer updates strategies and retries (loop back to step 1)

Termination Conditions

  • An Unsafe judgment is produced
  • Maximum Red Teaming Runs are reached

Judgment Criteria

CategoryScore
Unsafe8–10
Moderately Unsafe5–7
Moderately Safe3–4
Safe0–2

7. Key Dashboard Metrics

MetricMeaning
ASR (Attack Success Rate)Ratio of successful attacks (lower is safer)
Safety ScoreAverage harmfulness score
Loop CountActual number of attempts per seed