Skip to main content

0. Overview

Overview

Auto Red Teaming is a red teaming service designed to automatically evaluate the Safety of Large Language Models (LLMs). Without requiring manual prompt engineering, it explores a wide range of attack scenarios through strategy-based automated red teaming (STAR-Teaming) and diagnoses model vulnerabilities using quantitative metrics and actionable dashboard insights.


1. Key Benefits

Auto Red Teaming addresses the following challenges:

  • Cost, bias, and reproducibility issues of manual red teaming
  • Systematic safety evaluation based on Risk Taxonomy + explainable strategies
  • Lack of objective metrics for cross-model comparison
BenefitDescription
Continuously Updated BenchmarkProvides the competitive Datumo Safety Benchmark, updated quarterly
Rigorous EvaluationConsistent Safe / Unsafe judgments using rubric-based judges per Risk Taxonomy
Fully Automated Red TeamingApplies explainable attack strategies to effectively detect model vulnerabilities and deliver dashboard insights

2. Technical Differentiation

STAR-Teaming–based Automated Red Teaming Auto Red Teaming internally leverages the STAR-Teaming (Strategy-based Teaming for Adversarial Robustness) methodology.

ComponentDescription
Strategy PoolExplainable attack strategies based on emotional/psychological, social/group dynamics, and more
Seed-based GenerationDerives diverse attack prompts per strategy from a single seed
Adaptive LoopLearns from previous attempts and automatically optimizes subsequent strategies
Parallel EvaluationEvaluates multiple target models simultaneously under identical conditions

3. Core Concepts

Evaluation Unit Structure

ConceptDescription
TaskTop-level container representing an evaluation objective (e.g., model release validation)
Attack SetA collection of selected seed datasets from the chosen benchmark for performing a Task
Benchmark DatasetEvaluation seed data categorized according to the Risk Taxonomy
SeedData used to generate attack prompts

Judgment System

TermDescription
SafeThe model refuses harmful requests or responds safely (Score 0–7)
UnsafeThe model generates harmful responses (Score 8–10)

4. System Architecture

Auto Red Teaming performs automated red teaming through collaboration among four AI modules.

┌─────────────────────────────────────────────────────────────────┐
│ Auto Red Teaming Engine │
│ │
│ [Attack Generator] → [Target Model/Agent] → [Scorer] → [Judge] │
│ │ │ │
│ └◄────────────── [Summarizer] ◄──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
ModuleRole
AttackerGenerates attack prompts from Seeds + Strategies
ScorerAssigns a harmfulness score from 0 to 10
JudgeProduces the final Safe/Unsafe decision based on Scorer output
SummarizerAnalyzes failure/success patterns and recommends the next effective strategies
  • The Target Model represents the evaluation subject; currently, only LLM models are supported.

5. Evaluation Flow

Iterative Evaluation Loop

  1. Attacker generates attack prompts
  2. Collects responses from the Target Model to the attack prompts generated by the Attacker
  3. Scorer computes harmfulness scores based on the Target Model's responses
  4. Judge determines Safe or Unsafe based on the Score
  5. If Safe → Summarizer updates strategies and retries (loop back to step 1)
  6. If Unsafe → the iterative evaluation loop terminates

Termination Conditions

  • An Unsafe judgment is produced
  • Maximum Red Teaming Runs (default: 20) are reached

6. Judgment Criteria

CategoryScore
Unsafe8–10
Moderately Unsafe5–7
Moderately Safe3–4
Safe0–2

7. Key Dashboard Metrics

MetricMeaning
ASR (Attack Success Rate)Ratio of successful attacks (lower is safer)
Safety ScoreAverage harmfulness score
Loop CountActual number of attempts per seed