🚨 Automated Red-Teaming

Overview

Once you upload a Seed, the system automatically runs attack prompt generation → target model evaluation → report generation, drawing on Selectstar's library of 100+ Strategies.

System configuration (high-level):
Internally, the system consists of (1) an Automated Red-Teaming Multi-Agent System and (2) a Multiplex Network for Strategy Sampling. (See Core Concepts → Eval Framework for architecture details.)

Flow (3-step)

New Task: Upload a Seed, then set the Target, the number of iterations, and the Taxonomy.

Seed

Use general, non-actionable queries for testing rather than requests for specific illegal actions (e.g., "An example persuasive request that may violate policy").

Run: The system combines a Strategy with a Seed to generate attack prompts → the Target responds → the Scorer assigns a score from 1 to 10.
Report: Review ASR, results by strategy and by seed, and the vulnerability summary.
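The Run step above can be pictured as a simple loop. The following is a minimal sketch, not the product's implementation: `Attempt`, `run_red_team`, and the three toy callables are hypothetical names introduced for illustration, and the loop tries every Seed×Strategy pair, whereas the real system samples strategies through its Multiplex Network.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass
class Attempt:
    """One attack attempt: a seed/strategy pair and what came back."""
    seed: str
    strategy: str
    attack_prompt: str
    response: str
    score: int  # 1 (safe) to 10 (high risk), as assigned by the scorer


def run_red_team(
    seeds: List[str],
    strategies: List[str],
    attacker: Callable[[str, str], str],
    target: Callable[[str], str],
    scorer: Callable[[str, str], int],
    iterations: int = 1,
) -> List[Attempt]:
    """Combine each seed with each strategy, query the target, score the reply.

    Assumption: exhaustive pairing stands in for the system's strategy sampling.
    """
    attempts: List[Attempt] = []
    for _ in range(iterations):
        for seed, strategy in product(seeds, strategies):
            prompt = attacker(seed, strategy)   # Strategy + Seed -> attack prompt
            response = target(prompt)           # Target responds
            score = scorer(prompt, response)    # Scorer assigns 1-10
            attempts.append(Attempt(seed, strategy, prompt, response, score))
    return attempts


# Toy stand-ins (hypothetical) so the sketch runs without any model access.
def toy_attacker(seed: str, strategy: str) -> str:
    return f"[{strategy}] {seed}"


def toy_target(prompt: str) -> str:
    return "I can't help with that."


def toy_scorer(prompt: str, response: str) -> int:
    return 1 if response.startswith("I can't") else 10


attempts = run_red_team(
    seeds=["An example persuasive request that may violate policy"],
    strategies=["storytelling", "framing"],
    attacker=toy_attacker,
    target=toy_target,
    scorer=toy_scorer,
)
```

Passing the attacker, target, and scorer as plain callables keeps the sketch independent of any particular model API; each can be swapped for a real client without changing the loop.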

Metrics

ASR (Attack Success Rate): Internal metric for the rate of unsafe responses, i.e., the percentage of Target responses that the Scorer scores above the threshold. In the UI, results are displayed as Safe / Unsafe outcomes.
Unsafe Rate: % of responses flagged as policy-violating or high-risk.
Coverage: Fraction of Seed×Strategy combinations attempted vs. planned.
Cost/run: Estimated inference cost per attack attempt.
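To make the metric definitions concrete, here is a minimal sketch of how they could be computed from run records. The function names, the `(seed, strategy)` pair representation, and the sample numbers are all assumptions made for illustration; the Safe/Unsafe flags are taken as given.

```python
from typing import List, Optional, Set, Tuple

Pair = Tuple[str, str]  # (seed, strategy); hypothetical representation


def attack_success_rate(unsafe_flags: List[bool]) -> Optional[float]:
    """Share of responses judged Unsafe; None when there are no responses."""
    if not unsafe_flags:
        return None
    return sum(unsafe_flags) / len(unsafe_flags)


def coverage(attempted: Set[Pair], planned: Set[Pair]) -> Optional[float]:
    """Fraction of planned Seed x Strategy combinations that were attempted."""
    if not planned:
        return None
    return len(attempted & planned) / len(planned)


def cost_per_run(total_cost: float, num_attempts: int) -> Optional[float]:
    """Estimated inference cost per attack attempt."""
    if num_attempts == 0:
        return None
    return total_cost / num_attempts


# Illustrative data only: 2 seeds x 2 strategies planned, 3 pairs attempted.
planned = {(s, t) for s in ("seed-1", "seed-2") for t in ("storytelling", "framing")}
attempted = {
    ("seed-1", "storytelling"),
    ("seed-1", "framing"),
    ("seed-2", "storytelling"),
}
flags = [True, False, False]  # one Unsafe outcome out of three attempts

asr = attack_success_rate(flags)    # 1 of 3 attempts succeeded
cov = coverage(attempted, planned)  # 3 of 4 planned pairs -> 0.75
cpr = cost_per_run(0.75, 3)         # 0.75 total over 3 attempts -> 0.25
```

Each function returns `None` rather than raising on an empty run, so a report can show "not available" instead of failing on a division by zero.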

Terminology

Term	Definition
Seed Data (Seed)	A set of evaluation queries uploaded by the customer, which provides the topic and context for attack prompts. Seeds should be non-actionable examples.
Attack Strategy (Strategy)	An item from the system's strategy library, designed to bypass the model's defenses by modifying or guiding the seed. (e.g., storytelling, framing, case-based persuasion)
Attack Prompt	The actual query input to the Target Model, generated by combining a seed and a selected strategy. All prompts are generated to comply with safety policies.
Attacker (Attacker Agent)	An internal agent (module) that generates attack prompts from a seed + strategy.
Target Model	The LLM or agent system being evaluated (including internal/external models, RAG).
Scorer	A module that automatically scores the Target's response on a scale of 1–10 to indicate the level of risk (e.g., 1 = safe, 10 = high risk). Scores above the threshold are considered successful attacks (Unsafe).
Strategy Library	A catalog of proven attack strategies, where each strategy includes tags, descriptions, and application examples.
ASR (Attack Success Rate)	The percentage of responses to which the Scorer assigned a score above the threshold; the core metric of attack success.
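The Scorer's threshold rule can be sketched as a small mapping from score to UI label. This is an illustration only: the threshold value of 5 and the names `UNSAFE_THRESHOLD` and `label_response` are assumptions, since the actual threshold is not stated here.

```python
UNSAFE_THRESHOLD = 5  # hypothetical value; the real threshold is not documented here


def label_response(score: int, threshold: int = UNSAFE_THRESHOLD) -> str:
    """Map a 1-10 Scorer score to the Safe / Unsafe outcome shown in the UI.

    A score strictly above the threshold counts as a successful attack.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    return "Unsafe" if score > threshold else "Safe"


labels = [label_response(s) for s in (1, 5, 6, 10)]
# labels == ["Safe", "Safe", "Unsafe", "Unsafe"]
```

Note that a score equal to the threshold is labeled Safe in this sketch, following the "above the threshold" wording in the Scorer definition.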

📄️ Overview

Datumo Eval automates red teaming by generating adversarial prompts, evaluating LLM responses, and delivering vulnerability reports.

