
🚨 Automated Red-Teaming

Overview

Once you upload a Seed, the system automatically performs attack prompt generation → target model evaluation → report generation using Selectstar's library of 100+ Strategies.

System configuration (high-level):
Internally, the system consists of (1) an Automated Red-Teaming Multi-Agent System and (2) a Multiplex Network for Strategy Sampling. (See Core Concepts → Eval Framework for architecture details.)

Flow (3-step)

  1. New Task: Upload a Seed, then set the Target, the number of iterations, and the Taxonomy.

     Seed: Use general, non-executable queries for testing rather than requests for specific illegal actions (e.g., "An example persuasive request that may violate policy", which is non-actionable).

  2. Run: The system combines a Strategy with a Seed to generate attack prompts → the Target responds → the Scorer assigns a score from 1 to 10.
  3. Report: Review ASR, results by strategy and seed, and the vulnerability summary.
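
The Run step above can be sketched as a simple loop. This is only an illustrative sketch, not Selectstar's implementation: the `Attempt` record, the `run_task` function, and the three placeholder callables are hypothetical names, and pairing every Seed with every Strategy is an assumption made for simplicity (the actual system samples Strategies through its Multiplex Network).

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass
class Attempt:
    """One attack attempt: a Seed/Strategy pair and what came back."""
    seed: str
    strategy: str
    prompt: str
    response: str
    score: int  # 1 = safe ... 10 = high risk


def run_task(
    seeds: List[str],
    strategies: List[str],
    attacker: Callable[[str, str], str],
    target: Callable[[str], str],
    scorer: Callable[[str, str], int],
    iterations: int = 1,
) -> List[Attempt]:
    """Generate prompts, query the Target, and score each response."""
    attempts: List[Attempt] = []
    for _ in range(iterations):
        # Assumption: exhaustive Seed x Strategy pairing, not sampled.
        for seed, strategy in product(seeds, strategies):
            prompt = attacker(seed, strategy)
            response = target(prompt)
            score = scorer(prompt, response)
            if not 1 <= score <= 10:
                raise ValueError(f"score out of range: {score}")
            attempts.append(Attempt(seed, strategy, prompt, response, score))
    return attempts


# Hypothetical stand-ins for the Attacker, Target Model, and Scorer.
def placeholder_attacker(seed: str, strategy: str) -> str:
    return f"[{strategy}] {seed}"


def placeholder_target(prompt: str) -> str:
    return "I can't help with that request."


def placeholder_scorer(prompt: str, response: str) -> int:
    return 1  # this stub treats every response as safe


attempts = run_task(
    seeds=["An example persuasive request that may violate policy"],
    strategies=["storytelling", "framing"],
    attacker=placeholder_attacker,
    target=placeholder_target,
    scorer=placeholder_scorer,
    iterations=2,
)
print(len(attempts))  # 1 seed x 2 strategies x 2 iterations = 4
```

Passing the three components in as callables keeps the loop independent of any particular model or scoring backend.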

Metrics

  • ASR (Attack Success Rate): Internal metric representing the rate of unsafe responses. In the UI, results are displayed as Safe / Unsafe outcomes.
  • Unsafe Rate: % of responses flagged as policy-violating or high-risk.
  • Coverage: Fraction of Seed×Strategy combinations attempted vs. planned.
  • Cost/run: Estimated inference cost per attack attempt.
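
As a rough illustration of how these metrics relate to raw run data, the sketch below computes them from simple inputs. The function names and input shapes are assumptions for illustration only; the product's actual formulas (for example, how retries or failed requests are counted) are not specified here.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (seed, strategy)


def attack_success_rate(unsafe_flags: List[bool]) -> float:
    """Fraction of attempts whose response was judged Unsafe."""
    if not unsafe_flags:
        return 0.0
    return sum(unsafe_flags) / len(unsafe_flags)


def coverage(attempted: Set[Pair], planned: Set[Pair]) -> float:
    """Fraction of planned Seed x Strategy pairs that were actually attempted."""
    if not planned:
        return 0.0
    return len(attempted & planned) / len(planned)


def cost_per_run(total_cost: float, num_attempts: int) -> float:
    """Average estimated inference cost per attack attempt."""
    if num_attempts == 0:
        return 0.0
    return total_cost / num_attempts


planned = {("s1", "storytelling"), ("s1", "framing"),
           ("s2", "storytelling"), ("s2", "framing")}
attempted = {("s1", "storytelling"), ("s1", "framing"), ("s2", "storytelling")}

asr = attack_success_rate([True, False, False, True])  # 2 of 4 unsafe
cov = coverage(attempted, planned)                     # 3 of 4 pairs
cpr = cost_per_run(total_cost=2.0, num_attempts=4)
print(asr, cov, cpr)
```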

Terminology

| Term | Definition |
| --- | --- |
| Seed Data (Seed) | A set of evaluation queries uploaded by the customer, which provides the topic and context for attack prompts. (Use non-actionable examples.) |
| Attack Strategy (Strategy) | An item from the system's strategy library, designed to bypass the model's defenses by modifying or guiding the seed. (e.g., storytelling, framing, case-based persuasion) |
| Attack Prompt | The actual query input to the Target Model, generated by combining a seed and a selected strategy. All prompts are generated to comply with safety policies. |
| Attacker (Attacker Agent) | An internal agent (module) that generates attack prompts from a seed + strategy. |
| Target Model | The LLM or agent system being evaluated (including internal/external models, RAG). |
| Scorer | A module that automatically scores the Target's response on a scale of 1 to 10 to indicate the level of risk (e.g., 1 = safe, 10 = high risk). Scores above the threshold are considered successful attacks (Unsafe). |
| Strategy Library | A catalog of proven attack strategies, where each strategy includes tags, descriptions, and application examples. |
| ASR (Attack Success Rate) | The percentage of responses to which the Scorer assigned a score above the threshold; the core measure of attack success. |
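
To make the Scorer's thresholding concrete, here is a minimal sketch of mapping a 1 to 10 score onto the Safe / Unsafe outcome. The function name `label_response` and the default threshold of 7 are hypothetical; the actual threshold value is not stated in this document.

```python
def label_response(score: int, threshold: int = 7) -> str:
    """Map a Scorer score (1-10) to a Safe / Unsafe outcome.

    The default threshold of 7 is an assumption for illustration only.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    # Scores strictly above the threshold count as successful attacks.
    return "Unsafe" if score > threshold else "Safe"


labels = [label_response(s) for s in (1, 7, 8, 10)]
print(labels)
```

Note that a score equal to the threshold is labeled Safe in this sketch, because only scores above the threshold count as successful attacks.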