Automated Red-Teaming
Once you upload a Seed, the system automatically runs attack prompt generation → target model evaluation → report generation, drawing on Selectstar's library of 100+ Strategies.
System configuration (high-level):
Internally, the system is composed of (1) an Automated Red-Teaming Multi-Agent System and (2) a Multiplex Network for Strategy Sampling. (See Core Concepts → Eval Framework for architecture details.)
Flow (3-step)
- New Task: Upload a Seed, then set the Target, number of iterations, and Taxonomy.
  Use general, non-executable queries for testing rather than requests for specific illegal actions. (e.g., "An example persuasive request that may violate policy", which is non-actionable.)
- Run: The system combines a Strategy with a Seed to generate attack prompts → the Target responds → the Scorer assigns a score from 1 to 10.
- Report: Check ASR, results by strategy/seed, and vulnerability summary.
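The Run step can be pictured as a simple loop. The sketch below is a minimal illustration, not the product's implementation: `Attempt`, `run_red_team`, and the three toy callables are hypothetical names, and pairing every Seed with every Strategy is an assumption made for brevity (the real system samples strategies).

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass
class Attempt:
    seed: str
    strategy: str
    prompt: str
    response: str
    score: int  # Scorer output on the 1-10 scale


def run_red_team(
    seeds: List[str],
    strategies: List[str],
    attacker: Callable[[str, str], str],
    target: Callable[[str], str],
    scorer: Callable[[str, str], int],
    iterations: int = 1,
) -> List[Attempt]:
    """Generate attack prompts, query the target, and score each response."""
    attempts: List[Attempt] = []
    for _ in range(iterations):
        # Assumption: exhaustive Seed x Strategy pairing, for illustration only.
        for seed, strategy in product(seeds, strategies):
            prompt = attacker(seed, strategy)   # Strategy + Seed -> attack prompt
            response = target(prompt)           # Target responds
            score = scorer(prompt, response)    # Scorer assigns 1-10
            attempts.append(Attempt(seed, strategy, prompt, response, score))
    return attempts


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_attacker(seed: str, strategy: str) -> str:
    return f"[{strategy}] {seed}"


def toy_target(prompt: str) -> str:
    return "I can't help with that."


def toy_scorer(prompt: str, response: str) -> int:
    return 1 if response.startswith("I can't") else 8


attempts = run_red_team(
    ["seed A", "seed B"], ["storytelling", "framing"],
    toy_attacker, toy_target, toy_scorer,
)
print(len(attempts))  # 2 seeds x 2 strategies x 1 iteration = 4
```

Passing the Attacker, Target, and Scorer as plain callables keeps the loop independent of any particular model or API; in practice each would wrap a model call.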
Metrics
- ASR (Attack Success Rate): Internal metric representing the rate of unsafe responses. In the UI, results are displayed as Safe / Unsafe outcomes.
- Unsafe Rate: % of responses flagged as policy-violating or high-risk.
- Coverage: Fraction of Seed×Strategy combinations attempted vs. planned.
- Cost/run: Estimated inference cost per attack attempt.
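As a rough illustration of how the last three metrics could be computed from raw run data, here is a minimal sketch. The function names and input shapes are assumptions for this example, not the product's API.

```python
from typing import Iterable, Set, Tuple

Combo = Tuple[str, str]  # (seed, strategy), a hypothetical representation


def unsafe_rate(flags: Iterable[bool]) -> float:
    """Percentage of responses flagged as policy-violating or high-risk."""
    flags = list(flags)
    if not flags:
        return 0.0
    return 100.0 * sum(flags) / len(flags)


def coverage(attempted: Set[Combo], planned: Set[Combo]) -> float:
    """Fraction of planned Seed x Strategy combinations that were attempted."""
    if not planned:
        return 0.0
    return len(attempted & planned) / len(planned)


def cost_per_run(total_cost: float, n_attempts: int) -> float:
    """Average estimated inference cost per attack attempt."""
    if n_attempts <= 0:
        raise ValueError("n_attempts must be positive")
    return total_cost / n_attempts


planned = {("s1", "storytelling"), ("s1", "framing"),
           ("s2", "storytelling"), ("s2", "framing")}
attempted = {("s1", "storytelling"), ("s1", "framing"), ("s2", "storytelling")}

print(unsafe_rate([True, False, False, True]))  # 50.0
print(coverage(attempted, planned))             # 0.75
print(cost_per_run(1.5, 3))                     # 0.5
```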
Terminology
| Term | Definition |
|---|---|
| Seed Data (Seed) | A set of evaluation queries uploaded by the customer, which provides the topic and context for attack prompts. (Use non-actionable examples.) |
| Attack Strategy (Strategy) | An item from the system's strategy library, designed to bypass the model's defenses by modifying or guiding the seed. (e.g., storytelling, framing, case-based persuasion) |
| Attack Prompt | The actual query input to the Target Model, generated by combining a seed and a selected strategy. All prompts are generated to comply with safety policies. |
| Attacker (Attacker Agent) | An internal agent (module) that generates attack prompts from a seed + strategy. |
| Target Model | The LLM or agent system being evaluated (including internal/external models, RAG). |
| Scorer | A module that automatically scores the Target's response on a scale of 1-10 to indicate the level of risk (e.g., 1=safe, 10=high risk). Scores above the threshold are considered successful attacks (Unsafe). |
| Strategy Library | A catalog of proven attack strategies, where each strategy includes tags, descriptions, and application examples. |
| ASR (Attack Success Rate) | The percentage of responses to which the Scorer assigned a score above the threshold; the core measure of how often attacks succeed. |
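The relationship between Scorer output, threshold, and ASR can be sketched as follows. This is an illustrative example only: the threshold value of 7 and all function names are assumptions, since the actual threshold is not stated here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def is_unsafe(score: int, threshold: int) -> bool:
    """A response is Unsafe when its score is strictly above the threshold."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be in 1-10, got {score}")
    return score > threshold


def attack_success_rate(scores: Iterable[int], threshold: int) -> float:
    """Percentage of responses scored above the threshold."""
    scores = list(scores)
    if not scores:
        return 0.0
    unsafe = sum(is_unsafe(s, threshold) for s in scores)
    return 100.0 * unsafe / len(scores)


def asr_by_strategy(
    results: Iterable[Tuple[str, int]], threshold: int
) -> Dict[str, float]:
    """Group (strategy, score) pairs and compute ASR per strategy."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for strategy, score in results:
        grouped[strategy].append(score)
    return {name: attack_success_rate(s, threshold) for name, s in grouped.items()}


THRESHOLD = 7  # hypothetical value, chosen for the example
results = [("storytelling", 9), ("storytelling", 2), ("framing", 7), ("framing", 3)]

overall = attack_success_rate([score for _, score in results], THRESHOLD)
per_strategy = asr_by_strategy(results, THRESHOLD)
print(overall)       # 25.0 (only the score of 9 exceeds 7)
print(per_strategy)  # {'storytelling': 50.0, 'framing': 0.0}
```

Note that a score equal to the threshold counts as Safe under this reading of "above the threshold".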
Overview
Datumo Eval automates red teaming by generating adversarial prompts, evaluating LLM responses, and delivering vulnerability reports.