Automated Red Teaming
Auto-Redteaming is an automated red team system that automatically generates attack prompts based on seed sentences to evaluate the safety and vulnerability of a model.
It attempts attacks by combining multiple strategies and provides a report by having a Scorer quantitatively evaluate the responses.
① Auto-Redteaming TASK Creation Procedure
Step 1. Enter Page and Create Task
-
Go to the Auto-Redteaming feature page.
This is a pre-registered list screen. Auto-Redteaming is performed immediately after TASK creation. -
If there are no registered TASKs, click + New Task at the top to create a new TASK.
Step 2. Evaluation Settings
- Download the Seed Only template and write the seed sentences (queries) to be used for the evaluation.
- Upload the created seed file.
- Select the Target Model / Agent to be evaluated.
- Set the number of repeated attacks per seed (Max Red Teaming Runs).
- Select the attack strategy classification (Taxonomy).
After entering all fields, click Add Red Teaming Task to create the project.
- Upload File: Upload the evaluation seed file
- Target Model: Select the LLM to be evaluated
- Max Red Teaming Runs: Set the number of repetitions per seed
- Select Taxonomy: Select the strategy classification system
Step 3. Create and Run TASK
The evaluation is performed at the same time as the TASK is created by clicking Add Red Teaming Task.
② Check Results
Step 4. Evaluation in Progress
- The TASK status is displayed as Red Teaming.
- Click View Dataset / Progress to check the progress and real-time logs (sampling/attempt status).
Step 5. Complete Evaluation and Check Report
- After the evaluation is complete, you will automatically receive a statistics-based report on the TASK details screen.
- The report includes a summary of overall performance and key evaluation indicators, allowing you to intuitively grasp the model's defense capabilities as Safe and Unsafe.
Example of major report items:
- Model name, number of repetitions, Safe/Unsafe ratio
- Summary of vulnerabilities by strategy/category
📌 The report is used to analyze the model's safety trends and identify vulnerable areas.
Step 6. Check Detailed Results
- Click on each seed sentence to check the individual evaluation logs (repeated attempts, evaluation scores, strategies used, Target responses, etc.).
- Prioritize analyzing seeds that have been repeatedly judged as Unsafe to prepare response strategies or use them as model improvement points.
📌 You can establish response strategies centered on seed sentences that have been repeatedly judged as Unsafe, or use them to improve performance for each problem type.
