Manual Red Teaming
Manual Red Teaming is a human-in-the-loop evaluation method to test LLM vulnerabilities.
Evaluators craft prompts based on various Offensive Strategies and manually inspect whether the model returns unsafe or problematic responses.
This guide walks you through the full setup process to start Red Teaming.
Follow the steps below in order to quickly set up your evaluation project.
Add Strategy
Create and manage red teaming strategies.
Key Features
- Add and delete strategies
- Enable or disable strategies
- Monitor generation success rate by strategy
- Extract data by strategy and validation outcome
Add / Manage Users
Add and manage evaluators for manual red teaming tasks.
Key Features
- Add or delete users
- Edit user account information
Worker Red Teaming Tasks
Evaluators perform red teaming based on selected strategies and submit their results.
Key Features
- Select a strategy
- View strategy details and attack the model
- Submit when the response is judged as a successful attack
Review Red Teaming Results
Review submitted red teaming attempts based on defined validation criteria.
Key Features
- Add or delete validation rules
- Review and label red teaming results according to validation criteria