Manual Red Teaming

Manual Red Teaming is a human-in-the-loop evaluation method to test LLM vulnerabilities.
Evaluators craft prompts based on various Offensive Strategies and manually inspect whether the model returns unsafe or problematic responses.

This guide walks you through the full setup process to start Red Teaming.

Follow the steps below in order to quickly set up your evaluation project.

Add Strategy

Create and manage red teaming strategies.

Key Features

Add and delete strategies
Enable or disable strategies
Monitor generation success rate by strategy
Extract data by strategy and validation outcome

Add / Manage Users

Add and manage evaluators for manual red teaming tasks.

Key Features

Add or delete users
Edit user account information

Worker Red Teaming Tasks

Evaluators perform red teaming based on selected strategies and submit their results.

Key Features

Select a strategy
View strategy details and attack the model
Submit when the response is judged as a successful attack

Review Red Teaming Results

Review submitted red teaming attempts based on defined validation criteria.

Key Features

Add or delete validation rules
Review and label red teaming results according to validation criteria