Model & Agent
In Datumo Eval, Model and Agent are core components of the evaluation pipeline. Understanding the distinct roles of the Target Model (the model being evaluated) and the Judge Model (the model performing the evaluation) is essential when configuring an evaluation.
Types of Models
1. Target Model (Model Being Evaluated)
① Role
The AI model that is the subject of evaluation.
② Functions
- Generates responses to Queries in the Dataset.
- Generated responses are evaluated by the Judge Model.
- Supports various LLM providers (OpenAI, Anthropic, Google, etc.).
- Custom API endpoint connections are supported (see the sketch after this list).
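As a concrete illustration, a Target Model behind an OpenAI-compatible REST endpoint could be called roughly as follows. The endpoint URL, model name, and payload shape below are placeholders, not Datumo Eval internals; this is a minimal sketch of the request/response round trip.

```python
import requests

# Hypothetical endpoint and payload shape; the real contract is whatever
# your Target Model's API exposes (an OpenAI-compatible one is assumed here).
TARGET_ENDPOINT = "https://models.example.com/v1/chat/completions"
API_KEY = "sk-..."  # in practice, registered in Settings rather than hard-coded

def generate_response(query: str) -> str:
    """Send one Dataset Query to the Target Model and return its response text."""
    payload = {
        "model": "my-target-model",
        "messages": [{"role": "user", "content": query}],
    }
    resp = requests.post(
        TARGET_ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    # Assumed OpenAI-compatible response shape.
    return resp.json()["choices"][0]["message"]["content"]
```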
2. Judge Model (Evaluation Model)
① Role
The AI model that evaluates responses from the Target Model.
② Functions
- Evaluates response quality according to defined Metric criteria.
- Generates a score and evaluation reasoning for each response (see the sketch after this list).
- High-performance models are recommended for consistent evaluation.
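To make the score-and-reasoning output concrete, here is a minimal sketch of a judge prompt and a verdict parser. The metric, scale, and JSON shape are illustrative assumptions; in Datumo Eval the actual evaluation criteria come from the Metrics you configure.

```python
import json

# Illustrative judge prompt; metric name, 1-5 scale, and the JSON reply
# format are assumptions for this sketch, not Datumo Eval's prompt format.
JUDGE_PROMPT = """You are an evaluation judge.
Metric: factual accuracy (score 1-5).
Query: {query}
Response: {response}
Reply with JSON: {{"score": <1-5>, "reasoning": "<one paragraph>"}}"""

def build_judge_input(query: str, response: str) -> str:
    """Fill the judge prompt with one Query/Response pair."""
    return JUDGE_PROMPT.format(query=query, response=response)

def parse_verdict(judge_output: str) -> tuple[int, str]:
    """Extract the score and reasoning from the Judge Model's JSON reply."""
    verdict = json.loads(judge_output)
    return int(verdict["score"]), verdict["reasoning"]
```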
Agent Concept
1. What is an Agent
① Definition
An Agent is a configured instance of a model: a base model paired with a specific role and generation settings.
② Components
| Component | Description |
|---|---|
| Base Model | The underlying LLM (e.g., GPT-4, Claude) |
| System Prompt | Prompt defining the model's role and behavior |
| Temperature | Parameter controlling response creativity/consistency |
| Max Tokens | Maximum response length limit |
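These four components map naturally onto a small configuration object. The sketch below is illustrative only; the field names are assumptions, not Datumo Eval's schema.

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    """The four Agent components from the table above (illustrative names)."""
    base_model: str       # underlying LLM, e.g. "gpt-4"
    system_prompt: str    # defines the model's role and behavior
    temperature: float    # lower = more consistent, higher = more creative
    max_tokens: int       # maximum response length

agent = AgentConfig(
    base_model="gpt-4",
    system_prompt="You are a helpful assistant. Answer concisely.",
    temperature=0.7,
    max_tokens=1024,
)
```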
2. Agent Usage Examples
① RAG Agent
Generates responses based on retrieved context.
② Safety Agent
Applies guidelines for safe response generation.
③ Domain Expert Agent
Generates responses specialized in specific domains.
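In practice, these three Agent types often differ mainly in their System Prompt. The wording below is purely illustrative, not shipped defaults:

```python
# Illustrative System Prompts for the three Agent types above.
AGENT_SYSTEM_PROMPTS = {
    "rag": "Answer using only the retrieved context provided with each query; "
           "say so when the context is insufficient.",
    "safety": "Follow the safety guidelines; decline harmful requests politely "
              "and explain why.",
    "domain_expert": "You are a senior medical professional; answer with "
                     "clinical precision and reference standard guidelines.",
}
```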
Judge Model Selection Criteria
1. Recommendations
① Use High-Performance Models
Recent high-capability models such as GPT-4 or Claude 3 are recommended for accurate evaluation.
② Ensure Consistency
Use a low Temperature setting (e.g., 0) so that repeated evaluations of the same response yield consistent scores.
③ Sufficient Context Window
Choose a model whose context window can hold the Query, the full Response, and the Metric criteria together.
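Combining the three recommendations, a Judge Agent might be configured like this, reusing the illustrative AgentConfig sketch from earlier (the model name is an assumption):

```python
judge = AgentConfig(
    base_model="gpt-4-turbo",  # assumed: a recent model with a large context window
    system_prompt="You are a strict, consistent evaluation judge. "
                  "Score each response against the given metric.",
    temperature=0.0,           # low temperature for consistent scoring
    max_tokens=512,            # enough for a score plus short reasoning
)
```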
2. Considerations
① Potential Bias
Using the same model as both Target and Judge may introduce self-preference bias; consider a judge from a different model family.
② Balance Cost and Performance
A stronger Judge Model improves evaluation quality but raises per-evaluation cost; weigh this against your dataset size and budget.
③ Purpose-Appropriate Selection
Select a model suited to the evaluation purpose (e.g., a Judge with strong multilingual ability for multilingual evaluation).
Model Registration and Management
1. API Key Management
① API Key Registration
Register API Keys for each provider in Settings.
② Security
Registered API Keys are stored encrypted.
③ Team Sharing
Team-level Key sharing is available.
2. Custom Model Connection
① REST API Connection
Supports REST API endpoint connections.
② On-Premises Integration
On-premises model integration is available.
③ Response Format Mapping
Response format mapping configuration lets you specify where in your endpoint's response the generated text is located (see the sketch below).
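The idea behind response format mapping can be sketched as a configurable path into the endpoint's JSON reply. The dot-path syntax below is an illustration, not Datumo Eval's actual mapping configuration:

```python
from typing import Any

# Illustrative mapping: a dot-separated path into your endpoint's JSON reply.
RESPONSE_PATH = "result.output.text"

def extract_response(payload: dict[str, Any], path: str = RESPONSE_PATH) -> str:
    """Walk the configured path to pull the response text out of a custom payload."""
    node: Any = payload
    for key in path.split("."):
        node = node[key]
    return node

# e.g. extract_response({"result": {"output": {"text": "Hello"}}}) -> "Hello"
```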
Role in Evaluation Flow
1. Evaluation Process
① Complete Flow
Query → Target Model → Response → Judge Model → Score & Reasoning
② Step-by-Step Description
- Query Delivery: Query from Dataset is delivered to the Target Model.
- Response Generation: Target Model generates a Response.
- Evaluation Execution: Judge Model evaluates based on Metric criteria.
- Result Production: The Judge Model's score and evaluation reasoning are recorded as the result (see the end-to-end sketch below).
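Put together, the whole flow is a short pipeline. This sketch reuses the illustrative helpers from earlier sections (generate_response, build_judge_input, parse_verdict) plus a placeholder call_judge callable; none of these are Datumo Eval APIs:

```python
def evaluate_one(query: str, call_judge) -> dict:
    """Run one Dataset row through Query -> Target -> Judge -> Score & Reasoning."""
    response = generate_response(query)                           # steps 1-2: deliver Query, generate Response
    verdict_raw = call_judge(build_judge_input(query, response))  # step 3: Judge evaluates per Metric
    score, reasoning = parse_verdict(verdict_raw)                 # step 4: produce score and reasoning
    return {"query": query, "response": response,
            "score": score, "reasoning": reasoning}
```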
Related Documents
- Evaluation Task - Model configuration in Task
- Metrics - Evaluation criteria definition
- Model Management Tutorial - API Key registration methods