
Model & Agent

Overview

In Datumo Eval, Models and Agents are core components of the evaluation pipeline. It is essential to understand the distinct roles of the Target Model, which is being evaluated, and the Judge Model, which performs the evaluation.


Types of Models

1. Target Model (Model Being Evaluated)

① Role

The AI model that is the subject of evaluation.

② Functions

  • Generates responses to the Queries in the Dataset.
  • Its generated responses are then evaluated by the Judge Model.
  • Various LLM providers are supported (OpenAI, Anthropic, Google, etc.).
  • Custom API endpoints can also be connected (see the sketch below).
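
As a minimal illustration, the snippet below sends one Dataset Query to a Target Model, assuming the OpenAI Python SDK with an OPENAI_API_KEY in the environment. The generate_response helper and model choice are illustrative, not Datumo Eval internals.

```python
# A minimal sketch of querying a Target Model through the OpenAI SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_response(query: str, model: str = "gpt-4") -> str:
    """Send one Dataset Query to the Target Model and return its Response."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return completion.choices[0].message.content

print(generate_response("What is retrieval-augmented generation?"))
```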

2. Judge Model (Evaluation Model)

① Role

The AI model that evaluates responses from the Target Model.

② Functions

  • Evaluates response quality against the defined Metric criteria.
  • Generates a score and evaluation reasoning for each response (see the sketch below).
  • A high-performance model is recommended for consistent evaluation.
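
The sketch below shows one way a Judge Model call can produce a score and reasoning. The JUDGE_PROMPT wording, the 1-5 scale, and the JSON output contract are hypothetical; Datumo Eval's built-in judge prompts may differ.

```python
import json

from openai import OpenAI

client = OpenAI()

# Hypothetical judge prompt; the platform's actual prompts may differ.
JUDGE_PROMPT = """You are an evaluation judge. Score the response against the
metric below on a 1-5 scale and explain your reasoning.
Metric: {metric}
Query: {query}
Response: {response}
Reply with JSON only: {{"score": <int>, "reasoning": "<string>"}}"""

def judge(query: str, response: str, metric: str) -> dict:
    """Ask the Judge Model for a score and evaluation reasoning."""
    completion = client.chat.completions.create(
        model="gpt-4",   # a high-performance judge, per the guidance below
        temperature=0,   # low temperature for consistent scoring
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                metric=metric, query=query, response=response
            ),
        }],
    )
    # Assumes the judge honors the JSON-only instruction.
    return json.loads(completion.choices[0].message.content)
```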

Agent Concept

1. What is an Agent

① Definition

An Agent is a model instance with specific roles and configurations.

② Components

Component        Description
Base Model       The underlying LLM (e.g., GPT-4, Claude)
System Prompt    Prompt defining the model's role and behavior
Temperature      Parameter controlling response creativity/consistency
Max Tokens       Maximum response length limit
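
A minimal sketch of these four components as a configuration object; the Agent class and field names are illustrative, not Datumo Eval's actual schema.

```python
from dataclasses import dataclass

# Illustrative schema only; the platform's Agent definition may differ.
@dataclass
class Agent:
    base_model: str            # underlying LLM, e.g. "gpt-4" or a Claude model
    system_prompt: str         # defines the agent's role and behavior
    temperature: float = 0.7   # creativity vs. consistency trade-off
    max_tokens: int = 1024     # maximum response length
```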

2. Agent Usage Examples

① RAG Agent

Generates responses based on retrieved context.

② Safety Agent

Applies guidelines for safe response generation.

③ Domain Expert Agent

Generates responses specialized in specific domains.
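
Reusing the hypothetical Agent class from the sketch above, the three example agents differ only in configuration, chiefly the System Prompt. All prompt wording here is illustrative.

```python
rag_agent = Agent(
    base_model="gpt-4",
    system_prompt="Answer strictly from the retrieved context. If the "
                  "context is insufficient, say you don't know.",
    temperature=0.2,
)

safety_agent = Agent(
    base_model="gpt-4",
    system_prompt="Follow the safety guidelines: refuse harmful requests "
                  "and avoid unverified claims.",
    temperature=0.0,
)

domain_expert_agent = Agent(
    base_model="gpt-4",
    system_prompt="You are a medical domain expert. Answer with clinical "
                  "precision using standard terminology.",
    temperature=0.3,
)
```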


Judge Model Selection Criteria

1. Recommendations

① Use High-Performance Models

Recent high-performance models such as GPT-4 or Claude 3 are recommended for accurate evaluation.

② Ensure Consistency

Use a low Temperature setting to keep evaluation results consistent across runs.

③ Sufficient Context Window

Select models that can evaluate long responses.

2. Considerations

① Potential Bias

Using the same model as both the Target and the Judge may introduce self-preference bias, where the judge favors responses in its own style.

② Balance Cost and Performance

Weigh per-call cost against evaluation quality; running a large Judge Model over a large Dataset can be expensive.

③ Purpose-Appropriate Selection

Select models appropriate for the evaluation purpose (e.g., multilingual support for multilingual evaluation).


Model Registration and Management

1. API Key Management

① API Key Registration

Register API Keys for each provider in Settings.

② Security

Registered API Keys are stored in encrypted form for security.

③ Team Sharing

Team-level Key sharing is available.

2. Custom Model Connection

① REST API Connection

Supports REST API endpoint connections.

② On-Premises Integration

On-premises model integration is available.

③ Response Format Mapping

Response format mapping can be configured so that responses from a custom endpoint are parsed into the expected schema (see the sketch below).
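
Below is a sketch of calling a custom (e.g., on-premises) model over REST and mapping its response format. The endpoint URL, authentication header, and JSON field names are assumptions about your own service, not a Datumo Eval API.

```python
import requests

# Hypothetical endpoint for an on-premises model; replace with your own.
ENDPOINT = "https://models.internal.example.com/v1/generate"

def query_custom_model(query: str) -> str:
    """Call a custom REST endpoint and map its response format."""
    resp = requests.post(
        ENDPOINT,
        json={"prompt": query, "max_tokens": 1024},   # assumed request schema
        headers={"Authorization": "Bearer <your-token>"},
        timeout=60,
    )
    resp.raise_for_status()
    payload = resp.json()
    # Response format mapping: pull the text out of whatever field your
    # service uses (assumed here to be payload["output"]["text"]).
    return payload["output"]["text"]
```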


Role in Evaluation Flow

1. Evaluation Process

① Complete Flow

Query → Target Model → Response → Judge Model → Score & Reasoning

② Step-by-Step Description

  1. Query Delivery: a Query from the Dataset is sent to the Target Model.
  2. Response Generation: the Target Model generates a Response.
  3. Evaluation Execution: the Judge Model evaluates the Response against the Metric criteria.
  4. Result Production: a score and evaluation reasoning are produced (an end-to-end sketch follows).
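
Putting the steps together, the sketch below runs the full flow over a toy Dataset, reusing the hypothetical generate_response and judge helpers from the earlier sketches.

```python
# Toy Dataset; real Datasets come from the Datumo Eval platform.
dataset = [
    {"query": "Explain what a vector database is."},
    {"query": "Summarize the causes of inflation in one paragraph."},
]

results = []
for row in dataset:
    response = generate_response(row["query"])      # steps 1-2
    verdict = judge(                                # step 3
        query=row["query"],
        response=response,
        metric="Accuracy: is the response factually correct and complete?",
    )
    results.append({**row, "response": response, **verdict})  # step 4

for r in results:
    print(r["score"], "-", r["reasoning"])
```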