
Quick Start

Evaluation Task

This guide demonstrates the complete workflow for running a basic evaluation (Task) in DATUMO Eval from start to finish. It walks through model connection, Dataset preparation, Task creation, evaluation execution, and result analysis in five steps.

Complete Flow

  1. Create Dataset → 2) Add Model/Agent → 3) Create Evaluation Task → 4) Create & Run Eval Set → 5) Review Results
  • Login/Workspace must be ready.
  • You need to connect a model/API key for model calls or LLM-based automatic evaluation. (See Step 0 below)

0. Register Model/Agent (Required)

To use model response generation or LLM-based automatic evaluation, you must first register a model or agent. Registration is done in Management → Model Management.

  • To use Query/Response auto-generation or LLM Judge-based evaluation, you need at least one Judge model (API key).
  • If you already have a Dataset generated with a target model, you can start evaluation by registering just one Judge model.

👉 More details: /tutorials/settings/model-management
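
Registration itself happens in the UI, but it can help to see what a Judge model does with the API key you register. The sketch below is an illustration only, assuming an OpenAI-style client, a hypothetical model name, and a hypothetical relevance rubric; it is not how DATUMO Eval calls the Judge internally.

```python
# Illustration only: what an LLM Judge does conceptually with the registered API key.
# The OpenAI client, model name, and rubric below are assumptions for this sketch,
# not DATUMO Eval's internal implementation.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["JUDGE_API_KEY"])  # the key you register in Model Management

JUDGE_PROMPT = """Rate how relevant the response is to the query on a 1-5 scale.
Query: {query}
Response: {response}
Answer with a single integer."""

def judge_relevance(query: str, response: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical Judge model choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, response=response)}],
    )
    return int(completion.choices[0].message.content.strip())

print(judge_relevance("What are your branch hours?", "Branches are open 9am-4pm on weekdays."))
```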


1. Prepare Dataset

Evaluation starts with data. Datasets can be prepared in two ways:

AI Generation (Dataset Page)

If you want to create a new evaluation dataset from scratch, upload reference documents on the Context Dataset page and then generate the dataset step by step.

  • It's okay if Query or Response is empty; you can auto-generate them with a model in later steps.
  • When generating Queries, you can adjust custom parameters to specify the model's role, style, and behavior. (Example: "Respond as a customer support assistant helping with branch visit consultations.")

Generation Flow

  1. Go to the Context page and upload a local file (.csv, .xlsx) containing the reference documents that serve as the generation basis (example: Query generation → select a context set; Response generation → select a query set). A minimal file layout is sketched below.
  2. Select a query generation model
  3. Save the Dataset

👉 More details: /core-concepts/datumo-concepts/dataset
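
If you are preparing the local file yourself, the sketch below shows one minimal way it might look. The column names (context, query, response) and the file name are illustrative assumptions, not a confirmed schema; check the Dataset documentation linked above for the exact format the Context page expects.

```python
# A minimal sketch of a reference-document file to upload on the Context page.
# Column names (context, query, response) are illustrative assumptions, not a
# confirmed schema; Query/Response may be left empty and auto-generated later.
import pandas as pd

rows = [
    {
        "context": "Branch offices are open 9am-4pm on weekdays and closed on public holidays.",
        "query": "",      # empty: can be auto-generated from the context in a later step
        "response": "",   # empty: can be auto-generated from the query in a later step
    },
    {
        "context": "A branch visit reservation can be made up to 14 days in advance via the app.",
        "query": "How far in advance can I book a branch visit?",
        "response": "",
    },
]

pd.DataFrame(rows).to_csv("reference_documents.csv", index=False)  # or .to_excel() for .xlsx
```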


2. Create Evaluation Task

Click New Task to create a new Evaluation Task. It's recommended to manage each Task as a single evaluation unit (e.g., measuring a specific model's quality or comparing two models).

  • In a Task, you can manage results from multiple Response Sets through Evaluation Sets.
  • Datasets containing the target model's Output (Response) can be found in Dataset-Response.

👉 Task creation guide: /tutorials/evaluation/judgment-eval/eval-task/create-task
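
To keep the vocabulary straight, here is a conceptual sketch (not DATUMO Eval's actual data model) of how these pieces relate: one Task groups Evaluation Sets, and each Evaluation Set manages results from one or more Response Sets scored against chosen Metrics.

```python
# Conceptual sketch of the Task / Evaluation Set / Response Set hierarchy.
# This only mirrors the description above; it is not the platform's real data model.
from dataclasses import dataclass, field

@dataclass
class ResponseSet:
    model_name: str                     # the target model/service whose Outputs are evaluated
    responses: list[str] = field(default_factory=list)

@dataclass
class EvaluationSet:
    metrics: list[str]                  # e.g. ["Response Relevancy", "Bias"]
    response_sets: list[ResponseSet] = field(default_factory=list)

@dataclass
class Task:
    name: str                           # one evaluation unit, e.g. "Compare model A vs model B"
    evaluation_sets: list[EvaluationSet] = field(default_factory=list)
```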


3. Run Evaluation (Evaluate)

Once the Dataset and Task are ready, run the evaluation. Using automatic and manual evaluation together is ideal.

  1. Automatic Evaluation (LLM/Algorithm-based): To evaluate multiple results at once, select multiple Response Sets (model/service Outputs) and evaluation criteria (Metrics) in "Evaluation Task" and start the evaluation. Choose from pre-provided Metrics or create custom Metrics.

    • Select the metrics to use in Metrics. (Example: Bias, Illegal, Response Relevancy, etc.)
    • The BEIR+judgment Leaderboard is activated when gold_context and retrieved_context are present (a sketch of such a retrieval metric appears below).
    • Once Task creation is complete, batch evaluation proceeds for all selected items.
  2. Manual Evaluation (Human Evaluation)

    • In Manual Evaluation or Interactive Evaluation, humans directly assign scores.
    • Referencing manual scores alongside automatic evaluation results enables more sophisticated analysis.

👉 Metrics configuration guide: /tutorials/settings/metric-management
👉 Run automatic evaluation: /tutorials/evaluation/judgment-eval/eval-task/eval-results
👉 Run manual evaluation: /tutorials/evaluation/human-eval/
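
As a concrete example of what an algorithm-based metric computes, the sketch below implements a simple context recall over the gold_context and retrieved_context fields mentioned above: the fraction of gold passages that actually appear among the retrieved ones. DATUMO Eval's built-in BEIR/judgment metrics may be defined differently; this only illustrates the kind of computation involved.

```python
# Hedged sketch of one retrieval metric (context recall) over gold_context /
# retrieved_context. The platform's built-in metrics may use different definitions.

def context_recall(gold_context: list[str], retrieved_context: list[str]) -> float:
    """Fraction of gold passages that were actually retrieved."""
    if not gold_context:
        return 0.0
    retrieved = set(retrieved_context)
    hits = sum(1 for passage in gold_context if passage in retrieved)
    return hits / len(gold_context)

row = {
    "gold_context": ["Branches are open 9am-4pm on weekdays."],
    "retrieved_context": [
        "Branches are open 9am-4pm on weekdays.",
        "Reservations can be made via the app.",
    ],
}
print(context_recall(row["gold_context"], row["retrieved_context"]))  # 1.0
```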


4. Interpret Results (Results)

After evaluation, you can check results in Dashboard / Task Metrics.

  • Model/Prompt comparison: Identify performance differences through average scores, distributions, and deviations
  • TableView: Check query-specific scores and responses to quickly identify problem cases

Use Export to download results for reports or shared documents as needed.

👉 Explore results: /tutorials/evaluation/judgment-eval/eval-task/eval-results
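
If you also want to analyze exported results offline, a minimal sketch is shown below. The file name and column names (model, metric, query, score) are assumptions about the export format, not a documented schema; adjust them to match the columns in your actual Export file.

```python
# Minimal sketch of offline analysis on an exported results file.
# File and column names are assumptions; adapt them to your Export output.
import pandas as pd

df = pd.read_csv("task_results_export.csv")

# Average score, standard deviation, and count per model and metric
# (the same comparison the Dashboard shows as averages/distribution/deviation):
summary = df.groupby(["model", "metric"])["score"].agg(["mean", "std", "count"]).round(3)
print(summary)

# Lowest-scoring rows for a quick problem-case scan (similar to TableView):
worst = df.sort_values("score").head(10)[["model", "metric", "query", "score"]]
print(worst)
```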


5. Next Steps (Optional)

You can proceed with additional analysis through various evaluations.


Frequently Asked Questions (FAQ)

Q1. Reference-based metrics are disabled. Why?
A. Check whether your dataset includes reference answers such as Expected Response. These metrics are activated only when a reference exists.

Q2. Is manual evaluation possible without a model key?
A. Yes, it's possible. However, response generation or LLM-based automatic evaluation requires a model key. (See Step 0)

Q3. I want to change scoring criteria (rubrics).
A. Edit metrics/rubrics or add custom metrics in Metrics. You can adjust them to match your team's evaluation criteria.


Summary
  1. Add Model/API Key → 2) Prepare Dataset → 3) Create Task → 4) Run Auto/Manual Evaluation → 5) Review Results in Dashboard