Quick Start

Evaluation Task

An Evaluation Task is the core evaluation workflow in DATUMO Eval.
This guide walks you through the entire flow from start to finish.

Full Flow

  1) Create Dataset → 2) Add Model/Agent → 3) Create Evaluation Task → 4) Create & Run Eval Set → 5) Review Results
  • You must have a login/workspace ready.
  • To run model calls or LLM-based automatic evaluation, you need to connect a model/API key (see Step 0).

0. Add Target Model/Agent

To generate model outputs during evaluation or to use LLM-based metrics, you first need to register a model or agent.
Go to Management → Model Management to add one.

  • If your dataset already contains responses generated by the target model, registering a single judge model is sufficient.
  • You can add or update models anytime later in Management.

👉 More details: /setup-guide/model-management


1. Prepare Dataset

A good evaluation starts with a good Dataset.
There are two main ways to prepare one:

  1. Create directly in the Dataset page (Context, Query, Response).
    • If Query/Response is empty, you can auto-generate them later with a model.
  2. Upload CSV/XLSX if you already have data.

👉 You may also add an Expected Response column to enable reference-based metrics.
👉 For details on dataset structure (Context/Query/Response fields), see the Dataset Guide.
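If you prefer to prepare the upload file programmatically, the sketch below shows one way to lay out the fields described above. It is only an illustration: the column names follow this guide, the file name is a placeholder, and the exact headers your workspace expects are defined in the Dataset Guide.

```python
# Minimal sketch: build an upload file with the dataset fields described above.
# Column names follow this guide; check the Dataset Guide for the exact headers.
import pandas as pd

rows = [
    {
        "Context": "Our refund policy allows returns within 30 days of purchase.",
        "Query": "Can I return an item after three weeks?",
        "Response": "",  # leave empty to auto-generate later with a registered model
    },
    {
        "Context": "Standard shipping takes 3-5 business days.",
        "Query": "How long does standard shipping take?",
        "Response": "",
    },
]

# Add an expected_response column here if you want reference-based metrics (see FAQ).
pd.DataFrame(rows).to_csv("eval_dataset.csv", index=False)
```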


2. Create Evaluation Task

Click New Task to create a new Evaluation Task.
It’s recommended to manage each task as a single evaluation unit (e.g., quality check for a specific model, or comparison between two models).

  • Within a Task, you can manage results from multiple Response Sets through Evaluation Sets.
  • Datasets that already include target model outputs (Responses) can be found under Dataset-Response.

👉 Task creation guide: /tutorials/judgment-eval/eval-task/create-task


3. Run Evaluation

Once your Dataset and Task are ready, you can run evaluations.
It’s best to combine automatic (LLM/algorithm-based) and manual (human) evaluations.

1) Automatic Evaluation (LLM/algorithm-based)

  • From the Evaluation Task, select multiple Response Sets (model/service outputs) and Metrics, then start the evaluation.
  • The BEIR + Judgment Leaderboard is enabled when both gold_context and retrieved_context are present (a sketch of this pairing follows the list).
  • Once the Task is launched, evaluations run across all selected items in batch.
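As a rough illustration of why both gold_context and retrieved_context are needed, the sketch below computes a simple recall-style hit rate for one row. This is only an assumption about the kind of comparison a BEIR-style leaderboard relies on; the platform's actual retrieval metrics may differ.

```python
# Illustrative only: a recall-style check of retrieved contexts against gold contexts.
# This is an assumption about what a BEIR-style comparison needs, not the platform's exact metric.

def context_recall(gold_contexts: list[str], retrieved_contexts: list[str]) -> float:
    """Fraction of gold context passages that appear among the retrieved passages."""
    if not gold_contexts:
        return 0.0
    retrieved = set(retrieved_contexts)
    hits = sum(1 for g in gold_contexts if g in retrieved)
    return hits / len(gold_contexts)

# Example row from a dataset that has both columns populated
row = {
    "gold_context": ["Refunds are accepted within 30 days."],
    "retrieved_context": [
        "Refunds are accepted within 30 days.",
        "Standard shipping takes 3-5 business days.",
    ],
}
print(context_recall(row["gold_context"], row["retrieved_context"]))  # 1.0
```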

2) Manual Evaluation (Human Evaluation)

  • In Manual Evaluation or Interactive Evaluation, evaluators directly assign scores to model outputs by metric.
  • Combining this with automatic scores allows for more comprehensive and nuanced results.
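One simple way to combine the two score sources is a weighted average per item, as in the sketch below. The 0-5 scale and the weighting are assumptions made for illustration; how you blend automatic and human scores is up to your team.

```python
# Illustrative only: one way to blend automatic (LLM-judge) and human scores.
# The 0-5 scale and the weighting are assumptions, not built-in platform behavior.

def blended_score(auto_score: float, human_score: float, human_weight: float = 0.6) -> float:
    """Weighted average of an automatic score and a human score on the same scale."""
    return human_weight * human_score + (1 - human_weight) * auto_score

print(round(blended_score(auto_score=4.2, human_score=3.5), 2))  # 3.78
```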

👉 Metric setup: /setup-guide/metric-management
👉 Run automatic evaluation: /tutorials/judgment-eval/eval-task/eval-results
👉 Run manual evaluation: /tutorials/human-eval/


4. Review Results

After evaluation, go to Dashboard / Task Metrics to review results:

  • Model/Prompt comparison: Check averages, distributions, and deviations to identify the more reliable option.
  • TableView: Inspect query-level scores and responses, quickly identify issues, and refine accordingly.

You can also Export results for reports or team sharing.
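With an exported file, a short script like the one below can reproduce the averages and deviations mentioned above outside the Dashboard. The file name and the column names (model, metric, score) are assumptions about the export layout; adjust them to match your actual export.

```python
# Minimal sketch: summarize an exported results file per model and metric.
# File and column names ("model", "metric", "score") are assumptions; match them to your export.
import pandas as pd

results = pd.read_csv("exported_results.csv")
summary = (
    results.groupby(["model", "metric"])["score"]
           .agg(["mean", "std", "count"])
           .sort_values("mean", ascending=False)
)
print(summary)
```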

👉 Results guide: /tutorials/judgment-eval/eval-task/eval-results


5. Next Steps (Optional)

Leverage advanced features for deeper analysis.


FAQ

Q1. Why are reference-based metrics disabled?
A. Reference-based metrics are enabled only when the dataset includes an expected_response column, so make sure your CSV/XLSX file contains one.
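If your file is missing the column, you can add it before uploading; the snippet below shows one way, assuming the two-row example file from Step 1. File names and the reference answers are placeholders.

```python
# Minimal sketch: add an expected_response column to an existing dataset file before upload.
# File names and reference answers are placeholders; supply one answer per row, in row order.
import pandas as pd

df = pd.read_csv("eval_dataset.csv")
df["expected_response"] = [
    "Yes, returns are accepted within 30 days of purchase.",
    "Standard shipping takes 3 to 5 business days.",
]
df.to_csv("eval_dataset_with_references.csv", index=False)
```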

Q2. Can I run manual evaluation without a model key?
A. Yes. Manual evaluation is possible without a model key. However, response generation and LLM-based automatic evaluation require one (see Step 0).

Q3. How do I adjust scoring criteria (rubrics)?
A. In Metrics, you can edit existing metrics or add custom ones to match your team’s evaluation standards.
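As a purely illustrative example (not the platform's configuration format), the scoring criteria for a custom metric might look like this before you enter them in Metrics:

```python
# Illustrative only: an example rubric for a hypothetical "helpfulness" metric.
# Enter your actual criteria through the Metrics UI; this dict is just a way to draft them.
helpfulness_rubric = {
    5: "Fully answers the query, grounded in the provided context, no errors.",
    4: "Answers the query with minor omissions or slight imprecision.",
    3: "Partially answers the query or ignores part of the context.",
    2: "Mostly off-target; only fragments of the answer are relevant.",
    1: "Does not answer the query or contradicts the context.",
}
```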


At a Glance
  1) Add Model/API Key → 2) Prepare Dataset → 3) Create Task → 4) Run Auto/Manual Evaluation → 5) Review Results in Dashboard