Quick Start
An Evaluation Task is the core evaluation workflow in DATUMO Eval.
This guide walks you through the entire flow from start to finish.
Full Flow
1) Create Dataset → 2) Add Model/Agent → 3) Create Evaluation Task → 4) Create & Run Eval Set → 5) Review Results
- You must be logged in and have a workspace ready.
- To run model calls or LLM-based automatic evaluation, you need to connect a model/API key (see Step 0).
0. Add Target Model/Agent
To generate model outputs during evaluation or to use LLM-based metrics, you first need to register a model or agent.
Go to Management → Model Management to add one.
- If your dataset already contains outputs generated by the target model, registering a single judge model is sufficient.
- You can add or update models anytime later in Management.
👉 More details: /setup-guide/model-management
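The exact registration form depends on your provider, but in practice you gather roughly the information sketched below before heading to Model Management. The field names here are illustrative only, not DATUMO Eval's actual form or API.

```python
# Illustrative only: the kind of information you typically prepare before
# registering a target model and a judge model in Management → Model
# Management. Field names are hypothetical, not DATUMO Eval's schema.
target_model = {
    "name": "my-chat-model",      # label you will see in the evaluation UI
    "provider": "openai",         # e.g. openai, anthropic, azure, self-hosted
    "model_id": "gpt-4o-mini",    # provider-side model identifier
    "api_key": "<YOUR_API_KEY>",  # keep keys in a secret store, not in code
}

judge_model = {
    "name": "llm-judge",
    "provider": "openai",
    "model_id": "gpt-4o",
    "api_key": "<YOUR_API_KEY>",
}
```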
1. Prepare Dataset
A good evaluation starts with a good Dataset.
There are two main ways to prepare one:
- Create directly in the Dataset page (Context, Query, Response).
- If Query/Response is empty, you can auto-generate them later with a model.
- Upload a CSV/XLSX file if you already have data (an example layout is sketched below).
👉 You may also add an Expected Response column to enable reference-based metrics.
👉 For details on dataset structure (Context/Query/Response fields), see the Dataset Guide.
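If you go the upload route, the sketch below builds a minimal CSV with pandas. The lower-cased column names (context, query, response, expected_response) are assumptions based on the fields described above; confirm the exact headers against the Dataset Guide before uploading.

```python
import pandas as pd

# Minimal sketch of a dataset file for upload. The column names mirror the
# Context / Query / Response / Expected Response fields described above;
# confirm the exact headers against the Dataset Guide before uploading.
rows = [
    {
        "context": "Our refund policy allows returns within 30 days of purchase.",
        "query": "How long do I have to return a product?",
        "response": "",  # leave empty to auto-generate with a registered model later
        "expected_response": "You can return a product within 30 days of purchase.",
    },
]

pd.DataFrame(rows).to_csv("eval_dataset.csv", index=False)
```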
2. Create Evaluation Task
Click New Task to create a new Evaluation Task.
It’s recommended to manage each task as a single evaluation unit (e.g., quality check for a specific model, or comparison between two models).
- Within a Task, you can manage results from multiple Response Sets through Evaluation Sets.
- Datasets that already include target model outputs (Responses) can be found under Dataset-Response.
👉 Task creation guide: /tutorials/judgment-eval/eval-task/create-task
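Conceptually, the hierarchy works roughly as sketched below: a Task groups several Evaluation Sets, and each Evaluation Set scores one Response Set against the chosen Metrics. The classes are purely illustrative and are not part of any DATUMO Eval SDK.

```python
from dataclasses import dataclass, field

# Conceptual sketch of how Tasks, Evaluation Sets, and Response Sets relate.
# These classes are illustrative only; they are not a DATUMO Eval API.
@dataclass
class EvaluationSet:
    response_set: str                 # e.g. "model-A outputs"
    metrics: list[str] = field(default_factory=list)

@dataclass
class EvaluationTask:
    name: str
    evaluation_sets: list[EvaluationSet] = field(default_factory=list)

task = EvaluationTask(
    name="model-A vs model-B quality check",
    evaluation_sets=[
        EvaluationSet("model-A outputs", ["Response Relevancy", "Bias"]),
        EvaluationSet("model-B outputs", ["Response Relevancy", "Bias"]),
    ],
)
```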
3. Run Evaluation
Once your Dataset and Task are ready, you can run evaluations.
It’s best to combine automatic (LLM/algorithm-based) and manual (human) evaluations.
1) Automatic Evaluation (LLM/algorithm-based)
- From the Evaluation Task, select multiple Response Sets (model/service outputs) and Metrics, then start the evaluation.
- Choose from built-in Metrics or create custom Metrics. Examples: Bias, harmful content, Response Relevancy.
- The BEIR + Judgment Leaderboard is enabled when both gold_context and retrieved_context are present (illustrated below).
- Once the Task is launched, evaluations run across all selected items in batch.
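For intuition only, the toy example below shows a row carrying both fields and a naive context-recall calculation. This is not how DATUMO Eval computes its leaderboard metrics; it just illustrates why both columns are needed to score retrieval alongside judged answer quality.

```python
# Toy illustration: with both gold_context and retrieved_context present,
# retrieval quality can be scored (here, a naive recall) in addition to the
# LLM-judged answer quality. Not DATUMO Eval's actual leaderboard math.
row = {
    "query": "How long do I have to return a product?",
    "gold_context": ["Returns are accepted within 30 days of purchase."],
    "retrieved_context": [
        "Returns are accepted within 30 days of purchase.",
        "Shipping is free on orders over $50.",
    ],
}

hits = sum(1 for passage in row["gold_context"] if passage in row["retrieved_context"])
recall = hits / len(row["gold_context"])
print(f"context recall: {recall:.2f}")  # 1.00 for this toy row
```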
2) Manual Evaluation (Human Evaluation)
- In Manual Evaluation or Interactive Evaluation, evaluators directly assign scores to model outputs by metric.
- Combining this with automatic scores allows for more comprehensive and nuanced results.
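As a rough sketch of that combination, the snippet below blends a judge score with a human score per item and flags where they disagree. The column names are hypothetical; your exported scores will look different.

```python
import pandas as pd

# Hypothetical per-item scores: "judge_score" from an LLM-based metric and
# "human_score" from manual evaluation. A simple blend plus a disagreement
# column highlights items worth a closer look.
scores = pd.DataFrame({
    "query_id": [1, 2, 3],
    "judge_score": [4.5, 3.0, 2.0],
    "human_score": [4.0, 4.5, 2.0],
})
scores["combined"] = scores[["judge_score", "human_score"]].mean(axis=1)
scores["disagreement"] = (scores["judge_score"] - scores["human_score"]).abs()
print(scores.sort_values("disagreement", ascending=False))
```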
👉 Metric setup: /setup-guide/metric-management
👉 Run automatic evaluation: /tutorials/judgment-eval/eval-task/eval-results
👉 Run manual evaluation: /tutorials/human-eval/
4. Review Results
After evaluation, go to Dashboard / Task Metrics to review results:
- Model/Prompt comparison: Check averages, distributions, and deviations to identify the more reliable option.
- TableView: Inspect query-level scores and responses, quickly identify issues, and refine accordingly.
You can also Export results for reports or team sharing.
👉 Results guide: /tutorials/judgment-eval/eval-task/eval-results
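If you export the results, a rough offline version of the model/prompt comparison looks like the sketch below. The file name and the model / metric / score column names are assumptions about the export format; adjust them to whatever your export actually contains.

```python
import pandas as pd

# Rough offline counterpart of the Dashboard comparison: per-model,
# per-metric mean, standard deviation, and count. The file name and column
# names are assumptions; match them to your actual export.
results = pd.read_csv("exported_results.csv")
summary = (
    results.groupby(["model", "metric"])["score"]
    .agg(["mean", "std", "count"])
    .round(3)
)
print(summary)
```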
5. Next Steps (Optional)
Leverage advanced features for deeper analysis:
- Ragas-based Evaluation: /tutorials/judgment-eval/ragas-task/
- RAG Quality Checker: /tutorials/judgment-eval/rag/rag-checker
- Automated Red Teaming: /tutorials/judgment-eval/auto-redteaming/overview
FAQ
Q1. Why are reference-based metrics disabled?
A. Reference-based metrics require an expected_response column in your dataset. Make sure the column is present in the file you upload (e.g., CSV/XLSX); once it is, the metrics become available.
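If your file is missing the column, a quick way to add it before re-uploading (the file name is just an example):

```python
import pandas as pd

# Add an expected_response column to an existing dataset file so that
# reference-based metrics can be enabled. The file name is an example.
df = pd.read_csv("eval_dataset.csv")
df["expected_response"] = ""  # fill in one reference answer per row
df.to_csv("eval_dataset.csv", index=False)
```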
Q2. Can I run manual evaluation without a model key?
A. Yes. Manual evaluation is possible without a model key. However, response generation and LLM-based automatic evaluation require one (see Step 0).
Q3. How do I adjust scoring criteria (rubrics)?
A. In Metrics, you can edit existing metrics or add custom ones to match your team’s evaluation standards.
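To give a flavor of what a custom rubric can spell out, here is a generic 1-5 faithfulness rubric. It is not a DATUMO Eval template, only an example of the kind of criteria you might adapt in Metrics.

```python
# Generic example of a 1-5 scoring rubric you might adapt for a custom
# metric. This is not a DATUMO Eval template, only an illustration.
FAITHFULNESS_RUBRIC = """
Score the response from 1 to 5 for faithfulness to the provided context.
5 - Every claim is directly supported by the context.
4 - Minor unsupported details that do not change the answer.
3 - A mix of supported and unsupported claims.
2 - Mostly unsupported or speculative content.
1 - Contradicts the context or ignores it entirely.
""".strip()
```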