Quick Start
This guide walks through the complete workflow for running a basic evaluation (Task) in DATUMO Eval from start to finish. It covers model connection, Dataset preparation, Task creation, evaluation execution, and result analysis in five steps.
Complete Flow
1) Create Dataset → 2) Add Model/Agent → 3) Create Evaluation Task → 4) Create & Run Eval Set → 5) Review Results
- Your login and Workspace must be ready.
- A model/API key connection is required for model calls and LLM-based automatic evaluation. (See Step 0 below.)
0. Register Model/Agent (Required)
To use model response generation or LLM-based automatic evaluation, you must first register a model or agent. Registration is done in Management → Model Management.
- To use Query/Response auto-generation or LLM Judge-based evaluation, you need at least one Judge model (API key).
- If you already have a Dataset generated with a target model, you can start evaluation by registering just one Judge model.
👉 More details: /tutorials/settings/model-management
1. Prepare Dataset
Evaluation starts with data. Datasets can be prepared in two ways:
- Option A: AI Generation
- Option B: Upload Local File
AI Generation (Dataset Page)
If you want to create a new evaluation dataset from scratch, you can generate it step by step after uploading reference documents on the Context Dataset page.
- It's okay if Query or Response is empty. You can auto-generate with a model in later steps.
- When generating Queries, you can adjust custom parameters to specify the model's role, style, and behavior. (Example: "Respond as a customer support assistant helping with branch visit consultations.")
Generation Flow
- Go to the Context page and upload a local file (.csv, .xlsx) containing the reference documents used as the generation basis. Each later generation step builds on the previous set (Query generation → select a context set, Response generation → select a query set). An example context file is sketched after this list.
- Select a query generation model.
- Save the Dataset.
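For reference, here is a minimal sketch of what a context file could look like. The `doc_id` and `context` column names are illustrative assumptions, so match them to whatever format the Context page expects.

```python
# Minimal sketch of a context file for AI generation.
# Column names (doc_id, context) are assumptions for illustration only;
# check the Context page for the exact format your workspace expects.
import csv

rows = [
    {"doc_id": "faq-001", "context": "Branch visits require an appointment booked at least one day in advance."},
    {"doc_id": "faq-002", "context": "Customers can reschedule an appointment up to two hours before the visit."},
]

with open("context_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["doc_id", "context"])
    writer.writeheader()
    writer.writerows(rows)
```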
👉 More details: /core-concepts/datumo-concepts/dataset
Configure Dataset via CSV/XLSX Upload
If you already have operational data or existing evaluation data, you can directly upload CSV/XLSX files through New Dataset → Upload File Directly.
- The Dataset is created automatically according to the file structure (see the example layout below).
- Information about the model used to generate the data may be required, so if that model doesn't exist yet, register it first in Model Management.
👉 Dataset upload guide: /core-concepts/datumo-concepts/dataset
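As a rough illustration, the sketch below writes an evaluation dataset CSV with commonly used fields. The column names (query, response, expected_response, gold_context, retrieved_context) are assumptions for this example, so align them with the structure your upload screen accepts.

```python
# Illustrative evaluation dataset for direct upload.
# Column names are assumptions; align them with the format expected by
# New Dataset → Upload File Directly in your workspace.
import csv

rows = [
    {
        "query": "How do I book a branch visit?",
        # Target model output to be evaluated
        "response": "You can book a visit from the app under Appointments.",
        # Reference answer, required for reference-based metrics
        "expected_response": "Book a visit in the app: Menu → Appointments → New.",
        # Gold and retrieved context, used together for retrieval checks
        "gold_context": "Branch visits are booked in the app under Appointments.",
        "retrieved_context": "Branch visits require an appointment booked in advance.",
    },
]

with open("eval_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```

Per Step 3 of this guide, having both gold_context and retrieved_context in the dataset is what activates the BEIR + judgment Leaderboard.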
2. Create Evaluation Task
Click New Task to create a new Evaluation Task. It's recommended to manage each Task as a single evaluation unit (e.g., measuring a specific model's quality or comparing two models).
- In a Task, you can manage results from multiple Response Sets through Evaluation Sets.
- Datasets containing the target model's Output (Response) can be found in Dataset-Response.
👉 Task creation guide: /tutorials/evaluation/judgment-eval/eval-task/create-task
3. Run Evaluation (Evaluate)
Once the Dataset and Task are ready, run the evaluation. Ideally, use automatic and manual evaluation together.
- Automatic Evaluation (LLM/Algorithm-based): To evaluate many results at once, select multiple Response Sets (model/service Outputs) and evaluation criteria (Metrics) in the Evaluation Task and start the evaluation. Choose from the pre-provided Metrics or create custom Metrics. A minimal illustration of how an LLM Judge turns a rubric into a score follows this list.
  - Select the metrics to use in Metrics. (Example: Bias, illegal, Response Relevancy, etc.)
  - The BEIR + judgment Leaderboard is activated only when both gold_context and retrieved_context are present.
  - Once Task creation is complete, batch evaluation runs for all selected items.
- Manual Evaluation (Human Evaluation)
  - In Manual Evaluation or Interactive Evaluation, humans assign scores directly.
  - Reviewing these alongside automatic evaluation results enables more sophisticated analysis.
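For background only, here is a minimal, platform-independent sketch of how an LLM Judge can turn a rubric into a numeric score. `call_judge_model` is a hypothetical placeholder, not a DATUMO API; the platform handles all of this for you once you select Metrics and a Judge model.

```python
# Conceptual sketch of LLM-Judge scoring (not DATUMO's implementation).
# call_judge_model is a hypothetical placeholder for whatever judge model you use.
import re

RUBRIC = (
    "Rate how relevant the response is to the query on a 1-5 scale.\n"
    "Reply with a single integer only."
)

def call_judge_model(prompt: str) -> str:
    # Placeholder: plug in your own judge model call here.
    raise NotImplementedError

def judge_response_relevancy(query: str, response: str) -> int:
    # Build the rubric prompt, ask the judge, and parse the first 1-5 digit it returns.
    prompt = f"{RUBRIC}\n\nQuery: {query}\nResponse: {response}\nScore:"
    raw = call_judge_model(prompt)
    match = re.search(r"[1-5]", raw)
    if match is None:
        raise ValueError(f"Judge returned no parsable score: {raw!r}")
    return int(match.group())
```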
👉 Metrics configuration guide: /tutorials/settings/metric-management
👉 Run automatic evaluation: /tutorials/evaluation/judgment-eval/eval-task/eval-results
👉 Run manual evaluation: /tutorials/evaluation/human-eval/
4. Interpret Results (Results)
After evaluation, you can check results in Dashboard / Task Metrics.
- Model/Prompt comparison: Identify performance differences through average scores, distributions, and deviations
- TableView: Check per-query scores and responses to quickly identify problem cases
Export results via Export for use in reports or shared documents as needed.
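If you work with exported results outside the platform, a short script can summarize them. The sketch below assumes a hypothetical export file with model, metric, and score columns; check your actual Export file for the real file and column names.

```python
# Summarize an exported results file by model and metric.
# The file name and columns (model, metric, score) are assumptions for illustration.
import pandas as pd

df = pd.read_csv("task_results_export.csv")

summary = (
    df.groupby(["model", "metric"])["score"]
      .agg(["mean", "std", "count"])  # average, spread, and sample size per group
      .round(3)
)
print(summary)
```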
👉 Explore results: /tutorials/evaluation/judgment-eval/eval-task/eval-results
5. Next Steps (Optional)
You can continue with additional analysis through the following evaluation types.
- Ragas-based Evaluation: To evaluate with metrics provided by Ragas 👉 /tutorials/evaluation/judgment-eval/ragas-task/
- RAG Quality Check: To separately check context retrieval accuracy (a minimal overlap sketch follows this list) 👉 /tutorials/evaluation/judgment-eval/rag/rag-checker
- Auto Red Team Testing: To expand safety vulnerability detection scenarios 👉 /tutorials/evaluation/judgment-eval/auto-redteaming/overview
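To make the retrieval-accuracy idea concrete, the sketch below computes a naive token-overlap recall between gold_context and retrieved_context. This is only a conceptual stand-in, not RAG Checker's actual algorithm.

```python
# Naive token-overlap recall between gold and retrieved context
# (a conceptual stand-in, not the platform's RAG Checker algorithm).
def context_recall(gold_context: str, retrieved_context: str) -> float:
    gold_tokens = set(gold_context.lower().split())
    retrieved_tokens = set(retrieved_context.lower().split())
    if not gold_tokens:
        return 0.0
    # Fraction of gold-context tokens that appear in the retrieved context
    return len(gold_tokens & retrieved_tokens) / len(gold_tokens)

print(context_recall(
    "Branch visits are booked in the app under Appointments.",
    "Branch visits require an appointment booked in advance.",
))
```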
Frequently Asked Questions (FAQ)
Q1. Reference-based metrics are disabled. Why? A. Check whether your dataset includes reference answers such as Expected Response; these metrics are activated only when a reference exists.
Q2. Is manual evaluation possible without a model key? A. Yes, it's possible. However, response generation or LLM-based automatic evaluation requires a model key. (See Step 0)
Q3. I want to change scoring criteria (rubrics). A. Edit metrics/rubrics or add custom metrics in Metrics. You can adjust them to match your team's evaluation criteria.
1) Add Model/API Key → 2) Prepare Dataset → 3) Create Task → 4) Run Auto/Manual Evaluation → 5) Review Results in Dashboard