Evaluation Task
📊 Basic Evaluation Flow
The Evaluation Task is the most basic evaluation workflow in Datumo Eval. It uses a Judge model to compare and evaluate the responses of a Target model, quantifying the Target model's performance against a Dataset.
The entire flow is as follows:
- Create an Evaluation Task
- Create and Run an Eval Set
- Check Evaluation Results
- (Advanced) Manage Task evaluation, edit evaluation results, and check the BEIR Leaderboard view.
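To make the flow concrete, here is a minimal conceptual sketch of a Judge-based evaluation loop. It does not use the Datumo Eval API. Every name in it (`Example`, `toy_target`, `toy_judge`, `run_eval`) is hypothetical, and the exact-match Judge is a placeholder assumption standing in for an LLM judge.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Example:
    """One Dataset row: a question and its reference answer (hypothetical schema)."""
    question: str
    reference: str


def toy_target(question: str) -> str:
    # Hypothetical stand-in for the Target model: returns canned answers.
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Lyon"}
    return canned.get(question, "")


def toy_judge(question: str, response: str, reference: str) -> float:
    # Hypothetical stand-in for the Judge model: 1.0 on a case-insensitive
    # exact match with the reference, otherwise 0.0.
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0


def run_eval(
    dataset: list[Example],
    target: Callable[[str], str],
    judge: Callable[[str, str, str], float],
) -> dict:
    # For each Dataset row: query the Target, then let the Judge score the response.
    rows = []
    for ex in dataset:
        response = target(ex.question)
        score = judge(ex.question, response, ex.reference)
        rows.append({"question": ex.question, "response": response, "score": score})
    # Aggregate the per-row scores into one summary number.
    return {"rows": rows, "mean_score": mean(r["score"] for r in rows)}


dataset = [
    Example("What is 2 + 2?", "4"),
    Example("Capital of France?", "Paris"),
]
result = run_eval(dataset, toy_target, toy_judge)
print(result["mean_score"])
```

The per-row records loosely correspond to what a table view would list, and the aggregated score to what a dashboard would summarize. In a real setup the Judge would typically be an LLM prompted with a scoring rubric.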
📄️ 1. Create an Eval Task
Create a new evaluation Task.
📄️ 2. Run Eval Set
Create an Eval Set, set the conditions, and run the evaluation.
📄️ 3. Check Results
Check the evaluation results with the Dashboard and Table View.
📄️ + BEIR Leaderboard
Perform BEIR benchmark evaluation along with Judge evaluation and check the results on the leaderboard.
📄️ + Eval Task Management
Manage evaluations at the Task level. You can pause or restart an evaluation, or edit its name and description, while the LLM evaluation is in progress.
📄️ + Edit results manually
Manually edit the evaluation results.
📄️ + Batch Scheduling
Enables automatic scheduling of Judgement Evaluation runs.