Task

Overview

A Task collects the settings that determine how a specific Dataset is evaluated and against which criteria (Metric and Judge). When evaluation criteria are defined in a Task, Eval Sets created from that Task can run evaluations against the same criteria.

A Task defines a single evaluation method in Datumo Eval. In a Task, you specify which Dataset to use and which Metric and Judge settings serve as evaluation criteria. Based on these settings, you can create multiple Eval Sets to compare performance between models or to track trends over time. In this structure, a Task serves as an evaluation scenario that keeps all work for one evaluation purpose in a single place.

Concept and Role of Task

A Task is a higher-level unit in Datumo Eval for operating and managing evaluations. It groups multiple Eval Sets around a specific theme or purpose. Users create a Task to define an evaluation objective or analysis category and, within that Task, create Eval Sets by selecting different combinations of Datasets, Metrics, and evaluation models. A Task is therefore not merely a collection of settings but an evaluation scenario that manages multiple execution results around a specific evaluation purpose.

Task Components

A Task is a higher-level unit under which you create multiple Eval Sets for a specific evaluation purpose and view their aggregated results. On the Task screen, you can check the execution status and evaluation results of existing Eval Sets and, if needed, create new Eval Sets to run additional evaluations.

The internal structure of a Task divides into three views. First, the Dashboard shows the overall evaluation status and a result overview for all Eval Sets in the Task. Second, the Eval Set list lets you view and manage each execution result individually. Third, Table View shows Query-level responses and scores for a selected Eval Set, for granular analysis of an individual execution.

With this structure, a Task groups multiple evaluation executions under a single purpose and lets you work through them in a flow from summary → list → detailed results.
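The summary → list → detailed results flow can be pictured with a small data sketch. This is not Datumo Eval's API: the names `QueryResult`, `EvalSet`, `dashboard_summary`, `eval_set_list`, and `table_view`, along with all field names and the use of a plain average as the summary, are assumptions made only to illustrate the three levels of detail.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class QueryResult:
    # Hypothetical row shape: one query, the model's response, and its score.
    query: str
    response: str
    score: float


@dataclass
class EvalSet:
    # Hypothetical fields; the real product may track different state.
    name: str
    status: str
    results: list[QueryResult] = field(default_factory=list)


def dashboard_summary(eval_sets: list[EvalSet]) -> dict[str, float]:
    """Summary level: one average score per Eval Set that has results."""
    return {
        es.name: mean(r.score for r in es.results)
        for es in eval_sets
        if es.results
    }


def eval_set_list(eval_sets: list[EvalSet]) -> list[tuple[str, str]]:
    """List level: name and execution status of each Eval Set."""
    return [(es.name, es.status) for es in eval_sets]


def table_view(eval_set: EvalSet) -> list[tuple[str, str, float]]:
    """Detail level: query-level responses and scores for one Eval Set."""
    return [(r.query, r.response, r.score) for r in eval_set.results]
```

Each function reads the same underlying Eval Sets at a different granularity, which is the point of the three-view layout: nothing is duplicated between the views, only summarized or expanded.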

Relationship Between Task and Eval Set

A Task is a higher-level unit in Datumo Eval that groups multiple evaluation executions (Eval Sets) under a single purpose. The actual evaluation runs in the Eval Sets created within a Task. When creating an Eval Set, users directly select the Dataset, Metrics, Judge model, and other required settings to complete the evaluation configuration.

Although a Task does not enforce or automatically apply evaluation criteria, collecting multiple Eval Sets under the same purpose makes it possible to compare the impact of model version changes or configuration adjustments.

In this structure, a Task functions as a container that organizes evaluation results, while an Eval Set holds the concrete results of each execution.
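The container relationship can be sketched as a minimal data model. Everything here is hypothetical and not taken from Datumo Eval itself: the class names `EvalSetConfig`, `EvalSet`, and `Task`, their fields, and the `compare` helper are illustrative assumptions. Following the description in this section, the configuration lives on each Eval Set, and the Task only groups Eval Sets so that their results can be set side by side.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalSetConfig:
    # Selected by the user when the Eval Set is created (hypothetical fields).
    dataset: str
    metrics: tuple[str, ...]
    judge_model: str
    target_model: str


@dataclass
class EvalSet:
    # One concrete execution: its own configuration plus its own results.
    name: str
    config: EvalSetConfig
    scores: dict[str, float] = field(default_factory=dict)  # metric -> score


@dataclass
class Task:
    # In this sketch the Task groups Eval Sets under one purpose and
    # does not enforce any configuration on them.
    purpose: str
    eval_sets: list[EvalSet] = field(default_factory=list)

    def add(self, eval_set: EvalSet) -> None:
        self.eval_sets.append(eval_set)

    def compare(self, metric: str) -> dict[str, float]:
        """Score on one metric for each Eval Set that reported it."""
        return {
            es.name: es.scores[metric]
            for es in self.eval_sets
            if metric in es.scores
        }


# Usage: two model versions evaluated under the same purpose.
task = Task(purpose="answer quality")
task.add(EvalSet("v1", EvalSetConfig("qa", ("accuracy",), "judge-x", "model-v1"), {"accuracy": 0.8}))
task.add(EvalSet("v2", EvalSetConfig("qa", ("accuracy",), "judge-x", "model-v2"), {"accuracy": 0.9}))
comparison = task.compare("accuracy")
```

Because each Eval Set carries its own configuration, a comparison is only meaningful across Eval Sets that share the relevant settings. In this sketch, `compare` skips Eval Sets that did not report the requested metric rather than raising an error.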
