Metrics

Overview

A Metric is a criterion for quantitatively evaluating the response quality of AI models. Designing Metrics that fit the evaluation purpose is key to model comparison and performance analysis, and Datumo Eval supports reusing Metrics defined in various ways.


Metric Concept

A Metric is a unit that defines both "what to evaluate" and "how to convert that criterion into a score." The Rubric describes the substantive evaluation criteria, while the Method determines how scores are calculated from the Rubric (e.g., Likert, G-Eval). Through this structure, each Metric defines both an evaluation perspective and a calculation method, forming a quantitative evaluation system that enables performance comparison between models.


Metric Components

A Metric defines both the criteria used to evaluate model responses and how scores are calculated from those criteria. A Metric consists of two core elements.

The Rubric describes the evaluation criteria to be measured and the meaning of each score range, defining the perspective from which evaluators should judge responses. The Method determines the rules by which scores are calculated from the Rubric. Datumo Eval provides methods such as the Likert Scale, which uses the assigned score as-is, and G-Eval, which calculates an expected value from scores and their probabilities.
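To make the difference between the two Methods concrete, the following is a minimal sketch in Python. The function names and the normalization choice are illustrative assumptions, not the actual Datumo Eval API.

```python
# Hypothetical sketch of the two scoring Methods described above.

def likert_score(judge_rating: int, scale_max: int = 5) -> float:
    """Likert Scale: use the judge's rating directly (optionally normalized)."""
    return float(judge_rating) / scale_max

def geval_expected_score(score_probs: dict[int, float]) -> float:
    """G-Eval style: expected value over the judge's score distribution.

    score_probs maps each possible rating (e.g. 1-5) to the probability
    the judge model assigns to it (e.g. derived from token probabilities).
    """
    return sum(score * prob for score, prob in score_probs.items())

# Example: a judge that mostly favors 4 but gives some mass to 3 and 5.
print(likert_score(4))                                  # 0.8
print(geval_expected_score({3: 0.2, 4: 0.6, 5: 0.2}))   # 4.0
```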

In Judgment Evaluation, user-defined Rubrics and Methods are applied by the Judge model so that responses are evaluated against consistent criteria. Fixed Metrics, such as those in Auto Red Teaming or RAG Checker, use predefined Rubrics and Methods in the same format, so user-defined and fixed Metrics can be interpreted within a unified structure.


Metric Category Examples

Metrics can be designed in various categories depending on the evaluation purpose, and in Datumo Eval they can be understood along two axes.

Automated evaluation methods use Metrics that Datumo Eval provides as fixed options. For example, Auto Red Teaming applies 12 safety-related Metrics such as Bias and Illegal by default, and RAG Checker uses quantitative Metrics such as F1 Score to measure the concordance between reference documents and responses.
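As a rough illustration of an F1-style concordance score, the sketch below assumes that reference documents and the model response have already been broken into sets of claims; it is not the RAG Checker implementation itself.

```python
# Illustrative-only F1 over claims shared by reference documents and a response.

def f1_score(reference_claims: set[str], response_claims: set[str]) -> float:
    """F1 of claim overlap between reference documents and the response."""
    if not reference_claims or not response_claims:
        return 0.0
    overlap = reference_claims & response_claims
    precision = len(overlap) / len(response_claims)
    recall = len(overlap) / len(reference_claims)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference = {"Paris is the capital of France", "The Louvre is in Paris"}
response = {"Paris is the capital of France", "The Eiffel Tower is in Lyon"}
print(f1_score(reference, response))  # 0.5
```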

In Rubric-based Judgment Evaluation, on the other hand, users define Metrics directly. RAG quality Metrics such as Faithfulness and Groundedness, or general quality Metrics such as Coherence, Fluency, and Helpfulness, are configured by designing Scales and Rubrics to fit the evaluation purpose.
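A user-defined Metric of this kind might be described as in the sketch below. The field names and schema are hypothetical; refer to the Evaluation Metrics document for the actual format.

```python
# Hypothetical user-defined Metric definition for Judgment Evaluation.
faithfulness_metric = {
    "name": "Faithfulness",
    "method": "likert",          # or "g-eval"
    "scale": [1, 5],
    "rubric": (
        "Score 5: every claim in the response is supported by the retrieved documents.\n"
        "Score 3: most claims are supported, with minor unsupported details.\n"
        "Score 1: the response contradicts or is unrelated to the retrieved documents."
    ),
}
```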

In this way, fixed Metrics and user-defined Metrics can be combined to build Metric systems suited to each evaluation type and purpose.

For details on Metric definitions, input/output formats, and quantitative evaluation criteria, refer to the Evaluation Metrics document.


Conceptual Understanding of Metric Aggregation

When using multiple Metrics together, there are several conceptual approaches to interpreting their scores.

Representative Score Aggregation methods include the AND method, where all Metrics must meet their criteria; the Weighted Sum method, where scores are combined according to weights; and the OR method, where meeting any one criterion is sufficient.

Rather than calculation rules that directly produce a final score, these approaches are conceptual frameworks for interpreting evaluation results when multiple Metrics are used together. They are useful when the evaluation purpose is complex or multiple criteria must be considered simultaneously.
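The three approaches can be sketched as follows, assuming each Metric has already produced a normalized score in [0, 1] and a shared pass threshold. The function names and threshold value are illustrative assumptions.

```python
# Conceptual sketch of the three aggregation approaches described above.

def aggregate_and(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """AND: every Metric must meet the criterion."""
    return all(score >= threshold for score in scores.values())

def aggregate_or(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """OR: meeting any single criterion is sufficient."""
    return any(score >= threshold for score in scores.values())

def aggregate_weighted_sum(scores: dict[str, float],
                           weights: dict[str, float]) -> float:
    """Weighted Sum: combine scores according to per-Metric weights."""
    total = sum(weights.values())
    return sum(scores[m] * w for m, w in weights.items()) / total

scores = {"Faithfulness": 0.9, "Fluency": 0.6}
print(aggregate_and(scores))                                                   # False
print(aggregate_or(scores))                                                    # True
print(aggregate_weighted_sum(scores, {"Faithfulness": 0.7, "Fluency": 0.3}))   # 0.81
```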


Considerations When Creating Metrics

It is important to configure Metrics with criteria appropriate to the evaluation purpose. Rubrics should be written clearly so that score ranges do not overlap, and Methods should be chosen to match the evaluation purpose. Keeping evaluation items to the necessary minimum and grouping Metrics with similar purposes helps with interpretation and maintenance.