Quantitative Evaluation Overview
A method for numerically evaluating the performance of AI models through objective and measurable metrics. It provides consistent and reproducible evaluation results, enabling objective comparisons between models.
1. Harness Task
A system that measures the performance of AI models on standardized benchmark tasks and allows different models to be compared via a leaderboard; a minimal usage sketch appears after the dataset list below.
Provided Datasets and Tasks
- HRM8K: A Korean-English bilingual benchmark that evaluates mathematical problem-solving ability.
- KMMLU: Korean Massive Multitask Language Understanding; measures knowledge and reasoning across a broad range of Korean-specific subjects.
- KOBEST: Korean Balanced Evaluation of Significant Tasks, a suite of Korean natural language understanding tasks.
- Provides selected subsets of other standard benchmark datasets.
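The section does not specify the exact harness interface, so the sketch below uses the open-source EleutherAI lm-evaluation-harness as a stand-in; the platform's own harness may differ. The model name, task names, and settings are illustrative assumptions, and task availability (e.g., HRM8K) varies by harness version.

```python
# Minimal sketch of a harness-style benchmark run (assumes the EleutherAI
# lm-evaluation-harness; the platform's own harness may expose a different API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # hypothetical model choice
    tasks=["kmmlu", "kobest"],                       # task names vary by harness version
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g., accuracy) that a leaderboard could aggregate and rank.
for task, metrics in results["results"].items():
    print(task, metrics)
```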
2. Reference-based (NLP-based Evaluation)
This evaluation method assesses the quality of AI-generated text by comparing it against one or more reference (ground-truth) texts; the sketch after the metric list below shows how such metrics are typically computed.
Key Evaluation Metrics
- BLEU: N-gram-based similarity measurement for evaluating translation quality.
- ROUGE: Text overlap measurement for evaluating summarization quality.
- METEOR: Machine translation evaluation that also accounts for stemming and synonym matches, not just exact n-gram overlap.
- TER: Translation Edit Rate; measures how many edits are needed to turn the output into a reference (lower is better, 0 indicates a perfect match).
- BERTScore: Semantic similarity measurement using BERT embeddings.
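As a rough illustration of how these metrics are computed against references, the sketch below uses the Hugging Face evaluate library; this is an assumption for demonstration purposes, not necessarily the platform's own tooling, and the example sentences are invented.

```python
# Minimal sketch of reference-based metrics using the Hugging Face `evaluate`
# library (an assumption; the platform may compute these with its own tooling).
import evaluate

predictions = ["The cat sat on the mat."]            # model outputs (illustrative)
references  = [["The cat is sitting on the mat."]]   # one or more references per output

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
ter  = evaluate.load("ter").compute(predictions=predictions, references=references)

# ROUGE and METEOR are computed here against a single reference per prediction.
flat_refs = [refs[0] for refs in references]
rouge  = evaluate.load("rouge").compute(predictions=predictions, references=flat_refs)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=flat_refs)

# BERTScore compares contextual embeddings; `lang` selects the underlying model.
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=flat_refs, lang="en"
)

print("BLEU:", bleu["bleu"])
print("TER (lower is better):", ter["score"])
print("ROUGE-L:", rouge["rougeL"])
print("METEOR:", meteor["meteor"])
print("BERTScore F1:", bertscore["f1"][0])
```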