
Quantitative Evaluation Overview

Quantitative evaluation measures AI model performance numerically, using objective and measurable metrics. Because the results are consistent and reproducible, models can be compared on an objective basis.

1. Harness Task

A system that measures AI model performance on standardized benchmark tests and lets different models be compared via a leaderboard.

Provided Datasets and Tasks
  • HRM8K: Evaluates mathematical problem-solving abilities.
  • KMMLU: Korean Massive Multitask Language Understanding; evaluates knowledge across a broad range of Korean-language domains.
  • KOBEST: Korean Balanced Evaluation of Significant Tasks; a benchmark suite for Korean natural language understanding.
  • Provides selected subsets of other standard benchmark datasets.
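
If the Harness Task is backed by a standard evaluation harness such as EleutherAI's lm-evaluation-harness, a benchmark run can be scripted roughly as follows. This is a minimal sketch under that assumption only: the model name, task names, and few-shot setting are illustrative, and the tasks actually available depend on the harness version and the platform's own task registry.

```python
# Minimal sketch: running leaderboard-style benchmarks with lm-evaluation-harness
# (pip install lm-eval). Model and task names below are illustrative assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                            # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/polyglot-ko-1.3b",   # example Korean model (assumption)
    tasks=["kobest_boolq", "kobest_copa"],                 # KoBEST subtasks registered in the harness
    num_fewshot=5,                                         # few-shot prompting, typical for leaderboards
    batch_size=8,
)

# Per-task metrics (e.g. accuracy, F1) are reported under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```

The same pattern applies to other registered benchmarks; only the entries in `tasks` change, which is what makes harness-based results directly comparable across models.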

2. Reference-based (NLP-based Evaluation)

This evaluation method assesses the quality of AI-generated text by comparing it to one or more reference (ground truth) texts.

Key Evaluation Metrics
  • BLEU: N-gram-based similarity measurement for evaluating translation quality.
  • ROUGE: Text overlap measurement for evaluating summarization quality.
  • METEOR: Machine translation evaluation that considers semantic similarity.
  • TER (Translation Edit Rate): Measures the edit distance between a translation and its reference; lower scores (closer to 0) are better.
  • BERTScore: Semantic similarity measurement using contextual BERT embeddings.
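
These metrics can be reproduced locally with widely used open-source packages. The sketch below assumes sacrebleu (BLEU, TER), rouge-score (ROUGE), nltk (METEOR), and bert-score (BERTScore); these library choices are illustrative and are not necessarily what the evaluation service uses internally.

```python
# Minimal sketch of reference-based scoring with common open-source packages.
from nltk.translate.meteor_score import meteor_score   # requires nltk.download("wordnet")
from rouge_score import rouge_scorer
from sacrebleu.metrics import BLEU, TER
from bert_score import score as bertscore

hypotheses = ["the cat sat on the mat"]          # model outputs
references = ["the cat is sitting on the mat"]   # ground-truth texts (one per hypothesis)

# BLEU and TER are corpus-level; sacrebleu expects a list of reference streams.
bleu = BLEU().corpus_score(hypotheses, [references])
ter = TER().corpus_score(hypotheses, [references])    # lower is better (edit rate)

# ROUGE is computed per reference/hypothesis pair.
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge_scores = rouge.score(references[0], hypotheses[0])

# METEOR expects pre-tokenized input.
meteor = meteor_score([references[0].split()], hypotheses[0].split())

# BERTScore compares contextual embeddings; downloads a pretrained model on first use.
P, R, F1 = bertscore(hypotheses, references, lang="en")

print(f"BLEU: {bleu.score:.2f}  TER: {ter.score:.2f}")
print(f"ROUGE-L F1: {rouge_scores['rougeL'].fmeasure:.3f}  METEOR: {meteor:.3f}")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```

Note that BLEU, ROUGE, METEOR, and BERTScore are all "higher is better", while TER follows the opposite convention, so scores should not be averaged across metrics without normalization.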