Quantitative Evaluation Overview
A method for numerically evaluating the performance of AI models through objective and measurable metrics. It provides consistent and reproducible evaluation results, enabling objective comparisons between models.
1. Harness Task
A system that measures the performance of AI models on standardized benchmark tasks and allows different models to be compared via a leaderboard; a minimal usage sketch appears after the dataset list below.
Provided Datasets and Tasks
- HRM8K: A Korean-English bilingual benchmark that evaluates mathematical problem-solving ability.
- KMMLU: Korean Massive Multitask Language Understanding; measures knowledge and reasoning across a broad range of Korean-specific subjects.
- KOBEST: Korean Balanced Evaluation of Significant Tasks, a suite of Korean natural language understanding tasks.
- Provides selected subsets of other standard benchmark datasets.
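The section does not specify the exact harness interface, so the sketch below uses the open-source EleutherAI lm-evaluation-harness as a stand-in; the platform's own harness may differ. The model name, task names, and settings are illustrative assumptions, and task availability (e.g., HRM8K) varies by harness version.

```python
# Minimal sketch of a harness-style benchmark run (assumes the EleutherAI
# lm-evaluation-harness; the platform's own harness may expose a different API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # hypothetical model choice
    tasks=["kmmlu", "kobest"],                       # task names vary by harness version
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g., accuracy) that a leaderboard could aggregate and rank.
for task, metrics in results["results"].items():
    print(task, metrics)
```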
2. Reference-based (NLP-based Evaluation)
This evaluation method assesses the quality of AI-generated text by comparing it against one or more reference (ground-truth) texts; the sketch after the metric list below shows how such metrics are typically computed.
Key Evaluation Metrics
- BLEU: N-gram-based similarity measurement for evaluating translation quality.
- ROUGE: Text overlap measurement for evaluating summarization quality.
- METEOR: Machine translation evaluation that also accounts for stemming and synonym matches, not just exact n-gram overlap.
- TER: Translation Edit Rate; measures how many edits are needed to turn the output into a reference (lower is better, 0 indicates a perfect match).
- BERTScore: Semantic similarity measurement using BERT embeddings.
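As a rough illustration of how these metrics are computed against references, the sketch below uses the Hugging Face evaluate library; this is an assumption for demonstration purposes, not necessarily the platform's own tooling, and the example sentences are invented.

```python
# Minimal sketch of reference-based metrics using the Hugging Face `evaluate`
# library (an assumption; the platform may compute these with its own tooling).
import evaluate

predictions = ["The cat sat on the mat."]            # model outputs (illustrative)
references  = [["The cat is sitting on the mat."]]   # one or more references per output

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
ter  = evaluate.load("ter").compute(predictions=predictions, references=references)

# ROUGE and METEOR are computed here against a single reference per prediction.
flat_refs = [refs[0] for refs in references]
rouge  = evaluate.load("rouge").compute(predictions=predictions, references=flat_refs)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=flat_refs)

# BERTScore compares contextual embeddings; `lang` selects the underlying model.
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=flat_refs, lang="en"
)

print("BLEU:", bleu["bleu"])
print("TER (lower is better):", ter["score"])
print("ROUGE-L:", rouge["rougeL"])
print("METEOR:", meteor["meteor"])
print("BERTScore F1:", bertscore["f1"][0])
```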