Evaluation

Datumo Eval provides comprehensive evaluation methods, ranging from AI Judge-based automatic evaluation to human-led qualitative evaluation and quantitative evaluation against standard benchmarks. Evaluate LLM response quality in whichever way best suits your needs.



Each evaluation method offers the following features, grouped by purpose:

  • Judge Evaluation: A feature set that automatically evaluates model response quality against defined rubrics using LLM-based AI Judge models. Includes the Evaluation Task, RAGAS Task, RAG Checker, and Auto Red-Teaming features. (A conceptual sketch of the LLM-as-judge pattern follows this list.)

  • Qualitative Evaluation (Human Evaluation): A feature set in which humans directly evaluate response appropriateness, creativity, and contextual understanding based on rubrics. Includes the Manual Evaluation and Interactive Evaluation features.

  • Quantitative Evaluation: A feature set that objectively compares and analyzes model performance using standard benchmark datasets and NLP-based metrics. Includes the Harness Task and Reference-based Evaluation features. (A minimal reference-based metric sketch also follows this list.)
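
To make the Judge Evaluation idea concrete, here is a minimal sketch of the general rubric-based LLM-as-judge pattern. It is not the Datumo Eval API: the rubric criteria, `build_judge_prompt`, and the stubbed `call_judge_model` function are hypothetical placeholders used only to illustrate how a rubric, a question, and a model response are combined into a judge prompt and scored.

```python
"""Conceptual sketch of rubric-based LLM-as-judge scoring.

Not the Datumo Eval API; illustrates the general pattern only.
"""
import json

# Hypothetical rubric: the judge returns structured scores per criterion.
RUBRIC = """Score the response from 1 to 5 for each criterion:
- accuracy: factual correctness with respect to the question
- completeness: covers all parts of the question
Return JSON: {"accuracy": <int>, "completeness": <int>, "rationale": "<text>"}"""


def build_judge_prompt(question: str, response: str) -> str:
    """Combine the rubric, the original question, and the response to evaluate."""
    return f"{RUBRIC}\n\nQuestion:\n{question}\n\nResponse to evaluate:\n{response}"


def call_judge_model(prompt: str) -> str:
    """Placeholder for a call to an LLM judge via your provider's SDK."""
    # Stubbed so the sketch runs without network access.
    return '{"accuracy": 4, "completeness": 3, "rationale": "Mostly correct but brief."}'


def judge(question: str, response: str) -> dict:
    """Build the judge prompt, query the judge, and parse its structured scores."""
    raw = call_judge_model(build_judge_prompt(question, response))
    return json.loads(raw)


if __name__ == "__main__":
    print(judge("What does HTTP status 404 mean?",
                "It means the requested resource was not found."))
```

In a real pipeline, the parsed scores would be aggregated across a dataset and broken down per rubric criterion.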
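
Similarly, for Quantitative Evaluation, the sketch below shows one kind of reference-based NLP metric: a SQuAD-style token-overlap F1 between a model answer and a reference answer. This is an illustrative example of the metric family, not necessarily the metric set Datumo Eval ships with.

```python
"""Minimal sketch of a reference-based quantitative metric (token-level F1)."""
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Word order differs, but the token overlap is complete, so F1 = 1.0.
    print(token_f1("Paris is the capital of France",
                   "The capital of France is Paris"))
```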

Step-by-step usage can be found in the tutorials for each evaluation method.