How to Interpret and Export Evaluation Results


Step 4. Reviewing Evaluation Results

Once the evaluation is complete, you can view the results in the Table View.
Each sample (query–response pair) is displayed in a separate row,
and each metric (e.g., Precision, Recall, Faithfulness) is presented as a score between 0 and 1.
Clicking on a cell opens the Detail Panel on the right, where you can inspect claim-level evaluation results.

🧩 Detail Panel Overview

The Detail Panel visualizes the claim-level evaluation results for each selected sample (query–response pair).
Each section provides the following information:

  1. Query Section
    Displays the original user query (query) and relevant metadata.
    This serves as the reference question for comparing the Expected Response (ER) and Target Response (TR).
    Click “View reference context” to access the retrieved documents (contexts) corresponding to this query.

  2. Model Response Section
    Shows the actual response (response) generated by the Target Model.
    The Decomposition and Entailment results are visualized at the claim level,
    with each claim color-coded and assigned a score.

    • 2-1. Claim Score Summary
      Summarizes claim-level evaluation scores for the entire response (see the sketch after this list for one way such scores can be computed).
    • 2-2. Claim-Level Judgments
      Each claim is tagged as Entailed, Contradicted, or Irrelevant.
      Additional labels such as “Context Entailed” or “Context Refuted” are displayed beside each claim.
    • 2-3. Full Target Model / Agent Response
      Displays the complete text output generated by the model.

  3. Expected Response Section
    Displays the claims decomposed from the Expected Response (ER).
    Similar to the Model Response section, this includes the claim content, claim-level scores, and the full ER text.

  4. Retrieved Context Section
    Shows the documents (Retrieved Context) referenced by the model during generation.
    Each document includes entailment results indicating whether it supports any of the evaluated claims.
    The top of this section summarizes the Context Precision score; the sketch after this list shows one common way to derive it.

    • Example:
      • C1: Contains one or more correct claims (Relevant Context)
      • C2: Contains no correct claims (Irrelevant Context)
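
The exact formulas behind these scores are not spelled out here, so the following Python sketch only illustrates one common convention: a claim-level score as the share of claims judged Entailed, and Context Precision as the share of retrieved contexts that support at least one evaluated claim. Every name in the snippet (Claim, fraction_entailed, context_precision) is hypothetical and not part of the product.

```python
from dataclasses import dataclass

# Judgment labels mirroring the Detail Panel tags (hypothetical constants).
ENTAILED, CONTRADICTED, IRRELEVANT = "entailed", "contradicted", "irrelevant"

@dataclass
class Claim:
    text: str
    judgment: str  # one of ENTAILED, CONTRADICTED, IRRELEVANT

def fraction_entailed(claims: list[Claim]) -> float:
    """Share of claims judged Entailed (one common claim-level aggregate)."""
    if not claims:
        return 0.0
    return sum(c.judgment == ENTAILED for c in claims) / len(claims)

def context_precision(context_is_relevant: list[bool]) -> float:
    """Share of retrieved contexts that support at least one claim."""
    if not context_is_relevant:
        return 0.0
    return sum(context_is_relevant) / len(context_is_relevant)

# Mirrors the C1/C2 example above: one relevant context, one irrelevant.
claims = [
    Claim("Paris is the capital of France.", ENTAILED),
    Claim("The Eiffel Tower was completed in 1920.", CONTRADICTED),
]
print(fraction_entailed(claims))         # 0.5 (one of two claims entailed)
print(context_precision([True, False]))  # 0.5 (C1 relevant, C2 not)
```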

Step 5. Exporting Results

To export the evaluation results, click the Export button at the top of the Table View.
The results will be downloaded as an .xlsx file,
which includes key metrics such as Precision, Recall, Faithfulness, and Hallucination for each sample.
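
For quick offline analysis, the exported file can be loaded with pandas. This is a minimal sketch: the file name and column headers below are assumptions based on the metric names shown in the UI, so adjust them to match your actual export (reading .xlsx files also requires the openpyxl package).

```python
import pandas as pd

# Assumed file name; use the path of your downloaded export.
df = pd.read_excel("evaluation_results.xlsx")

# Assumed column names, based on the metrics listed above.
metrics = ["Precision", "Recall", "Faithfulness", "Hallucination"]

# With one row per sample, averaging gives a dataset-level summary.
print(df[metrics].mean())
```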

🔍 Tip:

  • The exported file can be used for further analysis or reporting,
    such as comparing performance across different seeds or models (see the sketch below).
  • Detailed claim-level results are available only within the Detail Panel in the UI.
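
To compare runs (for example, two different seeds or target models), the same approach extends to multiple exports. Again, the file names and column headers here are hypothetical.

```python
import pandas as pd

metrics = ["Precision", "Recall", "Faithfulness", "Hallucination"]

# Assumed file names for two separate evaluation runs.
run_a = pd.read_excel("run_a_results.xlsx")
run_b = pd.read_excel("run_b_results.xlsx")

# Per-metric mean difference: positive values mean run A scored higher.
print(run_a[metrics].mean() - run_b[metrics].mean())
```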