
Evaluation Task

📊 Basic Evaluation Flow

The Evaluation Task is the most fundamental evaluation workflow in Datumo Eval.
It uses a Judge model to compare and evaluate the responses of a Target model, so you can quantify model performance against a Dataset.

The overall flow is as follows:

  1. Task ์ƒ์„ฑ (Create an Evaluation Task)
  2. Eval Set ์ƒ์„ฑ ๋ฐ ํ‰๊ฐ€ ์‹คํ–‰ (Run Eval Set)
  3. ํ‰๊ฐ€ ๊ฒฐ๊ณผ ํ™•์ธ (Check Results)
  4. (Advanced) Task ํ‰๊ฐ€ ๊ด€๋ฆฌ, ํ‰๊ฐ€ ๊ฒฐ๊ณผ ์ˆ˜์ •, BEIR Leaderboard ๋ทฐ ํ™•์ธ
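To make the relationship between these steps concrete, here is a minimal in-memory Python sketch. It is a hypothetical model, not how Datumo Eval stores Tasks: `EvalSet`, `EvaluationTask`, and their methods are invented for illustration, the Judge scores are passed in directly rather than produced by a Judge model, and `leaderboard()` is a generic ranking by mean score, not the metrics shown in the BEIR Leaderboard view.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class EvalSet:
    """One evaluation run of a Target model (hypothetical structure)."""
    name: str
    target_model: str
    scores: List[float] = field(default_factory=list)  # one Judge score per Dataset row

    def mean_score(self) -> float:
        return mean(self.scores) if self.scores else 0.0


@dataclass
class EvaluationTask:
    """Step 1: a Task groups the Eval Sets run against it."""
    name: str
    eval_sets: Dict[str, EvalSet] = field(default_factory=dict)

    def run_eval_set(self, name: str, target_model: str, judge_scores: List[float]) -> EvalSet:
        """Step 2: register an Eval Set; scores are assumed to come from a Judge model."""
        eval_set = EvalSet(name, target_model, list(judge_scores))
        self.eval_sets[name] = eval_set
        return eval_set

    def results(self) -> Dict[str, float]:
        """Step 3: mean Judge score per Eval Set."""
        return {name: s.mean_score() for name, s in self.eval_sets.items()}

    def edit_result(self, set_name: str, row: int, new_score: float) -> None:
        """Step 4: manually correct one Judge score (KeyError/IndexError if absent)."""
        self.eval_sets[set_name].scores[row] = new_score

    def leaderboard(self) -> List[Tuple[str, float]]:
        """Step 4: rank Target models by mean score, highest first."""
        ranked = [(s.target_model, s.mean_score()) for s in self.eval_sets.values()]
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)


task = EvaluationTask("qa-task")                          # Step 1
task.run_eval_set("set-a", "model-a", [1.0, 0.0, 1.0])    # Step 2
task.run_eval_set("set-b", "model-b", [1.0, 1.0, 0.0])
task.edit_result("set-b", 2, 1.0)                         # Step 4: correct a Judge score
print(task.results())                                     # Step 3
print(task.leaderboard())
```

Because the aggregates are computed on demand from the stored per-row scores, an edited result is reflected immediately in both the results and the ranking.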