Documentation ¶
Overview ¶
Package evaluation provides an evaluation framework for testing agents.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func GenerateRunName ¶ added in v1.19.0
func GenerateRunName() string
GenerateRunName creates a memorable name for an evaluation run.
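For instance, a minimal sketch of generating a name up front to identify a run (the import path is assumed and the generated format is implementation-defined):

package main

import (
	"fmt"

	"example.com/yourmodule/evaluation" // hypothetical import path
)

func main() {
	// Generate a memorable, human-readable name to identify this run.
	name := evaluation.GenerateRunName()
	fmt.Println("run name:", name)
}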
Types ¶
type Config ¶ added in v1.19.0
type Config struct {
    AgentFilename  string   // Path to the agent configuration file
    EvalsDir       string   // Directory containing evaluation files
    JudgeModel     string   // Model for relevance checking (format: provider/model, optional)
    Concurrency    int      // Number of concurrent runs (0 = number of CPUs)
    TTYFd          int      // File descriptor for terminal size queries (e.g., int(os.Stdout.Fd()))
    Only           []string // Only run evaluations matching these patterns
    BaseImage      string   // Custom base Docker image for running evaluations
    KeepContainers bool     // If true, don't remove containers after evaluation (skip --rm)
}
Config holds configuration for evaluation runs.
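A hedged sketch of filling in a Config; the path, model, and pattern are placeholders, fields left at their zero values fall back to the defaults noted above, and the os package is assumed to be imported:

cfg := evaluation.Config{
    AgentFilename:  "agent.yaml",          // placeholder path to the agent configuration
    EvalsDir:       "evals",               // directory containing evaluation files
    JudgeModel:     "openai/gpt-4o",       // optional provider/model for relevance checks (placeholder)
    Concurrency:    0,                     // 0 = number of CPUs
    TTYFd:          int(os.Stdout.Fd()),   // terminal size queries
    Only:           []string{"search_*"},  // placeholder pattern; nil runs everything
    KeepContainers: false,                 // remove containers when done
}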
type EvalCriteria ¶ added in v1.19.0
type EvalCriteria struct {
    Relevance  []string `json:"relevance,omitempty"`   // Statements that should be true about the response
    WorkingDir string   `json:"working_dir,omitempty"` // Subdirectory under evals/working_dirs/
    Size       string   `json:"size,omitempty"`        // Expected response size: S, M, L, XL
}
EvalCriteria contains the evaluation criteria for a test case.
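The json tags above suggest how criteria are serialized. A hedged sketch of decoding such a fragment (the JSON values are illustrative, how the fragment is embedded in an eval file is not shown, and encoding/json, fmt, and log are assumed to be imported):

// Illustrative criteria fragment matching the json tags above.
raw := []byte(`{
    "relevance": ["mentions the root cause", "links to the failing test"],
    "working_dir": "repo-a",
    "size": "M"
}`)

var criteria evaluation.EvalCriteria
if err := json.Unmarshal(raw, &criteria); err != nil {
    log.Fatal(err)
}
fmt.Println(criteria.Size, len(criteria.Relevance))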
type EvalRun ¶ added in v1.19.0
type EvalRun struct {
    Name      string        `json:"name"`
    Timestamp time.Time     `json:"timestamp"`
    Duration  time.Duration `json:"duration"`
    Results   []Result      `json:"results"`
    Summary   Summary       `json:"summary"`
}
EvalRun contains the results and metadata for an evaluation run.
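Because every field carries a json tag, a completed run can be persisted directly; a minimal sketch, where run is the *EvalRun returned by Evaluate and encoding/json, log, and os are assumed to be imported:

// Write a completed run to disk for later comparison; the file name is arbitrary.
data, err := json.MarshalIndent(run, "", "  ")
if err != nil {
    log.Fatal(err)
}
if err := os.WriteFile(run.Name+".json", data, 0o644); err != nil {
    log.Fatal(err)
}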
func Evaluate ¶
func Evaluate(ctx context.Context, ttyOut, out io.Writer, isTTY bool, runName string, runConfig *config.RuntimeConfig, cfg Config) (*EvalRun, error)
Evaluate runs evaluations with a specified run name. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).
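Putting the pieces together, a hedged end-to-end sketch; the import paths are assumptions, and constructing the *config.RuntimeConfig is outside this package, so it is left as a placeholder:

package main

import (
	"context"
	"log"
	"os"

	"example.com/yourmodule/config"     // hypothetical import path
	"example.com/yourmodule/evaluation" // hypothetical import path
)

func main() {
	cfg := evaluation.Config{
		AgentFilename: "agent.yaml", // placeholder agent configuration
		EvalsDir:      "evals",
		TTYFd:         int(os.Stdout.Fd()),
	}

	// Construction of the runtime config is not shown; it is supplied by the host application.
	var runtimeCfg *config.RuntimeConfig

	// Progress bars render on the terminal (stderr here); results and status
	// go to stdout, which could instead be tee'd to a log file.
	isTTY := true // whether ttyOut is a real terminal; detect this properly in real use
	run, err := evaluation.Evaluate(context.Background(), os.Stderr, os.Stdout, isTTY,
		evaluation.GenerateRunName(), runtimeCfg, cfg)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("%s: %d/%d evals failed, total cost $%.4f",
		run.Name, run.Summary.FailedEvals, run.Summary.TotalEvals, run.Summary.TotalCost)
}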
type EvalSession ¶ added in v1.19.0
type EvalSession struct {
    session.Session
    Evals      EvalCriteria `json:"evals"`
    SourcePath string       `json:"-"` // Path to the source eval file (not serialized)
}
EvalSession extends session.Session with evaluation criteria.
type Judge ¶ added in v1.20.0
type Judge struct {
    // contains filtered or unexported fields
}
Judge runs LLM-as-a-judge relevance checks concurrently.
func NewJudge ¶ added in v1.20.0
NewJudge creates a new Judge that runs relevance checks with the given concurrency. Concurrency defaults to 1 if n < 1.
func (*Judge) CheckRelevance ¶ added in v1.20.0
func (j *Judge) CheckRelevance(ctx context.Context, response string, criteria []string) (passed int, failed, errs []string)
CheckRelevance runs all relevance checks concurrently with the configured concurrency. It returns the number of passed checks, a slice of failed criteria, and any errors encountered.
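A hedged sketch of running relevance checks, assuming judge is a *Judge obtained from NewJudge (whose exact parameters are not reproduced above) and that ctx, response, and fmt are in scope; the criteria strings are illustrative:

criteria := []string{
    "the response cites the relevant file",
    "the response proposes a concrete fix",
}
passed, failed, errs := judge.CheckRelevance(ctx, response, criteria)
fmt.Printf("relevance: %d/%d passed\n", passed, len(criteria))
for _, f := range failed {
    fmt.Println("failed:", f)
}
for _, e := range errs {
    fmt.Println("judge error:", e)
}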
type Result ¶
type Result struct {
    InputPath         string           `json:"input_path"`
    Title             string           `json:"title"`
    Question          string           `json:"question"`
    Response          string           `json:"response"`
    Cost              float64          `json:"cost"`
    OutputTokens      int64            `json:"output_tokens"`
    Size              string           `json:"size"`
    SizeExpected      string           `json:"size_expected"`
    ToolCallsScore    float64          `json:"tool_calls_score"`
    ToolCallsExpected float64          `json:"tool_calls_score_expected"`
    HandoffsMatch     bool             `json:"handoffs"`
    RelevancePassed   float64          `json:"relevance"`
    RelevanceExpected float64          `json:"relevance_expected"`
    FailedRelevance   []string         `json:"failed_relevance,omitempty"`
    Error             string           `json:"error,omitempty"`
    RawOutput         []map[string]any `json:"raw_output,omitempty"`
}
Result contains the evaluation results for a single test case.
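A small sketch of scanning a run's per-case results for failures, where run is the *EvalRun returned by Evaluate and fmt is assumed to be imported:

for _, r := range run.Results {
    if r.Error != "" {
        fmt.Printf("%s: errored: %s\n", r.Title, r.Error)
        continue
    }
    if len(r.FailedRelevance) > 0 {
        fmt.Printf("%s: %v/%v relevance checks passed; failed: %v\n",
            r.Title, r.RelevancePassed, r.RelevanceExpected, r.FailedRelevance)
    }
    if r.Size != r.SizeExpected {
        fmt.Printf("%s: size %s (expected %s)\n", r.Title, r.Size, r.SizeExpected)
    }
}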
type Runner ¶ added in v1.19.0
type Runner struct {
    Config
    // contains filtered or unexported fields
}
Runner runs evaluations against an agent.
type Summary ¶ added in v1.19.0
type Summary struct {
    TotalEvals      int     `json:"total_evals"`
    FailedEvals     int     `json:"failed_evals"`
    TotalCost       float64 `json:"total_cost"`
    SizesPassed     int     `json:"sizes_passed"`
    SizesTotal      int     `json:"sizes_total"`
    ToolsPassed     float64 `json:"tools_passed"`
    ToolsTotal      float64 `json:"tools_total"`
    HandoffsPassed  int     `json:"handoffs_passed"`
    HandoffsTotal   int     `json:"handoffs_total"`
    RelevancePassed float64 `json:"relevance_passed"`
    RelevanceTotal  float64 `json:"relevance_total"`
}
Summary contains aggregate statistics across all evaluations.
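Aggregate rates for reporting can be derived from Summary; a minimal sketch that guards against empty denominators, with run as above:

s := run.Summary
if s.TotalEvals > 0 {
    fmt.Printf("evals passed: %d/%d\n", s.TotalEvals-s.FailedEvals, s.TotalEvals)
}
if s.RelevanceTotal > 0 {
    fmt.Printf("relevance: %.0f%% (%.0f/%.0f)\n",
        100*s.RelevancePassed/s.RelevanceTotal, s.RelevancePassed, s.RelevanceTotal)
}
fmt.Printf("total cost: $%.4f\n", s.TotalCost)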