Package evaluation

v1.20.0
Published: Jan 30, 2026 License: Apache-2.0 Imports: 29 Imported by: 0

Documentation

Overview

Package evaluation provides an evaluation framework for testing agents.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GenerateRunName added in v1.19.0

func GenerateRunName() string

GenerateRunName creates a memorable name for an evaluation run.

func Save

func Save(sess *session.Session, filename string) (string, error)

func SaveRunJSON added in v1.19.1

func SaveRunJSON(run *EvalRun, outputDir string) (string, error)

Types

type Config added in v1.19.0

type Config struct {
	AgentFilename  string   // Path to the agent configuration file
	EvalsDir       string   // Directory containing evaluation files
	JudgeModel     string   // Model for relevance checking (format: provider/model, optional)
	Concurrency    int      // Number of concurrent runs (0 = number of CPUs)
	TTYFd          int      // File descriptor for terminal size queries (e.g., int(os.Stdout.Fd()))
	Only           []string // Only run evaluations matching these patterns
	BaseImage      string   // Custom base Docker image for running evaluations
	KeepContainers bool     // If true, don't remove containers after evaluation (skip --rm)
}

Config holds configuration for evaluation runs.
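
A minimal configuration sketch; the file paths and image settings shown are illustrative placeholders:

cfg := evaluation.Config{
	AgentFilename: "agent.yaml",        // placeholder path to the agent configuration
	EvalsDir:      "evals",             // placeholder evaluations directory
	JudgeModel:    "provider/model",    // optional; enables LLM relevance checks
	Concurrency:   0,                   // 0 = use the number of CPUs
	TTYFd:         int(os.Stdout.Fd()), // for terminal size queries
}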

type EvalCriteria added in v1.19.0

type EvalCriteria struct {
	Relevance  []string `json:"relevance,omitempty"`   // Statements that should be true about the response
	WorkingDir string   `json:"working_dir,omitempty"` // Subdirectory under evals/working_dirs/
	Size       string   `json:"size,omitempty"`        // Expected response size: S, M, L, XL
}

EvalCriteria contains the evaluation criteria for a test case.
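
An illustrative set of criteria; the relevance statements and directory name are placeholders:

criteria := evaluation.EvalCriteria{
	Relevance: []string{
		"The response identifies the root cause of the failing test",
		"The response proposes a concrete fix",
	},
	WorkingDir: "example-project", // subdirectory under evals/working_dirs/
	Size:       "M",               // one of S, M, L, XL
}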

type EvalRun added in v1.19.0

type EvalRun struct {
	Name      string        `json:"name"`
	Timestamp time.Time     `json:"timestamp"`
	Duration  time.Duration `json:"duration"`
	Results   []Result      `json:"results"`
	Summary   Summary       `json:"summary"`
}

EvalRun contains the results and metadata for an evaluation run.

func Evaluate

func Evaluate(ctx context.Context, ttyOut, out io.Writer, isTTY bool, runName string, runConfig *config.RuntimeConfig, cfg Config) (*EvalRun, error)

Evaluate runs evaluations with a specified run name. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).
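
A sketch of a complete run, assuming runConfig (*config.RuntimeConfig) is constructed elsewhere and cfg is an evaluation.Config like the one shown above; the output directory is a placeholder:

runName := evaluation.GenerateRunName()
run, err := evaluation.Evaluate(context.Background(), os.Stdout, os.Stdout, true, runName, runConfig, cfg)
if err != nil {
	log.Fatal(err)
}
path, err := evaluation.SaveRunJSON(run, "eval-results") // placeholder output directory
if err != nil {
	log.Fatal(err)
}
fmt.Println("results written to", path)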

type EvalSession added in v1.19.0

type EvalSession struct {
	session.Session
	Evals      EvalCriteria `json:"evals"`
	SourcePath string       `json:"-"` // Path to the source eval file (not serialized)
}

EvalSession extends session.Session with evaluation criteria.

type Judge added in v1.20.0

type Judge struct {
	// contains filtered or unexported fields
}

Judge runs LLM-as-a-judge relevance checks concurrently.

func NewJudge added in v1.20.0

func NewJudge(model provider.Provider, runConfig *config.RuntimeConfig, concurrency int) *Judge

NewJudge creates a new Judge that runs relevance checks with the given concurrency. The concurrency defaults to 1 if the given value is less than 1.

func (*Judge) CheckRelevance added in v1.20.0

func (j *Judge) CheckRelevance(ctx context.Context, response string, criteria []string) (passed int, failed, errs []string)

CheckRelevance runs all relevance checks concurrently with the configured concurrency. It returns the number of passed checks, a slice of failed criteria, and any errors encountered.
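
A sketch of standalone relevance checking; model (a provider.Provider) and runConfig (*config.RuntimeConfig) are assumed to be constructed elsewhere, and response and the criteria are illustrative:

criteria := []string{
	"The response identifies the root cause",
	"The response proposes a concrete fix",
}
judge := evaluation.NewJudge(model, runConfig, 4) // run up to 4 checks at a time
passed, failed, errs := judge.CheckRelevance(context.Background(), response, criteria)
fmt.Printf("relevance: %d/%d passed\n", passed, len(criteria))
for _, f := range failed {
	fmt.Println("failed:", f)
}
for _, e := range errs {
	fmt.Println("judge error:", e)
}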

type Result

type Result struct {
	InputPath         string           `json:"input_path"`
	Title             string           `json:"title"`
	Question          string           `json:"question"`
	Response          string           `json:"response"`
	Cost              float64          `json:"cost"`
	OutputTokens      int64            `json:"output_tokens"`
	Size              string           `json:"size"`
	SizeExpected      string           `json:"size_expected"`
	ToolCallsScore    float64          `json:"tool_calls_score"`
	ToolCallsExpected float64          `json:"tool_calls_score_expected"`
	HandoffsMatch     bool             `json:"handoffs"`
	RelevancePassed   float64          `json:"relevance"`
	RelevanceExpected float64          `json:"relevance_expected"`
	FailedRelevance   []string         `json:"failed_relevance,omitempty"`
	Error             string           `json:"error,omitempty"`
	RawOutput         []map[string]any `json:"raw_output,omitempty"`
}

Result contains the evaluation results for a single test case.
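
A sketch of scanning results for failures, where run is the *EvalRun returned by Evaluate:

for _, res := range run.Results {
	if res.Error != "" {
		fmt.Printf("%s: error: %s\n", res.Title, res.Error)
		continue
	}
	for _, crit := range res.FailedRelevance {
		fmt.Printf("%s: failed relevance check: %q\n", res.Title, crit)
	}
}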

type Runner added in v1.19.0

type Runner struct {
	Config
	// contains filtered or unexported fields
}

Runner runs evaluations against an agent.

func (*Runner) Run added in v1.19.0

func (r *Runner) Run(ctx context.Context, ttyOut, out io.Writer, isTTY bool) ([]Result, error)

Run executes all evaluations concurrently and returns results. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).

type Summary added in v1.19.0

type Summary struct {
	TotalEvals      int     `json:"total_evals"`
	FailedEvals     int     `json:"failed_evals"`
	TotalCost       float64 `json:"total_cost"`
	SizesPassed     int     `json:"sizes_passed"`
	SizesTotal      int     `json:"sizes_total"`
	ToolsPassed     float64 `json:"tools_passed"`
	ToolsTotal      float64 `json:"tools_total"`
	HandoffsPassed  int     `json:"handoffs_passed"`
	HandoffsTotal   int     `json:"handoffs_total"`
	RelevancePassed float64 `json:"relevance_passed"`
	RelevanceTotal  float64 `json:"relevance_total"`
}

Summary contains aggregate statistics across all evaluations.
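
A sketch that turns the counters into a short report; the totals can be zero when a category has no checks, so guard the divisions:

s := run.Summary
fmt.Printf("evals: %d total, %d failed, cost $%.2f\n", s.TotalEvals, s.FailedEvals, s.TotalCost)
if s.SizesTotal > 0 {
	fmt.Printf("sizes: %d/%d passed\n", s.SizesPassed, s.SizesTotal)
}
if s.RelevanceTotal > 0 {
	fmt.Printf("relevance: %.0f%% passed\n", 100*s.RelevancePassed/s.RelevanceTotal)
}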
