Package evaluation

v1.20.0
Published: Jan 30, 2026 License: Apache-2.0 Imports: 29 Imported by: 0

Documentation

Overview

Package evaluation provides an evaluation framework for testing agents.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GenerateRunName added in v1.19.0

func GenerateRunName() string

GenerateRunName creates a memorable name for an evaluation run.

func Save

func Save(sess *session.Session, filename string) (string, error)

func SaveRunJSON added in v1.19.1

func SaveRunJSON(run *EvalRun, outputDir string) (string, error)

Types

type Config added in v1.19.0

type Config struct {
	AgentFilename  string   // Path to the agent configuration file
	EvalsDir       string   // Directory containing evaluation files
	JudgeModel     string   // Model for relevance checking (format: provider/model, optional)
	Concurrency    int      // Number of concurrent runs (0 = number of CPUs)
	TTYFd          int      // File descriptor for terminal size queries (e.g., int(os.Stdout.Fd()))
	Only           []string // Only run evaluations matching these patterns
	BaseImage      string   // Custom base Docker image for running evaluations
	KeepContainers bool     // If true, don't remove containers after evaluation (skip --rm)
}

Config holds configuration for evaluation runs.
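
A minimal configuration sketch; the file paths and image settings shown are illustrative placeholders:

cfg := evaluation.Config{
	AgentFilename: "agent.yaml",        // placeholder path to the agent configuration
	EvalsDir:      "evals",             // placeholder evaluations directory
	JudgeModel:    "provider/model",    // optional; enables LLM relevance checks
	Concurrency:   0,                   // 0 = use the number of CPUs
	TTYFd:         int(os.Stdout.Fd()), // for terminal size queries
}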

type EvalCriteria added in v1.19.0

type EvalCriteria struct {
	Relevance  []string `json:"relevance,omitempty"`   // Statements that should be true about the response
	WorkingDir string   `json:"working_dir,omitempty"` // Subdirectory under evals/working_dirs/
	Size       string   `json:"size,omitempty"`        // Expected response size: S, M, L, XL
}

EvalCriteria contains the evaluation criteria for a test case.
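
An illustrative set of criteria; the relevance statements and directory name are placeholders:

criteria := evaluation.EvalCriteria{
	Relevance: []string{
		"The response identifies the root cause of the failing test",
		"The response proposes a concrete fix",
	},
	WorkingDir: "example-project", // subdirectory under evals/working_dirs/
	Size:       "M",               // one of S, M, L, XL
}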

type EvalRun added in v1.19.0

type EvalRun struct {
	Name      string        `json:"name"`
	Timestamp time.Time     `json:"timestamp"`
	Duration  time.Duration `json:"duration"`
	Results   []Result      `json:"results"`
	Summary   Summary       `json:"summary"`
}

EvalRun contains the results and metadata for an evaluation run.

func Evaluate

func Evaluate(ctx context.Context, ttyOut, out io.Writer, isTTY bool, runName string, runConfig *config.RuntimeConfig, cfg Config) (*EvalRun, error)

Evaluate runs evaluations with a specified run name. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).
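
A sketch of a complete run, assuming runConfig (*config.RuntimeConfig) is constructed elsewhere and cfg is an evaluation.Config like the one shown above; the output directory is a placeholder:

runName := evaluation.GenerateRunName()
run, err := evaluation.Evaluate(context.Background(), os.Stdout, os.Stdout, true, runName, runConfig, cfg)
if err != nil {
	log.Fatal(err)
}
path, err := evaluation.SaveRunJSON(run, "eval-results") // placeholder output directory
if err != nil {
	log.Fatal(err)
}
fmt.Println("results written to", path)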

type EvalSession added in v1.19.0

type EvalSession struct {
	session.Session
	Evals      EvalCriteria `json:"evals"`
	SourcePath string       `json:"-"` // Path to the source eval file (not serialized)
}

EvalSession extends session.Session with evaluation criteria.

type Judge added in v1.20.0

type Judge struct {
	// contains filtered or unexported fields
}

Judge runs LLM-as-a-judge relevance checks concurrently.

func NewJudge added in v1.20.0

func NewJudge(model provider.Provider, runConfig *config.RuntimeConfig, concurrency int) *Judge

NewJudge creates a new Judge that runs relevance checks with the given concurrency. The concurrency defaults to 1 if the given value is less than 1.

func (*Judge) CheckRelevance added in v1.20.0

func (j *Judge) CheckRelevance(ctx context.Context, response string, criteria []string) (passed int, failed, errs []string)

CheckRelevance runs all relevance checks concurrently with the configured concurrency. It returns the number of passed checks, a slice of failed criteria, and any errors encountered.
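
A sketch of standalone relevance checking; model (a provider.Provider) and runConfig (*config.RuntimeConfig) are assumed to be constructed elsewhere, and response and the criteria are illustrative:

criteria := []string{
	"The response identifies the root cause",
	"The response proposes a concrete fix",
}
judge := evaluation.NewJudge(model, runConfig, 4) // run up to 4 checks at a time
passed, failed, errs := judge.CheckRelevance(context.Background(), response, criteria)
fmt.Printf("relevance: %d/%d passed\n", passed, len(criteria))
for _, f := range failed {
	fmt.Println("failed:", f)
}
for _, e := range errs {
	fmt.Println("judge error:", e)
}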

type Result

type Result struct {
	InputPath         string           `json:"input_path"`
	Title             string           `json:"title"`
	Question          string           `json:"question"`
	Response          string           `json:"response"`
	Cost              float64          `json:"cost"`
	OutputTokens      int64            `json:"output_tokens"`
	Size              string           `json:"size"`
	SizeExpected      string           `json:"size_expected"`
	ToolCallsScore    float64          `json:"tool_calls_score"`
	ToolCallsExpected float64          `json:"tool_calls_score_expected"`
	HandoffsMatch     bool             `json:"handoffs"`
	RelevancePassed   float64          `json:"relevance"`
	RelevanceExpected float64          `json:"relevance_expected"`
	FailedRelevance   []string         `json:"failed_relevance,omitempty"`
	Error             string           `json:"error,omitempty"`
	RawOutput         []map[string]any `json:"raw_output,omitempty"`
}

Result contains the evaluation results for a single test case.
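
A sketch of scanning results for failures, where run is the *EvalRun returned by Evaluate:

for _, res := range run.Results {
	if res.Error != "" {
		fmt.Printf("%s: error: %s\n", res.Title, res.Error)
		continue
	}
	for _, crit := range res.FailedRelevance {
		fmt.Printf("%s: failed relevance check: %q\n", res.Title, crit)
	}
}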

type Runner added in v1.19.0

type Runner struct {
	Config
	// contains filtered or unexported fields
}

Runner runs evaluations against an agent.

func (*Runner) Run added in v1.19.0

func (r *Runner) Run(ctx context.Context, ttyOut, out io.Writer, isTTY bool) ([]Result, error)

Run executes all evaluations concurrently and returns results. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).

type Summary added in v1.19.0

type Summary struct {
	TotalEvals      int     `json:"total_evals"`
	FailedEvals     int     `json:"failed_evals"`
	TotalCost       float64 `json:"total_cost"`
	SizesPassed     int     `json:"sizes_passed"`
	SizesTotal      int     `json:"sizes_total"`
	ToolsPassed     float64 `json:"tools_passed"`
	ToolsTotal      float64 `json:"tools_total"`
	HandoffsPassed  int     `json:"handoffs_passed"`
	HandoffsTotal   int     `json:"handoffs_total"`
	RelevancePassed float64 `json:"relevance_passed"`
	RelevanceTotal  float64 `json:"relevance_total"`
}

Summary contains aggregate statistics across all evaluations.
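
A sketch that turns the counters into a short report; the totals can be zero when a category has no checks, so guard the divisions:

s := run.Summary
fmt.Printf("evals: %d total, %d failed, cost $%.2f\n", s.TotalEvals, s.FailedEvals, s.TotalCost)
if s.SizesTotal > 0 {
	fmt.Printf("sizes: %d/%d passed\n", s.SizesPassed, s.SizesTotal)
}
if s.RelevanceTotal > 0 {
	fmt.Printf("relevance: %.0f%% passed\n", 100*s.RelevancePassed/s.RelevanceTotal)
}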
