Files
Kendell db4b00ecf7 feat: add LLM eval suite for Payload conventions and code generation (#15710)
## Overview

The suite tests two complementary things:

- **QA evals** — does the model correctly answer questions about
Payload's API and conventions?
- **Codegen evals** — can the model apply a specific change to a real
`payload.config.ts` file, producing valid TypeScript with the right
outcome?

Codegen evals use a three-step pipeline: `LLM generation` → `TypeScript
compilation` → `LLM scoring`.

## Skills Evaluation

Each QA suite runs in two modes to measure the impact of injecting
`SKILL.md` as passive context:

| Spec file | System prompt | Purpose |
| ------------------------------- | --------------------------------- |
----------------------- |
| `eval.<suite>.spec.ts` | `qaWithSkill` — SKILL.md injected | Primary
eval |
| `eval.<suite>.baseline.spec.ts` | `qaNoSkill` — no context doc |
Baseline for comparison |

Both modes are passive context injection (the document goes directly
into the `system:` field). There is no tool-call indirection. The delta
between the two is a direct measure of what SKILL.md contributes.

> Cache keys include `systemPromptKey`, so `qaWithSkill` and `qaNoSkill`
results are always stored as separate entries and never collide.

## Running the evals

```bash
# Run all evals (with skill, high-power model)
pnpm run test:eval

# Run all evals — baseline (no skill context, high-power model)
pnpm run test:eval -- eval.baseline

# Run a specific suite only
pnpm run test:eval -- eval.config
pnpm run test:eval -- eval.conventions

# Force a fresh run, bypassing the result cache
EVAL_NO_CACHE=true pnpm run test:eval

# Run with an interactive HTML report (opens in browser after run)
pnpm run test:eval:report

# Report for a specific suite
pnpm run test:eval:report -- eval.config
```

`OPENAI_API_KEY` must be set in your environment.

The `test:eval:report` script generates
`test/evals/eval-results/report.html` and serves it locally via Vitest
UI. The file is gitignored.

## Pipelines

### QA Pipeline

```mermaid
flowchart LR
    qaCase["EvalCase"]
    optFixture["fixture"]
    systemPrompt["system prompt\n(qaWithSkill or qaNoSkill)"]
    runEval["runEval"]
    scoreAnswer["scoreAnswer"]
    qaResult["EvalResult"]

    qaCase --> runEval
    optFixture -->|"injected into prompt"| runEval
    systemPrompt --> runEval
    runEval --> scoreAnswer
    scoreAnswer --> qaResult
```

### Codegen Pipeline

```mermaid
flowchart LR
    codegenCase["CodegenEvalCase"]
    fixture["fixture"]
    runCodegenEval["runCodegenEval"]
    tsc["validateConfigTypes"]
    scoreConfigChange["scoreConfigChange"]
    codegenResult["EvalResult"]

    codegenCase --> fixture
    fixture --> runCodegenEval
    runCodegenEval --> tsc
    tsc -->|"valid"| scoreConfigChange
    tsc -->|"invalid"| codegenResult
    scoreConfigChange --> codegenResult
```

> The tsc check is the hard gate — if the generated TypeScript does not
compile, the case fails immediately without calling the scorer. This
keeps the scorer focused on semantic correctness rather than syntax
errors.

> Codegen always uses the `configModify` system prompt regardless of
skill variant. Codegen cache keys do not include `systemPromptKey`, so
codegen results are shared between `with-skill` and `baseline` runs —
this is intentional and correct.

### Result Caching

```mermaid
flowchart LR
    start["Eval"]
    cacheCheck{"cache hit?"}
    cached["cached EvalResult"]
    run["Run full pipeline"]
    write["eval-results/cache/<hash>.json"]
    done["EvalResult"]

    start --> cacheCheck
    cacheCheck -->|"yes + EVAL_NO_CACHE unset"| cached
    cacheCheck -->|"no or EVAL_NO_CACHE=true"| run
    run --> write
    write --> done
    cached --> done
```

Cache keys include the model ID and (for QA) the `systemPromptKey`, so
the following never collide:

- `eval.spec.ts` (gpt-5.2 + qaWithSkill)
- `eval.baseline.spec.ts` (gpt-5.2 + qaNoSkill)
- `eval.low-power.spec.ts` (gpt-4o + qaWithSkill)

## Token Usage Tracking

Every `EvalResult` includes a `usage` object covering all LLM calls for
that case:

```jsonc
{
  "result": {
    "pass": true,
    "score": 0.92,
    "usage": {
      "runner": {
        "inputTokens": 3499,
        "cachedInputTokens": 3328,
        "outputTokens": 280,
        "totalTokens": 3779,
      },
      "scorer": {
        "inputTokens": 669,
        "cachedInputTokens": 0,
        "outputTokens": 89,
        "totalTokens": 758,
      },
      "total": {
        "inputTokens": 4168,
        "cachedInputTokens": 3328,
        "outputTokens": 369,
        "totalTokens": 4537,
      },
    },
  },
}
```

- **`runner`** — tokens spent generating the answer or modified config.
- **`scorer`** — tokens spent evaluating the result (consistent across
skill variants since the scorer prompt is fixed).
- **`total`** — sum of runner + scorer for full per-case cost.
- **`cachedInputTokens`** — the key signal for skill efficiency.
`qaWithSkill` injects SKILL.md (~3,400 tokens) into every system prompt.
Once the API warms the prompt cache, ~95% of those tokens are
`cachedInputTokens` (billed at a reduced rate), so the net new tokens
per call drops to ~170 — nearly identical to the `qaNoSkill` baseline.

For codegen cases that fail tsc, `scorer` is absent and `total` equals
`runner`.

Usage is stored in the cache alongside the result, so historical runs
retain their token data for cost comparisons across model variants and
skill configurations.

## Negative Tests

The negative suite tests the evaluation pipeline itself as much as the
model:

| Test | What it checks |
| ------------------------ |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
|
| **Detection (QA)** | Given a broken config, does the model identify
the specific error? Expects ≥ 70% accuracy. |
| **Correction (Codegen)** | Given a broken config, does the model fix
the error? tsc must pass after correction. |
| **Invalid instruction** | The model is explicitly told to introduce a
bad field type. The test passes only if tsc catches the error and the
pipeline correctly reports it as a failure. |

The three broken fixtures (`invalid-field-type`,
`invalid-access-return`, `missing-beforechange-return`) are shared by
both the detection and correction datasets.

## Adding a new eval case

**QA case** — add an entry to the appropriate
`datasets/<category>/qa.ts`:

```typescript
{
  input: 'How do you configure Payload to send emails?',
  expected: 'set the email property in buildConfig with an adapter like nodemailerAdapter',
  category: 'config',
}
```

**Codegen case** — create a fixture first, then add the dataset entry:

1. Add `test/evals/fixtures/<category>/codegen/<name>/payload.config.ts`
— a minimal but valid config that gives the LLM context for the specific
task.
2. Add an entry to `datasets/<category>/codegen.ts`:

```typescript
{
  input: 'Add a text field named "excerpt" to the posts collection.',
  expected: 'text field with name "excerpt" added to posts.fields',
  category: 'collections',
  fixturePath: 'collections/codegen/<name>',
}
```

The cache key for codegen includes the fixture file's **content** (not
just its path), so updating a fixture automatically invalidates its
cached result.

## Admin
The admin interface for evals has a way of inspecting cached results.
<img width="2318" height="149" alt="image"
src="https://github.com/user-attachments/assets/c8c87387-e65f-40b5-8a8f-54701e26a3c7"
/>

This gives users the ability to find improvements, regressions, and
better understand model capabilities.
<img width="2343" height="794" alt="image"
src="https://github.com/user-attachments/assets/61b41c8c-4802-40c3-a81c-115ed309ae3d"
/>


## Debugging failed cases

Every failed case writes a JSON file to
`eval-results/failed-assertions/<label-slug>/`. For codegen cases this
includes the starter config, the LLM-generated config, tsc errors (if
any), and the scorer's reasoning. For QA cases it includes the question,
expected answer, actual answer, and reasoning.

The generated `.ts` files in `eval-results/<category>/codegen/` show the
last LLM output for each fixture and can be opened directly in the
editor for manual inspection.

---------

Co-authored-by: Elliot DeNolf <denolfe@gmail.com>
2026-03-24 17:10:19 -04:00

366 lines
6.5 KiB
Plaintext

coverage
package-lock.json
dist
/.idea/*
!/.idea/runConfigurations
/.idea/runConfigurations/_template*
!/.idea/payload.iml
# Local AI Agent files
AGENTS.local.md
CLAUDE.local.md
.claude/commands/*.local.md
.claude/artifacts
.worktrees/
# Custom actions
!.github/actions/**/dist
test/packed
test-results
.devcontainer
.localstack
/migrations
.localstack
.turbo
meta_client.json
meta_server.json
meta_index.json
meta_shared.json
packages/payload/esbuild
packages/ui/esbuild
packages/next/esbuild
packages/richtext-lexical/esbuild
audit_output.json
.turbo
.wrangler
# Ignore test directory media folder/files
/media
test/media
*payloadtests.db
*payloadtests.db-journal
*payloadtests.db-shm
*payloadtests.db-wal
*payload.db
/versions
no-restrict-file-*
# Created by https://www.toptal.com/developers/gitignore/api/node,macos,windows,webstorm,sublimetext,visualstudiocode
# Edit at https://www.toptal.com/developers/gitignore?templates=node,macos,windows,webstorm,sublimetext,visualstudiocode
### macOS ###
# General
.DS_Store
.AppleDouble
.LSOverride
# Thumbnails
._*
# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent
# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk
### macOS Patch ###
# iCloud generated files
*.icloud
### Node ###
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
lerna-debug.log*
.pnpm-debug.log*
# Diagnostic reports (https://nodejs.org/api/report.html)
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json
# Runtime data
pids
*.pid
*.seed
*.pid.lock
# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov
# Coverage directory used by tools like istanbul
coverage
*.lcov
# nyc test coverage
.nyc_output
# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
.grunt
# Bower dependency directory (https://bower.io/)
bower_components
# node-waf configuration
.lock-wscript
# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release
# Dependency directories
node_modules/
jspm_packages/
# Snowpack dependency directory (https://snowpack.dev/)
web_modules/
# TypeScript cache
*.tsbuildinfo
# Optional npm cache directory
.npm
# Optional eslint cache
.eslintcache
# Optional stylelint cache
.stylelintcache
# Microbundle cache
.rpt2_cache/
.rts2_cache_cjs/
.rts2_cache_es/
.rts2_cache_umd/
# Optional REPL history
.node_repl_history
# Output of 'npm pack'
*.tgz
# Yarn Integrity file
.yarn-integrity
# dotenv environment variable files
.env
.env.development.local
.env.test.local
.env.production.local
.env.local
# parcel-bundler cache (https://parceljs.org/)
.cache
.parcel-cache
# Next.js build output
.next
next-env.d.ts
out
# Nuxt.js build / generate output
.nuxt
dist
dist_optimized
# Gatsby files
.cache/
# Comment in the public line in if your project uses Gatsby and not Next.js
# https://nextjs.org/blog/next-9-1#public-directory-support
# public
# vuepress build output
.vuepress/dist
# vuepress v2.x temp and cache directory
.temp
# Docusaurus cache and generated files
.docusaurus
# Serverless directories
.serverless/
# FuseBox cache
.fusebox/
# DynamoDB Local files
.dynamodb/
# TernJS port file
.tern-port
# Stores VSCode versions used for testing VSCode extensions
.vscode-test
# yarn v2
.yarn/cache
.yarn/unplugged
.yarn/build-state.yml
.yarn/install-state.gz
.pnp.*
### Node Patch ###
# Serverless Webpack directories
.webpack/
# Optional stylelint cache
# SvelteKit build / generate output
.svelte-kit
### SublimeText ###
# Cache files for Sublime Text
*.tmlanguage.cache
*.tmPreferences.cache
*.stTheme.cache
# Workspace files are user-specific
*.sublime-workspace
# Project files should be checked into the repository, unless a significant
# proportion of contributors will probably not be using Sublime Text
# *.sublime-project
# SFTP configuration file
sftp-config.json
sftp-config-alt*.json
# Package control specific files
Package Control.last-run
Package Control.ca-list
Package Control.ca-bundle
Package Control.system-ca-bundle
Package Control.cache/
Package Control.ca-certs/
Package Control.merged-ca-bundle
Package Control.user-ca-bundle
oscrypto-ca-bundle.crt
bh_unicode_properties.cache
# Sublime-github package stores a github token in this file
# https://packagecontrol.io/packages/sublime-github
GitHub.sublime-settings
### VisualStudioCode ###
.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
!.vscode/*.code-snippets
!.vscode/mcp.json
# Local History for Visual Studio Code
.history/
# Built Visual Studio Code Extensions
*.vsix
### VisualStudioCode Patch ###
# Ignore all local history of files
.history
.ionide
# CMake
cmake-build-*/
# File-based project format
*.iws
# IntelliJ
out/
# JIRA plugin
atlassian-ide-plugin.xml
# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties
### Windows ###
# Windows thumbnail cache files
Thumbs.db
Thumbs.db:encryptable
ehthumbs.db
ehthumbs_vista.db
# Dump file
*.stackdump
# Folder config file
[Dd]esktop.ini
# Recycle Bin used on file shares
$RECYCLE.BIN/
# Windows Installer files
*.cab
*.msi
*.msix
*.msm
*.msp
# Windows shortcuts
*.lnk
# End of https://www.toptal.com/developers/gitignore/api/node,macos,windows,webstorm,sublimetext,visualstudiocode
/build
.swc
app/(payload)/admin/importMap.js
test/admin-bar/app/(payload)/admin/importMap.js
/test/admin-bar/app/(payload)/admin/importMap.js
test/live-preview/app/(payload)/admin/importMap.js
/test/live-preview/app/(payload)/admin/importMap.js
test/admin-root/app/(payload)/admin/importMap.js
/test/admin-root/app/(payload)/admin/importMap.js
test/app/(payload)/admin/importMap.js
/test/app/(payload)/admin/importMap.js
test/plugin-ecommerce/app/(payload)/admin/importMap.js
/test/plugin-ecommerce/app/(payload)/admin/importMap.js
test/pnpm-lock.yaml
test/databaseAdapter.js
/filename-compound-index
/media-with-relation-preview
/media-without-relation-preview
/media-without-cache-tags
test/.localstack
test/google-cloud-storage
test/azurestoragedata/
/media-without-delete-access
/media-documents
/media-with-always-insert-fields
licenses.csv
# SQLite DB
payload.db
# Screenshots created by Playwright MCP
.playwright-mcp
# Vitest HTML report generated by test:eval:report
test/evals/eval-results/report.html
# Versioned eval run snapshots (local only — used for run comparison in dashboard)
test/evals/eval-results/runs/