mirror of
https://github.com/payloadcms/payload.git
synced 2026-03-25 13:34:28 +00:00
## Overview
The suite tests two complementary things:
- **QA evals** — does the model correctly answer questions about
Payload's API and conventions?
- **Codegen evals** — can the model apply a specific change to a real
`payload.config.ts` file, producing valid TypeScript with the right
outcome?
Codegen evals use a three-step pipeline: `LLM generation` → `TypeScript
compilation` → `LLM scoring`.
## Skills Evaluation
Each QA suite runs in two modes to measure the impact of injecting
`SKILL.md` as passive context:
| Spec file | System prompt | Purpose |
| ------------------------------- | --------------------------------- |
----------------------- |
| `eval.<suite>.spec.ts` | `qaWithSkill` — SKILL.md injected | Primary
eval |
| `eval.<suite>.baseline.spec.ts` | `qaNoSkill` — no context doc |
Baseline for comparison |
Both modes are passive context injection (the document goes directly
into the `system:` field). There is no tool-call indirection. The delta
between the two is a direct measure of what SKILL.md contributes.
> Cache keys include `systemPromptKey`, so `qaWithSkill` and `qaNoSkill`
results are always stored as separate entries and never collide.
## Running the evals
```bash
# Run all evals (with skill, high-power model)
pnpm run test:eval
# Run all evals — baseline (no skill context, high-power model)
pnpm run test:eval -- eval.baseline
# Run a specific suite only
pnpm run test:eval -- eval.config
pnpm run test:eval -- eval.conventions
# Force a fresh run, bypassing the result cache
EVAL_NO_CACHE=true pnpm run test:eval
# Run with an interactive HTML report (opens in browser after run)
pnpm run test:eval:report
# Report for a specific suite
pnpm run test:eval:report -- eval.config
```
`OPENAI_API_KEY` must be set in your environment.
The `test:eval:report` script generates
`test/evals/eval-results/report.html` and serves it locally via Vitest
UI. The file is gitignored.
## Pipelines
### QA Pipeline
```mermaid
flowchart LR
qaCase["EvalCase"]
optFixture["fixture"]
systemPrompt["system prompt\n(qaWithSkill or qaNoSkill)"]
runEval["runEval"]
scoreAnswer["scoreAnswer"]
qaResult["EvalResult"]
qaCase --> runEval
optFixture -->|"injected into prompt"| runEval
systemPrompt --> runEval
runEval --> scoreAnswer
scoreAnswer --> qaResult
```
### Codegen Pipeline
```mermaid
flowchart LR
codegenCase["CodegenEvalCase"]
fixture["fixture"]
runCodegenEval["runCodegenEval"]
tsc["validateConfigTypes"]
scoreConfigChange["scoreConfigChange"]
codegenResult["EvalResult"]
codegenCase --> fixture
fixture --> runCodegenEval
runCodegenEval --> tsc
tsc -->|"valid"| scoreConfigChange
tsc -->|"invalid"| codegenResult
scoreConfigChange --> codegenResult
```
> The tsc check is the hard gate — if the generated TypeScript does not
compile, the case fails immediately without calling the scorer. This
keeps the scorer focused on semantic correctness rather than syntax
errors.
> Codegen always uses the `configModify` system prompt regardless of
skill variant. Codegen cache keys do not include `systemPromptKey`, so
codegen results are shared between `with-skill` and `baseline` runs —
this is intentional and correct.
### Result Caching
```mermaid
flowchart LR
start["Eval"]
cacheCheck{"cache hit?"}
cached["cached EvalResult"]
run["Run full pipeline"]
write["eval-results/cache/<hash>.json"]
done["EvalResult"]
start --> cacheCheck
cacheCheck -->|"yes + EVAL_NO_CACHE unset"| cached
cacheCheck -->|"no or EVAL_NO_CACHE=true"| run
run --> write
write --> done
cached --> done
```
Cache keys include the model ID and (for QA) the `systemPromptKey`, so
the following never collide:
- `eval.spec.ts` (gpt-5.2 + qaWithSkill)
- `eval.baseline.spec.ts` (gpt-5.2 + qaNoSkill)
- `eval.low-power.spec.ts` (gpt-4o + qaWithSkill)
## Token Usage Tracking
Every `EvalResult` includes a `usage` object covering all LLM calls for
that case:
```jsonc
{
"result": {
"pass": true,
"score": 0.92,
"usage": {
"runner": {
"inputTokens": 3499,
"cachedInputTokens": 3328,
"outputTokens": 280,
"totalTokens": 3779,
},
"scorer": {
"inputTokens": 669,
"cachedInputTokens": 0,
"outputTokens": 89,
"totalTokens": 758,
},
"total": {
"inputTokens": 4168,
"cachedInputTokens": 3328,
"outputTokens": 369,
"totalTokens": 4537,
},
},
},
}
```
- **`runner`** — tokens spent generating the answer or modified config.
- **`scorer`** — tokens spent evaluating the result (consistent across
skill variants since the scorer prompt is fixed).
- **`total`** — sum of runner + scorer for full per-case cost.
- **`cachedInputTokens`** — the key signal for skill efficiency.
`qaWithSkill` injects SKILL.md (~3,400 tokens) into every system prompt.
Once the API warms the prompt cache, ~95% of those tokens are
`cachedInputTokens` (billed at a reduced rate), so the net new tokens
per call drops to ~170 — nearly identical to the `qaNoSkill` baseline.
For codegen cases that fail tsc, `scorer` is absent and `total` equals
`runner`.
Usage is stored in the cache alongside the result, so historical runs
retain their token data for cost comparisons across model variants and
skill configurations.
## Negative Tests
The negative suite tests the evaluation pipeline itself as much as the
model:
| Test | What it checks |
| ------------------------ |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
|
| **Detection (QA)** | Given a broken config, does the model identify
the specific error? Expects ≥ 70% accuracy. |
| **Correction (Codegen)** | Given a broken config, does the model fix
the error? tsc must pass after correction. |
| **Invalid instruction** | The model is explicitly told to introduce a
bad field type. The test passes only if tsc catches the error and the
pipeline correctly reports it as a failure. |
The three broken fixtures (`invalid-field-type`,
`invalid-access-return`, `missing-beforechange-return`) are shared by
both the detection and correction datasets.
## Adding a new eval case
**QA case** — add an entry to the appropriate
`datasets/<category>/qa.ts`:
```typescript
{
input: 'How do you configure Payload to send emails?',
expected: 'set the email property in buildConfig with an adapter like nodemailerAdapter',
category: 'config',
}
```
**Codegen case** — create a fixture first, then add the dataset entry:
1. Add `test/evals/fixtures/<category>/codegen/<name>/payload.config.ts`
— a minimal but valid config that gives the LLM context for the specific
task.
2. Add an entry to `datasets/<category>/codegen.ts`:
```typescript
{
input: 'Add a text field named "excerpt" to the posts collection.',
expected: 'text field with name "excerpt" added to posts.fields',
category: 'collections',
fixturePath: 'collections/codegen/<name>',
}
```
The cache key for codegen includes the fixture file's **content** (not
just its path), so updating a fixture automatically invalidates its
cached result.
## Admin
The admin interface for evals has a way of inspecting cached results.
<img width="2318" height="149" alt="image"
src="https://github.com/user-attachments/assets/c8c87387-e65f-40b5-8a8f-54701e26a3c7"
/>
This gives users the ability to find improvements, regressions, and
better understand model capabilities.
<img width="2343" height="794" alt="image"
src="https://github.com/user-attachments/assets/61b41c8c-4802-40c3-a81c-115ed309ae3d"
/>
## Debugging failed cases
Every failed case writes a JSON file to
`eval-results/failed-assertions/<label-slug>/`. For codegen cases this
includes the starter config, the LLM-generated config, tsc errors (if
any), and the scorer's reasoning. For QA cases it includes the question,
expected answer, actual answer, and reasoning.
The generated `.ts` files in `eval-results/<category>/codegen/` show the
last LLM output for each fixture and can be opened directly in the
editor for manual inspection.
---------
Co-authored-by: Elliot DeNolf <denolfe@gmail.com>
366 lines
6.5 KiB
Plaintext
366 lines
6.5 KiB
Plaintext
coverage
|
|
package-lock.json
|
|
dist
|
|
/.idea/*
|
|
!/.idea/runConfigurations
|
|
/.idea/runConfigurations/_template*
|
|
!/.idea/payload.iml
|
|
|
|
# Local AI Agent files
|
|
AGENTS.local.md
|
|
CLAUDE.local.md
|
|
.claude/commands/*.local.md
|
|
.claude/artifacts
|
|
.worktrees/
|
|
|
|
# Custom actions
|
|
!.github/actions/**/dist
|
|
|
|
test/packed
|
|
test-results
|
|
.devcontainer
|
|
.localstack
|
|
/migrations
|
|
.localstack
|
|
.turbo
|
|
|
|
meta_client.json
|
|
meta_server.json
|
|
meta_index.json
|
|
meta_shared.json
|
|
|
|
packages/payload/esbuild
|
|
packages/ui/esbuild
|
|
packages/next/esbuild
|
|
packages/richtext-lexical/esbuild
|
|
|
|
audit_output.json
|
|
|
|
.turbo
|
|
.wrangler
|
|
|
|
# Ignore test directory media folder/files
|
|
/media
|
|
test/media
|
|
*payloadtests.db
|
|
*payloadtests.db-journal
|
|
*payloadtests.db-shm
|
|
*payloadtests.db-wal
|
|
*payload.db
|
|
/versions
|
|
no-restrict-file-*
|
|
|
|
# Created by https://www.toptal.com/developers/gitignore/api/node,macos,windows,webstorm,sublimetext,visualstudiocode
|
|
# Edit at https://www.toptal.com/developers/gitignore?templates=node,macos,windows,webstorm,sublimetext,visualstudiocode
|
|
|
|
### macOS ###
|
|
# General
|
|
.DS_Store
|
|
.AppleDouble
|
|
.LSOverride
|
|
|
|
# Thumbnails
|
|
._*
|
|
|
|
# Files that might appear in the root of a volume
|
|
.DocumentRevisions-V100
|
|
.fseventsd
|
|
.Spotlight-V100
|
|
.TemporaryItems
|
|
.Trashes
|
|
.VolumeIcon.icns
|
|
.com.apple.timemachine.donotpresent
|
|
|
|
# Directories potentially created on remote AFP share
|
|
.AppleDB
|
|
.AppleDesktop
|
|
Network Trash Folder
|
|
Temporary Items
|
|
.apdisk
|
|
|
|
### macOS Patch ###
|
|
# iCloud generated files
|
|
*.icloud
|
|
|
|
### Node ###
|
|
# Logs
|
|
logs
|
|
*.log
|
|
npm-debug.log*
|
|
yarn-debug.log*
|
|
yarn-error.log*
|
|
lerna-debug.log*
|
|
.pnpm-debug.log*
|
|
|
|
# Diagnostic reports (https://nodejs.org/api/report.html)
|
|
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json
|
|
|
|
# Runtime data
|
|
pids
|
|
*.pid
|
|
*.seed
|
|
*.pid.lock
|
|
|
|
# Directory for instrumented libs generated by jscoverage/JSCover
|
|
lib-cov
|
|
|
|
# Coverage directory used by tools like istanbul
|
|
coverage
|
|
*.lcov
|
|
|
|
# nyc test coverage
|
|
.nyc_output
|
|
|
|
# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
|
|
.grunt
|
|
|
|
# Bower dependency directory (https://bower.io/)
|
|
bower_components
|
|
|
|
# node-waf configuration
|
|
.lock-wscript
|
|
|
|
# Compiled binary addons (https://nodejs.org/api/addons.html)
|
|
build/Release
|
|
|
|
# Dependency directories
|
|
node_modules/
|
|
jspm_packages/
|
|
|
|
# Snowpack dependency directory (https://snowpack.dev/)
|
|
web_modules/
|
|
|
|
# TypeScript cache
|
|
*.tsbuildinfo
|
|
|
|
# Optional npm cache directory
|
|
.npm
|
|
|
|
# Optional eslint cache
|
|
.eslintcache
|
|
|
|
# Optional stylelint cache
|
|
.stylelintcache
|
|
|
|
# Microbundle cache
|
|
.rpt2_cache/
|
|
.rts2_cache_cjs/
|
|
.rts2_cache_es/
|
|
.rts2_cache_umd/
|
|
|
|
# Optional REPL history
|
|
.node_repl_history
|
|
|
|
# Output of 'npm pack'
|
|
*.tgz
|
|
|
|
# Yarn Integrity file
|
|
.yarn-integrity
|
|
|
|
# dotenv environment variable files
|
|
.env
|
|
.env.development.local
|
|
.env.test.local
|
|
.env.production.local
|
|
.env.local
|
|
|
|
# parcel-bundler cache (https://parceljs.org/)
|
|
.cache
|
|
.parcel-cache
|
|
|
|
# Next.js build output
|
|
.next
|
|
next-env.d.ts
|
|
out
|
|
|
|
# Nuxt.js build / generate output
|
|
.nuxt
|
|
dist
|
|
dist_optimized
|
|
|
|
# Gatsby files
|
|
.cache/
|
|
# Comment in the public line in if your project uses Gatsby and not Next.js
|
|
# https://nextjs.org/blog/next-9-1#public-directory-support
|
|
# public
|
|
|
|
# vuepress build output
|
|
.vuepress/dist
|
|
|
|
# vuepress v2.x temp and cache directory
|
|
.temp
|
|
|
|
# Docusaurus cache and generated files
|
|
.docusaurus
|
|
|
|
# Serverless directories
|
|
.serverless/
|
|
|
|
# FuseBox cache
|
|
.fusebox/
|
|
|
|
# DynamoDB Local files
|
|
.dynamodb/
|
|
|
|
# TernJS port file
|
|
.tern-port
|
|
|
|
# Stores VSCode versions used for testing VSCode extensions
|
|
.vscode-test
|
|
|
|
# yarn v2
|
|
.yarn/cache
|
|
.yarn/unplugged
|
|
.yarn/build-state.yml
|
|
.yarn/install-state.gz
|
|
.pnp.*
|
|
|
|
### Node Patch ###
|
|
# Serverless Webpack directories
|
|
.webpack/
|
|
|
|
# Optional stylelint cache
|
|
|
|
# SvelteKit build / generate output
|
|
.svelte-kit
|
|
|
|
### SublimeText ###
|
|
# Cache files for Sublime Text
|
|
*.tmlanguage.cache
|
|
*.tmPreferences.cache
|
|
*.stTheme.cache
|
|
|
|
# Workspace files are user-specific
|
|
*.sublime-workspace
|
|
|
|
# Project files should be checked into the repository, unless a significant
|
|
# proportion of contributors will probably not be using Sublime Text
|
|
# *.sublime-project
|
|
|
|
# SFTP configuration file
|
|
sftp-config.json
|
|
sftp-config-alt*.json
|
|
|
|
# Package control specific files
|
|
Package Control.last-run
|
|
Package Control.ca-list
|
|
Package Control.ca-bundle
|
|
Package Control.system-ca-bundle
|
|
Package Control.cache/
|
|
Package Control.ca-certs/
|
|
Package Control.merged-ca-bundle
|
|
Package Control.user-ca-bundle
|
|
oscrypto-ca-bundle.crt
|
|
bh_unicode_properties.cache
|
|
|
|
# Sublime-github package stores a github token in this file
|
|
# https://packagecontrol.io/packages/sublime-github
|
|
GitHub.sublime-settings
|
|
|
|
### VisualStudioCode ###
|
|
.vscode/*
|
|
!.vscode/settings.json
|
|
!.vscode/tasks.json
|
|
!.vscode/launch.json
|
|
!.vscode/extensions.json
|
|
!.vscode/*.code-snippets
|
|
!.vscode/mcp.json
|
|
|
|
# Local History for Visual Studio Code
|
|
.history/
|
|
|
|
# Built Visual Studio Code Extensions
|
|
*.vsix
|
|
|
|
### VisualStudioCode Patch ###
|
|
# Ignore all local history of files
|
|
.history
|
|
.ionide
|
|
|
|
# CMake
|
|
cmake-build-*/
|
|
|
|
# File-based project format
|
|
*.iws
|
|
|
|
# IntelliJ
|
|
out/
|
|
|
|
# JIRA plugin
|
|
atlassian-ide-plugin.xml
|
|
|
|
# Crashlytics plugin (for Android Studio and IntelliJ)
|
|
com_crashlytics_export_strings.xml
|
|
crashlytics.properties
|
|
crashlytics-build.properties
|
|
fabric.properties
|
|
|
|
### Windows ###
|
|
# Windows thumbnail cache files
|
|
Thumbs.db
|
|
Thumbs.db:encryptable
|
|
ehthumbs.db
|
|
ehthumbs_vista.db
|
|
|
|
# Dump file
|
|
*.stackdump
|
|
|
|
# Folder config file
|
|
[Dd]esktop.ini
|
|
|
|
# Recycle Bin used on file shares
|
|
$RECYCLE.BIN/
|
|
|
|
# Windows Installer files
|
|
*.cab
|
|
*.msi
|
|
*.msix
|
|
*.msm
|
|
*.msp
|
|
|
|
# Windows shortcuts
|
|
*.lnk
|
|
|
|
# End of https://www.toptal.com/developers/gitignore/api/node,macos,windows,webstorm,sublimetext,visualstudiocode
|
|
|
|
/build
|
|
.swc
|
|
app/(payload)/admin/importMap.js
|
|
test/admin-bar/app/(payload)/admin/importMap.js
|
|
/test/admin-bar/app/(payload)/admin/importMap.js
|
|
test/live-preview/app/(payload)/admin/importMap.js
|
|
/test/live-preview/app/(payload)/admin/importMap.js
|
|
test/admin-root/app/(payload)/admin/importMap.js
|
|
/test/admin-root/app/(payload)/admin/importMap.js
|
|
test/app/(payload)/admin/importMap.js
|
|
/test/app/(payload)/admin/importMap.js
|
|
test/plugin-ecommerce/app/(payload)/admin/importMap.js
|
|
/test/plugin-ecommerce/app/(payload)/admin/importMap.js
|
|
test/pnpm-lock.yaml
|
|
test/databaseAdapter.js
|
|
/filename-compound-index
|
|
/media-with-relation-preview
|
|
/media-without-relation-preview
|
|
/media-without-cache-tags
|
|
test/.localstack
|
|
test/google-cloud-storage
|
|
test/azurestoragedata/
|
|
/media-without-delete-access
|
|
/media-documents
|
|
/media-with-always-insert-fields
|
|
|
|
|
|
|
|
licenses.csv
|
|
|
|
# SQLite DB
|
|
payload.db
|
|
|
|
# Screenshots created by Playwright MCP
|
|
.playwright-mcp
|
|
|
|
# Vitest HTML report generated by test:eval:report
|
|
test/evals/eval-results/report.html
|
|
# Versioned eval run snapshots (local only — used for run comparison in dashboard)
|
|
test/evals/eval-results/runs/
|