Results and artifacts

ASSERT writes local artifacts and evaluation results under the artifacts folder, grouped by evaluation suite (the suite is configured in each evaluation config YAML):

artifacts/results/<suite>/

Run-level outputs are located under each evaluation suite:

artifacts/results/<suite>/<run>/

Artifact layout and description

artifacts/results/<suite>/
├── suite.json
├── taxonomy.json
├── test_set.jsonl
└── <run>/
    ├── manifest.json
    ├── config.yaml
    ├── inference_set.jsonl
    ├── scores.jsonl
    └── metrics.json

suite.json: evaluation suite metadata
taxonomy.json: behavior categories generated from your evaluation config YAML in the systematization step of the pipeline
test_set.jsonl: single-turn prompts and multi-turn scenario test cases generated by the test set generation step of the pipeline
manifest.json: stage-by-stage run status and timestamps
config.yaml: frozen config snapshot used for this run
inference_set.jsonl: target outputs plus trace references/events
scores.jsonl: per-case judge verdicts, dimensions, and evidence
metrics.json: aggregate rates by dimension and category, along with token usage metadata

Tip: After a run, start with metrics.json, then review scores.jsonl before inspecting inference_set.jsonl more closely.
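The reading order in the tip can be scripted. The following is a minimal sketch that assumes only the file names from the layout above. The helpers read_jsonl and load_run are hypothetical and not part of ASSERT, and no field names inside the files are assumed.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed object per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def load_run(run_dir):
    """Load run-level artifacts in the order suggested by the tip.

    Hypothetical helper: only the file names come from the artifact layout.
    """
    run_dir = Path(run_dir)
    # 1. Aggregate view first.
    metrics = json.loads((run_dir / "metrics.json").read_text(encoding="utf-8"))
    # 2. Per-case judge verdicts next.
    scores = read_jsonl(run_dir / "scores.jsonl")
    # 3. Raw target outputs last, for close inspection.
    inference = read_jsonl(run_dir / "inference_set.jsonl")
    return {"metrics": metrics, "scores": scores, "inference": inference}
```

Call load_run with the path to a run directory under artifacts/results/<suite>/, then inspect the returned dictionary to see which keys each file actually contains before relying on any of them.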

Interpreting dimension summaries

Boolean judge dimensions report clear and flagged counts plus a flagged rate. Ordinal dimensions report counts and percentages for each declared integer or string grade plus the median. Numeric scales also report the mean; named grades do not. Ordinal dimensions do not report a violation rate.
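To make the two summary types concrete, here is a minimal sketch of how such summaries could be computed from a list of graded values. It is not ASSERT's implementation: the function names and output keys are hypothetical, and using the lower median for an even number of rows is an assumption made so that named grades always map to a declared grade.

```python
from collections import Counter
from statistics import mean, median_low


def summarize_boolean(values):
    """values: list of bools, where True means the judge flagged the case."""
    flagged = sum(1 for v in values if v)
    clear = len(values) - flagged
    rate = flagged / len(values) if values else None
    return {"clear": clear, "flagged": flagged, "flagged_rate": rate}


def summarize_ordinal(values, grades):
    """values: observed grades; grades: declared grades in ascending order."""
    if not values:
        raise ValueError("no graded rows to summarize")
    counts = Counter(values)
    total = len(values)
    # Rank each value by its position in the declared scale so that
    # named grades such as "low" < "medium" < "high" can be ordered.
    ranks = [grades.index(v) for v in values]
    summary = {
        "counts": {g: counts[g] for g in grades},
        "percentages": {g: 100 * counts[g] / total for g in grades},
        # Assumption: lower median, so the result is always a declared grade.
        "median": grades[median_low(ranks)],
    }
    # Only numeric scales get a mean; named grades do not.
    if all(isinstance(g, int) for g in grades):
        summary["mean"] = mean(values)
    return summary
```

For example, grades 1, 2, 2, 5 on a declared 1 to 5 scale give a median of 2 and a mean of 2.5, while the same function applied to named grades returns no mean at all.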

Both summary types use applicable scored rows as the denominator. For dimensions configured with allow_not_applicable: true, rows where the judge returns null and dimension_applicability.<name>: false are counted as not applicable, preserved in scores.jsonl, and excluded from the graded denominator. Judge and pipeline failures are reported separately from semantic N/A.
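The denominator rule can be illustrated with a short sketch. The row shape used here is an assumption for illustration only: a "dimensions" mapping for judge values and an "error" key for failures are hypothetical, and only dimension_applicability comes from the description above. Check scores.jsonl for the real field names before adapting it.

```python
def partition_rows(rows, name):
    """Split rows for dimension `name` into graded, not applicable, and failed.

    Assumed (hypothetical) row shape:
    {"dimensions": {...}, "dimension_applicability": {...}, "error": str or None}
    """
    graded, not_applicable, failed = [], [], []
    for row in rows:
        value = row.get("dimensions", {}).get(name)
        applicable = row.get("dimension_applicability", {}).get(name, True)
        if row.get("error"):
            failed.append(row)            # judge or pipeline failure
        elif value is None and applicable is False:
            not_applicable.append(row)    # semantic N/A
        elif value is None:
            failed.append(row)            # null without an N/A marker (assumption)
        else:
            graded.append(row)
    return graded, not_applicable, failed


def flagged_rate(rows, name):
    """Flagged rate over graded rows only; N/A and failures are excluded."""
    graded, _, _ = partition_rows(rows, name)
    if not graded:
        return None
    return sum(1 for r in graded if r["dimensions"][name]) / len(graded)
```

With ten rows of which two are semantic N/A and one failed, the graded denominator is seven, so three flagged rows give a flagged rate of 3/7 rather than 3/10.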

The run viewer shows the full custom-grade distribution and groups semantic N/A plus execution failures under a Not graded row while preserving their separate counts:

[Figure: Custom rubric scale run summary]

Useful CLI commands for viewing results

assert-ai results list
assert-ai results status <suite>
assert-ai results status <suite> <run>
assert-ai results compare <suite> <run-a> <run-b>
assert-ai results compare-suites <suite-a>/<run-a> <suite-b>/<run-b>

See CLI Commands for full options.

View evaluation suite artifacts and run results in a local UI app

Use the local inspector and editing application to view run status and evaluation suite artifacts, such as a richly rendered taxonomy of behavior categories and their associated policy labels.

cd viewer
npm install
npm run dev

The locally hosted UI server starts at http://localhost:5174. Open this URL in your browser to view the inspector.