Run the Local UI Viewer Application

ASSERT includes a local web app for browsing and visualizing evaluation artifacts. It reads directly from artifacts/results/ and supports evaluation suite browsing, run analysis, and live run monitoring.

The viewer reads from the filesystem on each request. There is no database or run-launch API.

Prerequisites

Node.js 18+
Evaluation artifacts in artifacts/results/ (produced by assert-ai run)

Run in development

cd viewer
npm install
npm run dev

The dev server starts at http://localhost:5174.

What the viewer shows

suite list with taxonomy and test-case counts
taxonomy browser
prompt browser (single-turn cases)
scenario browser (multi-turn transcripts)
run comparison views
dimension breakdowns
inference preview while runs are in progress
live run monitor from manifest.json

For more information on the layout of the local UI viewer application, see how to use the local viewer.

Build and preview

cd viewer
npm run build
npm run preview

Type checking

cd viewer
npm run check

Required artifacts

The viewer expects this layout per evaluation suite:

artifacts/results/<suite>/
├── taxonomy.json
├── systematization.json   # optional
├── test_set.jsonl
├── suite.json
└── <run>/
    ├── manifest.json
    ├── config.yaml
    ├── inference_set.jsonl
    ├── scores.jsonl
    ├── viewer_run_manifest.json        # completed judged runs
    ├── viewer_prompt_rows.json         # completed judged runs
    ├── viewer_audit_rows.json          # completed judged runs
    ├── viewer_transcript_index.json    # completed inferences
    └── viewer_score_index.json         # completed judged runs

Files that an incomplete run has not produced yet are treated as missing, not as errors, where appropriate. Invalid JSON, JSONL, or YAML is always treated as an artifact error and should be fixed or regenerated.
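The missing-versus-invalid distinction can be sketched as a small tagged result type. This is only an illustration, not the code in src/lib/server/artifacts.ts: the ArtifactResult type and the classifyJsonArtifact function are hypothetical names, and the sketch assumes the caller has already read the file and passes null when it does not exist.

```typescript
// Hypothetical result type; the viewer's real types may differ.
type ArtifactResult<T> =
  | { status: "ok"; value: T }
  | { status: "missing" }
  | { status: "invalid"; reason: string };

// `raw` is the file's text, or null when the file is absent on disk.
function classifyJsonArtifact<T>(raw: string | null): ArtifactResult<T> {
  // An absent file is a normal state for an incomplete run.
  if (raw === null) return { status: "missing" };
  try {
    return { status: "ok", value: JSON.parse(raw) as T };
  } catch (err) {
    // A file that exists but does not parse is an artifact error.
    const reason = err instanceof Error ? err.message : String(err);
    return { status: "invalid", reason };
  }
}
```

Keeping "missing" and "invalid" as separate variants lets a page render a placeholder for an incomplete run while still surfacing a corrupted artifact as an error.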

One exception exists for live inference: while manifest.stages.inference == "running", the viewer tolerates one malformed trailing segment in inference_set.jsonl so it can read already-written rows before the current append finishes.
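One way to implement this tolerance is sketched below, assuming one JSON object per line. The parseJsonl function, its tolerateTrailing flag, and the JsonlParseResult type are hypothetical names chosen for illustration, not the viewer's actual API.

```typescript
interface JsonlParseResult {
  rows: unknown[];
  droppedTrailing: boolean; // true when a partial last record was skipped
}

function parseJsonl(text: string, tolerateTrailing: boolean): JsonlParseResult {
  // Ignore blank lines, including the empty string after a final newline.
  const lines = text.split("\n").filter((line) => line.trim() !== "");
  const rows: unknown[] = [];
  for (let i = 0; i < lines.length; i++) {
    try {
      rows.push(JSON.parse(lines[i]));
    } catch {
      const isLast = i === lines.length - 1;
      // Only the final record may be malformed, and only when allowed.
      if (tolerateTrailing && isLast) {
        return { rows, droppedTrailing: true };
      }
      throw new Error(`Invalid JSONL record ${i + 1}`);
    }
  }
  return { rows, droppedTrailing: false };
}
```

A caller would pass tolerateTrailing as true only while manifest.stages.inference is "running". A malformed record in the middle of the file still throws, because an in-progress append can only explain a broken final segment.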

Read-model behavior and refresh

Completed judged runs are served from run-level viewer read-model files, not by rescanning canonical JSONL on every request.

If viewer_run_manifest.json is missing or stale, rebuild the read-model files by re-running the judge stage for that run:

assert-ai run --config artifacts/results/<suite>/<run>/config.yaml --resume --force-stage judge

Expected verdict contract

The viewer expects each successful score row to include:

verdict.dimensions with binary event flags or declared ordinal integer/string grades for the configured judge dimensions (by default policy_violation and overrefusal are binary). Dimensions configured with allow_not_applicable: true may use null plus verdict.dimension_applicability.<name>: false for N/A cases.
dimension_scales when a score row contains ordinal grades. The viewer reads the scale and grade labels from the artifact instead of relying on the current config or a static dimension registry.
verdict.dimension_applicability when any dimension is N/A; aggregate metrics exclude N/A rows from that dimension's denominator.
verdict.dimension_justifications for every dimension in verdict.dimensions
verdict.node_judgments in taxonomy order with node_name matching taxonomy.json names
verdict.citations used by inline [N] evidence markers
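The denominator rule for N/A rows can be illustrated for a binary dimension as follows. This is a sketch under assumptions: the Verdict interface only models the two fields named above, and binaryRate is a hypothetical helper, not a function from src/lib/server/metrics.ts.

```typescript
interface Verdict {
  dimensions: Record<string, boolean | number | string | null>;
  dimension_applicability?: Record<string, boolean>;
}

// Fraction of applicable rows where the binary flag is true.
// Returns null when no row is applicable for the dimension.
function binaryRate(verdicts: Verdict[], dimension: string): number | null {
  let applicable = 0;
  let flagged = 0;
  for (const verdict of verdicts) {
    // Rows marked N/A do not count toward this dimension's denominator.
    const isNotApplicable = verdict.dimension_applicability?.[dimension] === false;
    if (isNotApplicable) continue;
    applicable += 1;
    if (verdict.dimensions[dimension] === true) flagged += 1;
  }
  return applicable === 0 ? null : flagged / applicable;
}
```

With one flagged row, one unflagged row, and one N/A row, the rate is 1/2 rather than 1/3, because the N/A row is left out of the denominator.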

Rows that fail this strict contract (for example, policy_compliance-only rows) are not treated as valid scored judgments.
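A partial contract check might look like the sketch below. It covers only some of the rules listed above (non-empty dimensions, a justification per dimension, the null-plus-applicability rule, and the presence of node_judgments), and it assumes a policy_compliance-only row is one whose verdict carries that single field. The contractProblems and isRecord helpers are hypothetical names.

```typescript
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of contract problems; an empty list means the checks passed.
function contractProblems(verdict: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const dims = verdict["dimensions"];
  if (!isRecord(dims) || Object.keys(dims).length === 0) {
    problems.push("verdict.dimensions is missing or empty");
    return problems;
  }
  const justifications = verdict["dimension_justifications"];
  const applicability = verdict["dimension_applicability"];
  for (const name of Object.keys(dims)) {
    if (!isRecord(justifications) || !(name in justifications)) {
      problems.push(`missing justification for ${name}`);
    }
    // A null grade is only valid when the dimension is marked N/A.
    if (dims[name] === null) {
      if (!isRecord(applicability) || applicability[name] !== false) {
        problems.push(`${name} is null without dimension_applicability false`);
      }
    }
  }
  if (!Array.isArray(verdict["node_judgments"])) {
    problems.push("verdict.node_judgments is missing");
  }
  return problems;
}
```

Returning a list of problems instead of a boolean makes it easier to report why a row was rejected.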

Evidence drawer behavior

Explanation text can contain [N] citation chips that jump to cited transcript messages and highlight stored spans. Turn labels remain visible, but Turn N is not linkified, and the old separate Evidence block is not used for new structured artifacts.
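As an illustration of how [N] markers could be separated from the surrounding text before rendering, the sketch below splits an explanation into text and citation segments. The Segment type and splitCitations function are hypothetical; the viewer's real citation helpers under src/lib/ may work differently.

```typescript
type Segment =
  | { kind: "text"; text: string }
  | { kind: "citation"; index: number };

function splitCitations(explanation: string): Segment[] {
  const segments: Segment[] = [];
  const pattern = /\[(\d+)\]/g; // matches [1], [12], ...
  let cursor = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(explanation)) !== null) {
    // Emit any plain text between the previous marker and this one.
    if (match.index > cursor) {
      segments.push({ kind: "text", text: explanation.slice(cursor, match.index) });
    }
    segments.push({ kind: "citation", index: Number(match[1]) });
    cursor = match.index + match[0].length;
  }
  // Emit the text that follows the last marker, if any.
  if (cursor < explanation.length) {
    segments.push({ kind: "text", text: explanation.slice(cursor) });
  }
  return segments;
}
```

Because only bracketed numbers match, a label such as "Turn 2" stays inside a text segment and is never turned into a link.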

Code layout

src/lib/server/artifacts.ts: artifact reads, path validation, and missing-vs-invalid handling
src/lib/server/data.ts: page-facing view models
src/lib/server/metrics.ts: prompt/scenario aggregates
src/lib/server/run-status.ts: live monitor payloads from manifest.json
src/routes/*: route handlers and page orchestration
src/lib/*: shared UI helpers (citations, audit grouping, run polling, suite grouping)