Troubleshooting Guide

Use this flow when a run fails or scores look wrong.

1. Identify where failure happened

Check status first:

assert-ai results status <suite> <run>

Inspect manifest.json to confirm the failing stage.

2. Debug low-quality or failing judgments

Start with scores.jsonl and look for:

failing judge dimension
cited evidence turns
behavior category under test
trace/tool references (if present)

Then inspect matching rows in inference_set.jsonl to see the full prompt and response or traces from the AI system inference stage with the generated test set.

3. Debug stage inputs

Often if the inputs are too vague or low-quality, then the resulting output can also lead to failures in the evaluation. Refer to the guidance on structuring high quality inputs in the Best Practices and Limitations documentation.

systematize issues: inspect taxonomy.json
test_set issues: inspect test_set.jsonl and stratification dimensions
inference issues: inspect inference_set.jsonl events and outputs
judge issues: inspect scores.jsonl plus judge dimensions/rubrics in eval_config.yaml

4. Re-run only what changed

If you changed inputs for a stage, force rerun from that stage:

assert-ai run --config <config-path> --force-stage <stage-name>

Common examples:

changed behavior specification: --force-stage systematize
changed dimensions/sample sizing: --force-stage test_set
changed target: --force-stage inference
changed judge rubrics/model: --force-stage judge

5. Common root causes for failures

Missing model credentials in .env file
Target callable import path typo
Non-instrumented target when trace-level evidence is expected
Overly vague judge dimensions and rubrics causing weak verdict evidence
Stale artifacts reused without forcing the correct stage
Mismatched example paths after following older docs; prefer the current examples/prompt_agents/* configs for hosted-model tool execution flows

6. Helpful comparisons

Compare runs to spot regressions:

assert-ai results compare <suite> <run-a> <run-b>
assert-ai results compare-suites <suite-a>/<run-a> <suite-b>/<run-b>

7. Environment-specific fixes

macOS litellm install issue (AttributeError: module 'litellm' has no attribute 'acompletion'): some macOS security tooling can silently truncate the litellm wheel during extraction with uv sync. The pip install -e ".[otel,langgraph]" path above uses copy-based installs and avoids this. If you must use uv, grant your terminal Full Disk Access and run xattr -cr .venv to clear quarantine attributes.
Windows UnicodeEncodeError when running auto-trace demos: set $env:PYTHONUTF8 = "1" before python -m examples.phoenix_auto_trace.travel_openai.
Docker-backed Prompt Agent configs fail with docker daemon unavailable: ensure Docker Desktop is running for examples/prompt_agents/health_assistant_sandbox.yaml and examples/prompt_agents/health_assistant_external.yaml.