Troubleshooting Guide
Use this flow when a run fails or scores look wrong.
1. Identify where failure happened
Check status first:
assert-ai results status <suite> <run>
Inspect manifest.json to confirm the failing stage.
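As a rough aid for reading the manifest, the sketch below walks any JSON document and reports string values that look like failures. It assumes nothing about the manifest schema, which this guide does not describe; the helper names find_failure_markers and scan_manifest, and the "fail"/"error" keyword heuristic, are illustrative assumptions rather than part of assert-ai.

```python
import json
from pathlib import Path


def find_failure_markers(node, path="$"):
    """Recursively yield (json_path, value) for strings that look like failures."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_failure_markers(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_failure_markers(value, f"{path}[{index}]")
    elif isinstance(node, str) and any(w in node.lower() for w in ("fail", "error")):
        # Heuristic only: the real manifest may mark failures differently.
        yield path, node


def scan_manifest(manifest_path):
    """Load a manifest file and return every failure-looking entry."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    return list(find_failure_markers(manifest))
```

Calling scan_manifest on the run's manifest.json returns a list of (path, value) pairs, which may be enough to point you at the stage to inspect next.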
2. Debug low-quality or failing judgments
Start with scores.jsonl and look for:
- failing judge dimension
- cited evidence turns
- behavior category under test
- trace/tool references (if present)
Then inspect the matching rows in inference_set.jsonl to see the full prompt and response, or the traces, that the AI system produced during the inference stage on the generated test set.
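The lookup described above can be sketched as a small join between the two files. The field names test_id and verdict, and the failing value "fail", are hypothetical placeholders, since this guide does not document the row schema; adjust them to whatever keys your scores.jsonl and inference_set.jsonl actually contain.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def failing_rows_with_inference(scores, inference_rows, key="test_id",
                                verdict_field="verdict", failing_value="fail"):
    """Pair each failing score row with its inference row (None if no match).

    The default field names are assumptions, not the documented schema.
    """
    by_key = {row.get(key): row for row in inference_rows}
    return [
        (score, by_key.get(score.get(key)))
        for score in scores
        if score.get(verdict_field) == failing_value
    ]
```

A typical call would be failing_rows_with_inference(read_jsonl("scores.jsonl"), read_jsonl("inference_set.jsonl")), with the keyword arguments overridden to match your files.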
3. Debug stage inputs
Vague or low-quality stage inputs often produce outputs that fail later in the evaluation. For guidance on structuring high-quality inputs, see the Best Practices and Limitations documentation.
- systematize issues: inspect taxonomy.json
- test_set issues: inspect test_set.jsonl and stratification dimensions
- inference issues: inspect inference_set.jsonl events and outputs
- judge issues: inspect scores.jsonl plus judge dimensions/rubrics in eval_config.yaml
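Before reading any of these artifacts in detail, it can help to confirm they exist and are non-empty. The sketch below uses the stage and file names from the list above; it assumes all of them sit in one run directory, which may not hold for eval_config.yaml, and the function name missing_or_empty_artifacts is hypothetical.

```python
from pathlib import Path

# Stage-to-artifact mapping taken from the list above.
STAGE_ARTIFACTS = {
    "systematize": ["taxonomy.json"],
    "test_set": ["test_set.jsonl"],
    "inference": ["inference_set.jsonl"],
    "judge": ["scores.jsonl", "eval_config.yaml"],
}


def missing_or_empty_artifacts(run_dir):
    """Return {stage: [file names]} for artifacts that are absent or zero bytes."""
    run_dir = Path(run_dir)
    problems = {}
    for stage, names in STAGE_ARTIFACTS.items():
        bad = [
            name for name in names
            if not (run_dir / name).is_file() or (run_dir / name).stat().st_size == 0
        ]
        if bad:
            problems[stage] = bad
    return problems
```

An empty result means every listed file is present and non-empty; it says nothing about whether the contents are well-formed.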
4. Re-run only what changed
If you changed inputs for a stage, force rerun from that stage:
assert-ai run --config <config-path> --force-stage <stage-name>
Common examples:
- changed behavior specification: --force-stage systematize
- changed dimensions/sample sizing: --force-stage test_set
- changed target: --force-stage inference
- changed judge rubrics/model: --force-stage judge
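The mapping above can be captured in a small helper that picks the earliest affected stage when several inputs changed at once. This is only a sketch: the stage order systematize, test_set, inference, judge is inferred from the order used in this guide, and rerun_command is a hypothetical helper, not part of the assert-ai CLI.

```python
import shlex

# Change type -> stage to force, as listed above.
FORCE_STAGE_FOR_CHANGE = {
    "behavior specification": "systematize",
    "dimensions/sample sizing": "test_set",
    "target": "inference",
    "judge rubrics/model": "judge",
}

# Assumed pipeline order (inferred from this guide, not confirmed elsewhere).
STAGE_ORDER = ["systematize", "test_set", "inference", "judge"]


def rerun_command(config_path, changes):
    """Build the assert-ai rerun command for the earliest affected stage."""
    stages = [FORCE_STAGE_FOR_CHANGE[change] for change in changes]
    earliest = min(stages, key=STAGE_ORDER.index)
    return shlex.join(
        ["assert-ai", "run", "--config", config_path, "--force-stage", earliest]
    )
```

For example, if both the target and the judge rubrics changed, the helper selects inference, on the assumption that forcing an earlier stage also refreshes the stages after it.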
5. Common root causes for failures
- Missing model credentials in .env file
- Target callable import path typo
- Non-instrumented target when trace-level evidence is expected
- Overly vague judge dimensions and rubrics causing weak verdict evidence
- Stale artifacts reused without forcing the correct stage
- Mismatched example paths after following older docs; prefer the current examples/prompt_agents/* configs for hosted-model tool execution flows
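An import path typo in the target callable can usually be caught before a full run by resolving the path directly. The sketch below assumes the path is written as either package.module:attr or package.module.attr; this guide does not say which form the config expects, and resolve_callable is a hypothetical helper.

```python
import importlib


def resolve_callable(import_path):
    """Resolve 'package.module:attr' or 'package.module.attr' to a callable."""
    if ":" in import_path:
        module_name, _, attr = import_path.partition(":")
    else:
        module_name, _, attr = import_path.rpartition(".")
    if not module_name or not attr:
        raise ValueError(f"Not a valid import path: {import_path!r}")
    # Raises ModuleNotFoundError if the module part is misspelled.
    module = importlib.import_module(module_name)
    # Raises AttributeError if the attribute part is misspelled.
    target = getattr(module, attr)
    if not callable(target):
        raise TypeError(f"{import_path!r} resolved to a non-callable object")
    return target
```

Run it from the same working directory and virtual environment as the evaluation, since import resolution depends on both.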
6. Helpful comparisons
Compare runs to spot regressions:
assert-ai results compare <suite> <run-a> <run-b>
assert-ai results compare-suites <suite-a>/<run-a> <suite-b>/<run-b>
7. Environment-specific fixes
- macOS litellm install issue (AttributeError: module 'litellm' has no attribute 'acompletion'): some macOS security tooling can silently truncate the litellm wheel during extraction with uv sync. The pip install -e ".[otel,langgraph]" path above uses copy-based installs and avoids this. If you must use uv, grant your terminal Full Disk Access and run xattr -cr .venv to clear quarantine attributes.
- Windows UnicodeEncodeError when running auto-trace demos: set $env:PYTHONUTF8 = "1" before python -m examples.phoenix_auto_trace.travel_openai.
- Docker-backed Prompt Agent configs fail with docker daemon unavailable: ensure Docker Desktop is running for examples/prompt_agents/health_assistant_sandbox.yaml and examples/prompt_agents/health_assistant_external.yaml.
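The three conditions above can be checked up front with a short preflight script. This is a minimal sketch with hypothetical helper names; it only detects the symptoms (a truncated litellm install, a non-UTF-8 stdout, an unreachable Docker daemon) and does not apply any of the fixes.

```python
import importlib
import shutil
import subprocess
import sys


def module_has_attr(module_name, attr):
    """True if the module imports and exposes attr (e.g. litellm / acompletion)."""
    try:
        return hasattr(importlib.import_module(module_name), attr)
    except ImportError:
        return False


def stdout_is_utf8():
    """True if stdout reports a UTF-8 encoding (relevant to the Windows fix)."""
    encoding = getattr(sys.stdout, "encoding", None) or ""
    return encoding.lower().replace("-", "") == "utf8"


def docker_daemon_available(timeout=10):
    """True if the docker CLI is on PATH and `docker info` exits with status 0."""
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"], capture_output=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

For instance, module_has_attr("litellm", "acompletion") returning False in the project's virtual environment would be consistent with the truncated-wheel issue described for macOS.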