View results in Local UI Viewer Application

Portable local artifacts are useful on their own, but ASSERT also ships with a locally hosted web UI viewer that renders results and artifacts in a richer form for analysis.

Evaluation Suite list

This is the first page you'll land on. Use the Evaluation Suite list page to quickly scan the available suites and jump to a specific suite, which contains its taxonomy, test sets, and evaluation results.

Viewer suite list

You can also click the "New evaluation" button at the top right to walk through a guided wizard that authors an evaluation config and creates a new evaluation suite.

Create a new evaluation flow

The "Create new evaluation" flow guides you through selecting source artifacts and run settings. In just three steps, set up the whole evaluation pipeline and hit run.

1. Input specification

First, define your behavior name and description, or reuse one from a previous run in a different evaluation suite. Select the target type you'd like to evaluate (currently only hosted models and prompt agents are supported), and fill in the application context and system prompt used by your model or prompt agent.

Create new run step 1

2. Category and evaluation set

Next, set up your evaluation pipeline, which consists of the systematization, test set generation, and judge stages. Choose the model you want to use for each stage, along with its parameters.

Create new run step 2

3. Summary and submit

Finally, review your evaluation configuration and submit the run. You'll be redirected to a monitoring page where you can track the run's status and see when the pipeline has finished.

Create new run step 3
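As a rough mental model, the information the wizard collects across its three steps can be sketched as a small data structure. This is only an illustration: the class names, field names, and target type strings below are hypothetical and are not ASSERT's actual config schema, and the model names are placeholders.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the two target types the wizard currently supports.
ALLOWED_TARGET_TYPES = {"hosted_model", "prompt_agent"}


@dataclass
class StageConfig:
    """Model and parameters for one pipeline stage (hypothetical shape)."""
    model: str
    parameters: dict = field(default_factory=dict)


@dataclass
class EvaluationConfig:
    """Everything gathered in steps 1 and 2, reviewed in step 3 (hypothetical)."""
    # Step 1: input specification
    behavior_name: str
    behavior_description: str
    target_type: str
    application_context: str
    system_prompt: str
    # Step 2: one entry per pipeline stage
    systematization: StageConfig
    test_set_generation: StageConfig
    judge: StageConfig

    def validate(self) -> list:
        """Return a list of human-readable problems; empty means it looks OK."""
        problems = []
        if not self.behavior_name.strip():
            problems.append("behavior_name is empty")
        if self.target_type not in ALLOWED_TARGET_TYPES:
            problems.append(f"unsupported target_type: {self.target_type}")
        return problems


config = EvaluationConfig(
    behavior_name="example-behavior",
    behavior_description="Placeholder description of the behavior under test.",
    target_type="prompt_agent",
    application_context="Placeholder application context.",
    system_prompt="You are a helpful assistant.",
    systematization=StageConfig(model="placeholder-model"),
    test_set_generation=StageConfig(model="placeholder-model", parameters={"temperature": 0.7}),
    judge=StageConfig(model="placeholder-model"),
)
problems = config.validate()
print(problems)
```

Checking the config before submitting mirrors what the summary step is for: catching a missing field or an unsupported target type before the pipeline starts running.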

Evaluation suite overview tabs

The suite details page contains tabs for taxonomy, test set content, and run-level evaluation results.

1. Review taxonomy

Review the generated taxonomy and the encoded policies (permissible/not permissible), along with the definition of each behavior category and its policy label.

Suite taxonomy tab

2. Review generated test set

Browse the single-turn prompt test cases and multi-turn scenarios generated from the taxonomy. To share the test set, you can download it as a .csv file directly to your local machine.

Suite test set tab
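Once downloaded, the .csv test set can be inspected with any CSV tool. The following is a minimal sketch using Python's standard library; the column names (`test_id`, `category`, `kind`, `prompt`) and the sample rows are assumptions made for illustration, so check the header row of your actual export first.

```python
import csv
import io
from collections import Counter

# Stand-in for a downloaded file. The columns here are hypothetical.
SAMPLE_CSV = """test_id,category,kind,prompt
t1,category_a,single_turn,Example prompt one
t2,category_a,multi_turn,Example scenario two
t3,category_b,single_turn,Example prompt three
"""


def load_test_set(handle):
    """Read a CSV export into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


def count_by(rows, column):
    """Count test cases per distinct value of the given column."""
    return Counter(row[column] for row in rows)


rows = load_test_set(io.StringIO(SAMPLE_CSV))
by_category = count_by(rows, "category")
by_kind = count_by(rows, "kind")
print(by_category)
print(by_kind)
```

For a real export, pass an open file instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded file.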

3. Review evaluation results

Review all the evaluation runs completed in the evaluation suite, along with their high-level metrics.

Suite evaluation results overview

Evaluation run summary and result tables

Within a single run, the viewer exposes a high-level summary plus row-level result drilldowns.

Run summary

View all the rows of the evaluation run as a flat list or grouped by judge dimension.

Run rows view

When you click a specific row, the viewer opens a detailed view of that interaction, including the judge verdict, confidence, and citations.

Run verdict table

Compare runs

Use the compare view to inspect differences across runs side by side. Optionally, use the toggle to show only the rows where the runs disagree.

Compare runs view
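Conceptually, a disagreement between two runs is a row that appears in both but received different verdicts. The sketch below illustrates that idea on plain dictionaries. The row ids, the verdict labels, and the dict-of-verdicts shape are hypothetical, and this is not a description of how the viewer computes disagreements internally.

```python
def find_disagreements(run_a, run_b):
    """Return the sorted ids of rows present in both runs whose verdicts differ."""
    shared = run_a.keys() & run_b.keys()
    return sorted(row_id for row_id in shared if run_a[row_id] != run_b[row_id])


# Hypothetical verdicts keyed by row id.
run_a = {"row-1": "pass", "row-2": "fail", "row-3": "pass"}
run_b = {"row-1": "pass", "row-2": "pass", "row-4": "fail"}

disagreements = find_disagreements(run_a, run_b)
print(disagreements)  # ['row-2']
```

Rows that exist in only one run (`row-3` and `row-4` here) are left out of the result, since there is nothing to compare them against.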