This feature is experimental and may not be available on every plan.
What’s on the report
- Overall Ethics Score — a 0–100 gauge; scores below 60 on any evaluator flag the trace
- Traces Sampled — how many traces were evaluated, out of the total for that period
- Flagged Traces — count of flagged traces, broken out by high-severity findings
- Agents Evaluated — how many agents in the project were covered
- 30-Day Trend — a rolling chart with reference lines at 80 (good) and 60 (risk)
- Evaluator Breakdown — scores per evaluator (toxicity, bias, etc.), clickable to filter the flagged-traces table
- Agent Ethics Scores — per-agent comparison; agents scoring below 60 are called out for review
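As a rough illustration of how the 80 and 60 reference lines and the per-agent callout could be read, here is a minimal sketch. The function names, the "watch" label for the middle band, and the treatment of scores exactly at 80 or 60 are assumptions made for the example, not documented product behavior.

```python
GOOD_LINE = 80  # from the report: reference line labeled "good"
RISK_LINE = 60  # from the report: reference line labeled "risk"


def band(score):
    """Classify a 0-100 score against the two reference lines.

    Assumption: a score at or above 80 counts as "good", a score below 60
    counts as "risk", and anything in between gets a hypothetical "watch" label.
    """
    if score >= GOOD_LINE:
        return "good"
    if score < RISK_LINE:
        return "risk"
    return "watch"


def agents_needing_review(agent_scores):
    """Return the names of agents whose score is below 60, sorted alphabetically."""
    return sorted(name for name, score in agent_scores.items() if score < RISK_LINE)
```

For example, `agents_needing_review({"router": 72, "summarizer": 41})` would single out only `summarizer`.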
Methodology
- Sampling runs daily at 03:00 UTC against roughly 17% of the prior day’s traces
- Six LLM-judge evaluators each score every sampled trace on a 0–100 scale
- A score below 60 on any evaluator flags the trace for review
- Evaluation is fully asynchronous: it never blocks production traffic or adds latency to it
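The sampling and flagging rules above can be sketched in a few lines. This is an illustrative sketch only: uniform random sampling, the function names, and the shape of the scores dictionary are assumptions, and the actual sampling strategy is not described here.

```python
import random

FLAG_THRESHOLD = 60  # from the text: a score below 60 on any evaluator flags the trace
SAMPLE_RATE = 0.17   # from the text: roughly 17% of the prior day's traces


def sample_traces(trace_ids, rate=SAMPLE_RATE, seed=None):
    """Pick a subset of trace IDs (assumption: uniform random sampling)."""
    ids = list(trace_ids)
    rng = random.Random(seed)
    k = round(len(ids) * rate)
    return rng.sample(ids, k)


def is_flagged(scores, threshold=FLAG_THRESHOLD):
    """Return True if any evaluator score is strictly below the threshold.

    `scores` maps a hypothetical evaluator name (e.g. "toxicity") to a 0-100 score.
    """
    return any(score < threshold for score in scores.values())


def failing_evaluators(scores, threshold=FLAG_THRESHOLD):
    """List the evaluators that caused the flag, sorted by name."""
    return sorted(name for name, score in scores.items() if score < threshold)
```

Note that a single low score is enough to flag a trace: `is_flagged({"toxicity": 95, "bias": 59})` is true even though the other score is high, while a score of exactly 60 does not flag.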