This feature is experimental and may not be available on every plan.
Each night, Gegentic samples a portion of the previous day’s traces for every project and scores them against six LLM-judge criteria: toxicity, demographic bias, sentiment consistency, fairness, misinformation, and privacy leakage. Evaluation runs asynchronously and never blocks live traffic.

What’s on the report

  • Overall Ethics Score — a 0–100 gauge; scores below 60 on any evaluator flag the trace
  • Traces Sampled — how many traces were evaluated, out of the total for that period
  • Flagged Traces — count of flagged traces, with high-severity findings broken out separately
  • Agents Evaluated — how many agents in the project were covered
  • 30-Day Trend — a rolling chart with reference lines at 80 (good) and 60 (risk)
  • Evaluator Breakdown — scores per evaluator (toxicity, bias, etc.), clickable to filter the flagged-traces table
  • Agent Ethics Scores — per-agent comparison; agents scoring below 60 are called out for review

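The report's thresholds work as simple cut-offs. The Python sketch below places a score relative to the 30-day trend chart's reference lines and picks out agents scoring below 60. It makes a few assumptions that the report does not spell out. A score of exactly 80 counts as good, and a score of exactly 60 does not count as risk. The name "watch" for the middle band is made up, and so are the functions `trend_band` and `agents_needing_review`. None of these are part of Gegentic.

```python
from __future__ import annotations

# Hypothetical sketch, not Gegentic's implementation.
GOOD_LINE = 80  # "good" reference line on the 30-day trend chart
RISK_LINE = 60  # "risk" reference line; also the agent call-out threshold


def trend_band(score: float) -> str:
    """Place a 0-100 ethics score relative to the chart's reference lines."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= GOOD_LINE:  # assumption: exactly 80 counts as good
        return "good"
    if score < RISK_LINE:  # "below 60" is risk, so exactly 60 is not
        return "risk"
    return "watch"  # hypothetical label for the band between the lines


def agents_needing_review(agent_scores: dict[str, float]) -> list[str]:
    """Return agents scoring below 60, worst score first."""
    below = [(score, agent) for agent, score in agent_scores.items() if score < RISK_LINE]
    return [agent for _, agent in sorted(below)]
```

For example, `agents_needing_review({"a": 72, "b": 41, "c": 58})` returns `["b", "c"]`. Agent "a" sits between the two reference lines and is not called out.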
Methodology

  • Sampling runs daily at 03:00 UTC against roughly 17% of the prior day’s traces
  • Six LLM-judge evaluators score each sampled trace on a 0–100 scale
  • A score below 60 on any evaluator flags the trace for review
  • Evaluation is fully asynchronous — it never adds latency to or blocks production traffic
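Here is a minimal Python sketch of the sampling and flagging steps. It assumes uniform random sampling over an in-memory list of trace IDs, because the actual sampling strategy isn't documented here. The names `ScoredTrace` and `sample_traces`, and the snake_case evaluator keys, are illustrative and not Gegentic's API. The 03:00 UTC schedule and the LLM judges themselves are out of scope. The sketch only shows the roughly 17% sample rate and the rule that any single evaluator below 60 flags a trace.

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch, not Gegentic's implementation.
# Keys mirror the six criteria listed on this page.
EVALUATORS = (
    "toxicity",
    "demographic_bias",
    "sentiment_consistency",
    "fairness",
    "misinformation",
    "privacy_leakage",
)
SAMPLE_RATE = 0.17  # "roughly 17%" of the prior day's traces
FLAG_THRESHOLD = 60  # a score below 60 on any evaluator flags the trace


@dataclass
class ScoredTrace:
    trace_id: str
    scores: dict[str, int]  # evaluator name -> 0-100 score

    @property
    def flagged(self) -> bool:
        # One evaluator under the threshold is enough to flag the trace.
        return any(score < FLAG_THRESHOLD for score in self.scores.values())


def sample_traces(
    trace_ids: list[str], rate: float = SAMPLE_RATE, seed: Optional[int] = None
) -> list[str]:
    """Draw a uniform random sample of about `rate` of the traces (assumed strategy)."""
    rng = random.Random(seed)
    k = round(len(trace_ids) * rate)
    return rng.sample(trace_ids, k)
```

For instance, a trace that scores 90 on five evaluators and 55 on `privacy_leakage` is flagged. A trace that scores exactly 60 everywhere is not.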