
## Trigger

- Defining or reporting metrics for any research evaluation (benchmark, ablation,
  paper-facing result).
- Tempted to invent a proxy because the standard metric is inconvenient or unavailable.

## Fix

- Use the metric set that is standard in the **current SOTA papers** of that exact
  subfield. **Re-derive it from recent top-venue work each time** — do NOT hardcode a
  permanent list; the standard set evolves and better metrics keep appearing.
- Procedure (metric-agnostic, survives metric churn):
  1. Name the metric by pointing to at least one, ideally two, recent SOTA papers that
     use it. Can't cite one → it is probably custom → reconsider.
  2. Reuse the reproduction repo's existing metric implementation if present.
  3. Only build a new metric as a justified last resort, defended against the standard one.
- Illustrative examples ONLY, **as of 2026 (will change — re-check before using)**:
  LLM-safety attack-success via Llama-Guard / HarmBench / StrongREJECT; utility/fluency
  via perplexity / MMLU-retention / MT-Bench / a reward model (HelpSteer / QRM).
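
The three-step procedure above can be sketched as a small pre-report check. This is a minimal sketch, not an established tool: `MetricCandidate`, `vet_metric`, the field names, and the decision strings are all hypothetical, and the check only encodes the order of the steps (cite first, reuse second, custom last). It does not judge whether a cited paper really is SOTA; that still needs a human look at recent top-venue work.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetricCandidate:
    """Hypothetical record describing one metric under consideration."""
    name: str
    # Recent SOTA papers that report this metric (step 1).
    citing_papers: List[str] = field(default_factory=list)
    # Path to an existing implementation in the reproduction repo (step 2).
    repo_impl_path: Optional[str] = None
    # Written defense against the standard metric; custom metrics only (step 3).
    justification: str = ""


def vet_metric(m: MetricCandidate) -> str:
    """Apply the procedure steps in order and return a decision string."""
    # Step 1: no citable recent paper means the metric is probably custom.
    if not m.citing_papers:
        # Step 3: a custom metric is a last resort and needs a written defense.
        if m.justification.strip():
            return "custom-justified: defend against the standard metric"
        return "reconsider: no citation and no justification"
    # Step 2: prefer the implementation already in the reproduction repo.
    if m.repo_impl_path:
        return f"use-existing: {m.repo_impl_path}"
    return "standard: cited, but no repo implementation found"


if __name__ == "__main__":
    # Placeholder citations and path; substitute real ones before relying on this.
    cited = MetricCandidate(
        name="example-standard-metric",
        citing_papers=["<recent paper A>", "<recent paper B>"],
        repo_impl_path="eval/metrics.py",
    )
    invented = MetricCandidate(name="example-custom-proxy")
    print(vet_metric(cited))
    print(vet_metric(invented))
```

Running the check before writing up numbers makes the "can't cite one" case fail loudly instead of slipping into a results table.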

## Reuse Rule

- Load when planning any eval, or before reporting numbers as evidence.
- Origin: 2026-06 anchor-emergence text experiment. It invented `URc` (coherence-gated
  unsafe-and-coherent rate) and a `distinct-2 / degenerate%` fluency proxy; both were
  flagged as reinventing the wheel. Custom metrics are not comparable, reviewable, or
  trustworthy even when the qualitative result is right.
