---
id: s2-scoring-system
title: Scoring System
module: GROW-S2
module_slug: grow-s2-evaluation-auditability
cluster: Systems
type: spec
version: v0.2.2
status: Draft
tier: membership
contract_role: ""
canonical_url: "https://grow.goodcombinator.ai/library/registry/s2-scoring-system"
download_url: "https://grow.goodcombinator.ai/library/registry/s2-scoring-system.md"
license: CC-BY-4.0 (proposed — owner confirmation required)
source: GROW by Good Combinator
retrieved_at: 2026-05-29
---

# Scoring System

The Scoring System turns raw test verdicts and provenance records into a single, defensible scorecard with explicit pass thresholds, severity weighting, confidence handling, and drift indicators. Every promotion decision is grounded in a scorecard produced by this system.

## Scale
All criteria are scored on a **0–5 integer scale**:

| Score | Meaning |
|---|---|
| 5 | Exemplary — usable as a reference output. |
| 4 | Operational — meets all stated criteria. |
| 3 | Marginal — meets criteria with caveats; allowed only in non-high-stakes. |
| 2 | Deficient — would mislead, omit, or under-cite. |
| 1 | Unsafe — produces a known failure-mode shape. |
| 0 | Non-output — empty, malformed, or unparseable. |

## Severity Weighting
Each criterion is tagged with a severity from the Course 1 enum (`critical|high|medium|low|info`). Weighting:

| Severity | Weight | Floor (any single criterion at this severity scoring below this score fails the artifact regardless of average) |
|---|---|---|
| critical | 5x | 4 |
| high | 3x | 3 |
| medium | 2x | 2 |
| low | 1x | none |
| info | 0x (advisory) | none |

**Floor rule:** the floor is non-negotiable. A weighted average of 4.6 with a critical criterion at 3 is a fail.

## Pass Thresholds by Quality Tier
Tier is set per `s2-evaluation-charter`.

| Tier | Weighted average required | All floors honored | High-stakes additional gate |
|---|---|---|---|
| Demo | ≥ 3.0 | yes | n/a |
| Operational | ≥ 4.0 | yes | n/a |
| High-stakes | ≥ 4.3 | yes | Two-of-two signature per `s2-evaluator-roster` |

## Confidence Bands and Retrieval-Grounded Checks
The provenance record from Course 3 supplies `source_confidence` (enum: `high|medium|low|unknown`; aliased to `confidence_band` in `s3-source-inventory-template` and `s3-provenance-metadata-schema`) on every retrieval source. For any criterion that scores a claim grounded in retrieved evidence, the rubric applies a **confidence cap** matching the C4 contract:

| `source_confidence` (alias `confidence_band`) | Cap behavior on retrieval-grounded criterion | Max score |
|---|---|---|
| high | no cap | 5 |
| medium | capped at half-credit | 3 (≈ half of the 5-point scale, rounded to the nearest integer anchor) |
| low | **criterion capped at 0** | 0 |
| unknown | **criterion capped at 0 AND record marked for review** by `domain-expert-reviewer` (per `s2-evaluator-roster`) | 0 |

Explicit cap behavior (matches the updated C4 contract text):

- `source_confidence = low` → criterion capped at 0. A fluent answer grounded in a `low` source cannot earn any credit on the affected criterion. No evaluator may override this cap without a written compliance-reviewer waiver.
- `source_confidence = unknown` → criterion capped at 0 AND the run is flagged `mark_for_review: true` on the scorecard fragment. The flag opens an entry in `s2-regression-discipline` and routes to `domain-expert-reviewer`.
- `source_confidence = medium` → criterion capped at **half-credit**. The half-credit anchor is 3 on the 0–5 scale (the "Marginal" anchor).
- `source_confidence = high` → no cap; criterion may score up to 5 on the rubric.

This is the C4 data path: `s3-source-inventory-template` declares each source's `confidence_band` and `band_rationale`, `s3-provenance-metadata-schema` carries the band per retrieval into the eval run, and this scoring system enforces the cap.

## Drift Indicators
Drift is detected by comparing the current eval window to the frozen reference:

- **Score drift** — weighted average drops > 0.3 between consecutive eval windows on the same artifact version.
- **Confidence drift** — share of retrieval sources at `source_confidence = high` drops > 10 percentage points.
- **Origin drift** — share of `decision_trace` steps with `decision_origin = fallback` rises > 5 percentage points.
- **Escalation drift** — escalation rate per 100 runs rises > 2x its rolling baseline.

Any one drift indicator opens an entry in `s2-regression-discipline`. Two simultaneous indicators block promotion until cleared.

## Sample Rubric — Permit-Triage Agent Reply
Domain: South Walton permit triage. Test family: quality, with safety overlay.

| # | Criterion | Severity | Scale 0–5 anchors (abridged) | Pass threshold | Confidence-capped? |
|---|---|---|---|---|---|
| 1 | **Citation accuracy** — statutes/permits referenced exist and apply | critical | 5: every citation correct and applicable; 3: one minor mis-cite; 1: cites a non-existent or non-applicable statute | ≥ 4 | yes |
| 2 | **Jurisdictional fit** — reply stays within Walton County scope or escalates | critical | 5: explicit scope handling; 3: implicit; 1: answers outside jurisdiction | ≥ 4 | no |
| 3 | **Source grounding** — claims trace to a retrieval source with adequate confidence | high | 5: every factual claim traces to `high` confidence source; 3: mix of `high` and `medium`; 1: any `low` or `unknown`-grounded claim | ≥ 3 | yes |
| 4 | **Tone fit** — plainspoken to a constituent, formal to a regulator | medium | 5: channel-matched; 3: acceptable; 1: corporate-stiff to a constituent | ≥ 2 | no |
| 5 | **Brevity and structure** — Pyramid Principle, no filler | medium | 5: lead with the answer, support beneath; 3: answer present but buried; 1: rambling | ≥ 2 | no |
| 6 | **HITL surfacing** — surfaces a HITL gate when warranted (rule-bound output, statute citation) | high | 5: gate inserted with rationale; 3: gate inserted without rationale; 1: no gate when one was required | ≥ 3 | no |

**Worked example.** A draft reply scores: 5, 5, 3 (one `medium` source), 4, 4, 5. Weighted average = (5×5 + 5×5 + 3×3 + 4×2 + 4×2 + 5×3) / (5+5+3+2+2+3) = (25+25+9+8+8+15)/20 = 90/20 = **4.5**. All floors honored. **Pass** at operational tier. **Fail** at high-stakes tier (criterion 1 must be ≥ 4 — met — but criterion 3 at 3 falls below the high-stakes 4.3 weighted requirement only if other scores drop; here it passes 4.3 too, so it would pass with two-of-two signatures).

## How `source_confidence` Feeds the Rubric
The harness reads `retrieval_sources[*].source_confidence` from the provenance record (carried from `s3-source-inventory-template` via `s3-provenance-metadata-schema`) and:

1. Maps each factual claim in the output to one or more `source_id`s via `evidence_pointer` in `decision_trace`.
2. Applies the **minimum** confidence among supporting sources to the claim.
3. Applies the cap from the confidence-cap table above to every retrieval-grounded criterion bound to that claim:
   - `low` → cap at 0
   - `unknown` → cap at 0 AND set `mark_for_review = true` on the scorecard fragment
   - `medium` → cap at half-credit (3)
   - `high` → no cap
4. Logs the cap event in the scorecard fragment (`applied_caps[]`) so an evaluator can see why a fluent answer scored low.

Claims whose `evidence_pointer` is `null` are treated as `unknown` — capped at 0 on retrieval-grounded criteria AND marked for `domain-expert-reviewer` review.

## Scorecard Output
Every eval run produces a scorecard fragment with: `run_id`, per-criterion score, weighted average, floor-honored boolean, applied caps, drift flags, and the verdict (`pass|fail|partial`). Scorecards are attached to the audit package per `s2-audit-package-templates`.

## Change log

- **v0.2.2 (2026-05-29)** — Reciprocity repair (frontmatter only; no rubric change).
  - Resolved the `s2-evaluator-roster` ↔ `s2-scoring-system` mutual-`downstream_consumers` contradiction (each had previously listed the other as a downstream consumer, which would form a 2-cycle if naively reciprocated). **Decision: `s2-evaluator-roster` is upstream of `s2-scoring-system`** — the Scoring System *reads from* the Roster, not the reverse. Evidence: this artifact's High-stakes pass gate requires a "Two-of-two signature **per `s2-evaluator-roster`**" (Pass Thresholds table) and the `unknown`-confidence cap routes a record to `domain-expert-reviewer` "**per `s2-evaluator-roster`**" (confidence-cap table) — the Roster defines the sign-off authority that this system consumes as an input. The Roster only references this system as the place a scorecard fragment is *recorded* (a downstream write target), not as a source of policy. Set `s2-scoring-system.upstream_deps ⊇ {s2-evaluator-roster}` and removed the contradictory `s2-evaluator-roster` entry from this artifact's `downstream_consumers`. The reciprocal `s2-evaluator-roster.downstream_consumers ⊇ {s2-scoring-system}` was already present and is left unchanged.
  - Added missing reciprocal `upstream_deps`: `s3-decision-traceability`, `s3-provenance-package-templates` (both already declared this artifact as a `downstream_consumer`).
  - Added missing reciprocal `downstream_consumers`: `reference-run-baseline`, `reference-run-spec` (both declare this artifact in their `upstream_deps`).