Purpose
This rubric is for sampled review of real pilot cases. It is meant to assess the quality of AI-assisted work, not to evaluate employee worth.
How to use this rubric
- Review a small sample of cases each week.
- Use it on target pilot workflows only.
- Score the case on a 0-2 scale per category.
- Aggregate results at workflow or team level rather than turning the score into an individual leaderboard (see the aggregation sketch after this list).
- Capture a short note on what the case teaches, not just what it scored.
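For the aggregation step above, a minimal sketch in Python, assuming each case is logged as a dict with a "workflow" field and a "scores" dict mapping category to a 0-2 score (the field names are assumptions, not a prescribed schema):

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_workflow(cases):
    """Average each rubric category per workflow, not per person.

    `cases` is assumed to be a list of dicts shaped like
    {"workflow": "...", "scores": {"Workflow fit": 2, ...}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for case in cases:
        for category, score in case["scores"].items():
            grouped[case["workflow"]][category].append(score)
    return {
        workflow: {category: mean(values) for category, values in categories.items()}
        for workflow, categories in grouped.items()
    }
```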
Reviewer prerequisites
- The reviewer should understand the target workflow well enough to judge verification quality.
- The reviewer should know whether the task was familiar or unfamiliar, low or high risk, and easy or hard to verify.
- The reviewer should have enough context to see whether the output created cleanup later.
Sample size guidance
- 5-10 sampled cases per week across the pilot is usually enough for a small initial cohort.
- Bias the sample toward important workflows, unfamiliar work, and suspected failure modes.
Sampling pattern
If possible, aim for a mixed weekly sample (a sampling sketch follows below):
- 1-2 cases that appear strong
- 1-2 cases that appear mixed
- 1-2 cases that triggered concern, review burden, or soft-signal unease
Do not sample only the best-looking cases.
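As an illustration, a weekly stratified draw could look like this sketch, assuming each logged case carries a rough "apparent_quality" tag ("strong", "mixed", or "concern") from a quick first pass; the tag name and case-log shape are assumptions:

```python
import random

def weekly_sample(case_log, per_stratum=2, seed=None):
    """Draw a mixed weekly sample: up to `per_stratum` cases from each
    apparent-quality stratum, so the review never sees only the
    best-looking cases.
    """
    rng = random.Random(seed)
    sample = []
    for stratum in ("strong", "mixed", "concern"):
        pool = [c for c in case_log if c.get("apparent_quality") == stratum]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample
```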
Case metadata
- date
- role
- team
- workflow
- workflow mode: learning or delivery
- AI surface used: chat, IDE assistant, repo-aware assistant, other
- task familiarity: F1, F2, or F3
- task risk: R1, R2, or R3
- verification difficulty: low, medium, or high
- reviewer
- artifact or case link
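For teams that keep cases in a tracker or spreadsheet export, the metadata above might map onto a record like this sketch (field names and types are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class CaseRecord:
    date: str
    role: str
    team: str
    workflow: str
    workflow_mode: str            # "learning" or "delivery"
    ai_surface: str               # chat, IDE assistant, repo-aware assistant, other
    task_familiarity: str         # "F1", "F2", or "F3"
    task_risk: str                # "R1", "R2", or "R3"
    verification_difficulty: str  # "low", "medium", or "high"
    reviewer: str
    artifact_link: str
```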
Scoring scale
- 0: weak or unsafe
- 1: mixed or incomplete
- 2: strong and appropriate
Rubric categories
1. Workflow fit
Question:
Was AI used in a workflow-appropriate way for the task, role, and risk level?
- 0: AI use was poorly matched to the workflow or risk context
- 1: AI use was partly appropriate but included avoidable misuse or drift
- 2: AI use fit the workflow and risk context well
2. Explanation ability
Question:
Could the participant explain the AI-assisted output or reasoning in their own words?
- 0: could not explain key parts or relied on vague repetition
- 1: partial explanation with notable gaps
- 2: clear explanation of what happened, why it happened, and what could fail
3. Verification quality
Question:
Did the participant use appropriate checks before trusting or shipping the result?
- 0: weak or absent verification
- 1: some verification, but missing important checks
- 2: verification matched the task, risk, and likely failure modes
4. Risk handling
Question:
Did the participant handle privacy, security, policy, and verification difficulty appropriately?
- 0: ignored or mishandled key risk factors
- 1: recognized some risks but handled them inconsistently
- 2: handled major risks appropriately and escalated when needed
5. Downstream quality
Question:
Did the output create unnecessary rework, confusion, or review burden?
- 0: created significant downstream issues
- 1: mixed result with some cleanup or preventable burden
- 2: output was useful without obvious hidden cost
Optional notes
- strongest part of the case
- biggest concern
- what the team should learn from this example
- whether this should become a good-example or bad-example case in the enablement library
- what follow-up action is needed, if any
Immediate escalation override
Do not hide serious problems inside an average score.
Escalate immediately if any of the following holds (a sketch of this check follows the list):
- a high-risk case scores 0 on Verification quality
- a case shows a privacy, security, or policy exception
- the participant could not explain a high-impact output that was still trusted or shipped
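A minimal sketch of that override, assuming "R3" denotes high risk and that policy exceptions and unexplained-but-shipped outputs are logged as booleans (all assumptions about how a team records cases):

```python
def needs_escalation(task_risk, scores, policy_exception, unexplained_shipped):
    """Return True if the case must be escalated regardless of its total.

    task_risk: "R1", "R2", or "R3" ("R3" treated as high risk here)
    scores: dict mapping rubric category -> 0/1/2
    policy_exception: a privacy, security, or policy exception occurred
    unexplained_shipped: a high-impact output was trusted or shipped
        without the participant being able to explain it
    """
    if task_risk == "R3" and scores.get("Verification quality") == 0:
        return True
    return policy_exception or unexplained_shipped
```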
Interpretation
The total is the sum of the five category scores, giving a 0-10 range:
- 0-3: weak case, likely indicates misuse or major gaps
- 4-7: mixed case, usable for coaching and refinement
- 8-10: strong case, candidate for reuse as an example
Interpret totals cautiously.
The notes and red flags matter as much as the sum.
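To make the arithmetic concrete, a sketch that sums the five 0-2 category scores and maps the 0-10 total to a band (the band is a starting point; the notes and red flags still apply):

```python
def band(scores):
    """Sum the five category scores and return (total, band).

    >>> band({"Workflow fit": 2, "Explanation ability": 1,
    ...       "Verification quality": 2, "Risk handling": 2,
    ...       "Downstream quality": 1})
    (8, 'strong')
    """
    total = sum(scores.values())
    if total <= 3:
        return total, "weak"
    if total <= 7:
        return total, "mixed"
    return total, "strong"
```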
What not to do
- do not use the rubric as an employee rating system
- do not review every case
- do not score without enough context to understand the task and constraints
- do not reduce the review to throughput alone