Citation
Shen, Judy Hanwen, and Alex Tamkin. How AI Impacts Skill Formation. arXiv:2601.20245, February 3, 2026.
Companion materials:
- Anthropic article: https://www.anthropic.com/research/AI-assistance-coding-skills
- Public repository: https://github.com/safety-research/how-ai-impacts-skill-formation
What it says
- This is a randomized between-subjects experiment on software developers learning an unfamiliar Python library, Trio, with or without AI assistance.
- The main study reports n = 52, with 26 participants in the AI condition and 26 in the control condition.
- The AI condition used a chat-based coding assistant backed by GPT-4o that could see the participant's code and generate the full solution when prompted.
- The paper reports a lower immediate post-task mastery score for the AI group: a 50% average quiz score versus 67% for the control group, a 17 percentage point gap with Cohen's d = 0.738 and p = 0.010 (a sketch of what recomputing this would require follows this list).
- The paper reports a small average speed improvement for the AI group, but says the completion-time difference was not statistically significant.
- The largest reported score gap was in debugging, with additional gaps in code reading and conceptual understanding.
- The authors argue that heavy delegation and AI-led debugging were associated with weaker learning outcomes, while explanation-seeking and conceptual inquiry were associated with better outcomes among AI users.
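The participant-level quiz scores behind the headline comparison are not public (see below), but the arithmetic itself is simple. Here is a minimal Python sketch of what recomputing the 50% versus 67% gap would look like if the 26 + 26 scores were released; the 0-100 score scale, the pooled-standard-deviation form of Cohen's d, and the two-sided Student's t-test are assumptions, not details confirmed by the paper.

```python
# Illustrative only: participant-level quiz scores are NOT in the public release.
# This shows what a reproduction of the reported gap, Cohen's d, and p-value
# would need. The 0-100 score scale and the exact test used are assumptions.
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of the two groups."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def headline_comparison(control_scores, ai_scores):
    """Group means, absolute gap in points, Cohen's d, and a two-sided t-test p-value."""
    _, p_value = stats.ttest_ind(control_scores, ai_scores)
    return {
        "control_mean": float(np.mean(control_scores)),
        "ai_mean": float(np.mean(ai_scores)),
        "gap_points": float(np.mean(control_scores) - np.mean(ai_scores)),
        "cohens_d": float(cohens_d(control_scores, ai_scores)),
        "p_value": float(p_value),
    }

# Usage, once (or if) per-participant scores become available:
# print(headline_comparison(control_scores=[...], ai_scores=[...]))
```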
What we could verify directly
- The appendix states that quiz grading was defined before the experiment in a pre-registration, but the public pre-registration link was not found during this review.
- The appendix states that annotated participant transcripts are publicly available, and that claim holds: the transcripts are in the GitHub repository.
- The public repository includes:
- the Trio task instructions
- starter and completed task code
- a post-task notebook with quiz material fragments
- 52 annotated participant transcript JSON files
- The public transcript files include participant condition, task duration, annotated events, active coding time, task completion, and AI query text when applicable.
- From the public transcript JSON files, the mean main-task time is about 23.16 minutes for the AI condition and 24.93 minutes for the No-AI condition, which is directionally consistent with the paper's claim of a small speed advantage for AI.
- From the public transcript JSON files, AI participants averaged about 5.48 AI queries, with a maximum of 15.
- From the public transcript JSON files, No-AI participants encountered more observed errors on average than AI participants, which is directionally consistent with the authors' interpretation that independent error resolution may have contributed to learning. (A sketch of how these transcript checks can be run appears after this list.)
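As a concrete reference for those checks, here is a minimal Python sketch of the kind of aggregation involved. The directory path, the condition labels, and the JSON key names (condition, task_duration_minutes, ai_queries) are assumptions about the released schema, not confirmed field names; anyone rerunning this should inspect one transcript file first and adjust the keys.

```python
# Sketch of the transcript checks above. Field names ("condition",
# "task_duration_minutes", "ai_queries") and the transcripts/ path are
# assumptions; inspect the repository's actual JSON schema before relying on this.
import json
from pathlib import Path
from statistics import mean

def load_transcripts(transcript_dir):
    """Read every annotated participant transcript JSON file in the directory."""
    return [json.loads(p.read_text()) for p in sorted(Path(transcript_dir).glob("*.json"))]

def summarize_by_condition(transcripts):
    """Mean main-task time per condition, plus AI query stats for the AI group."""
    by_condition = {}
    for t in transcripts:
        by_condition.setdefault(t["condition"], []).append(t)

    report = {
        cond: {
            "n": len(group),
            "mean_task_minutes": round(mean(t["task_duration_minutes"] for t in group), 2),
        }
        for cond, group in by_condition.items()
    }
    if "AI" in by_condition:
        query_counts = [len(t.get("ai_queries", [])) for t in by_condition["AI"]]
        report["AI"]["mean_ai_queries"] = round(mean(query_counts), 2)
        report["AI"]["max_ai_queries"] = max(query_counts)
    return report

# Usage (hypothetical local checkout path):
# print(summarize_by_condition(load_transcripts("how-ai-impacts-skill-formation/transcripts")))
```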
Public data availability
The closest thing to public raw data is the annotated transcript set in the GitHub repository. It is useful, but it is not the full raw experimental dataset.
Publicly available:
- Annotated transcripts for each participant in JSON form
- Task materials and code artifacts
- Partial survey or quiz material in the notebook
Not found publicly during this review:
- Raw quiz responses
- A participant-level quiz score table
- Raw keystroke logs
- Screen recordings
- The public pre-registration record
The paper says quiz responses were stored in Google Drive and coding keystrokes were stored in the coding platform, but those raw stores do not appear to be public.
Limitations and concerns
- The study measures immediate post-task understanding, not long-term retention or workplace mastery.
- The task uses one unfamiliar library, one chat-style assistant setup, and one short learning window, so external validity is limited.
- The paper's stronger causal claim is the group-level quiz difference; the AI interaction pattern analysis is much weaker because the pattern groups are very small.
- The public repository does not expose enough structured outcome data to reproduce the headline quiz effect independently.
- There is a documentation mismatch worth noting:
- the main study section reports 26 AI and 26 control participants
- the appendix says 51/52 participants uploaded recordings and that recordings for 25 AI and 25 No-AI participants were reviewed
- the public transcript repository currently contains 52 JSON files split as 25 AI and 27 No-AI, with participant 26 missing (a small counting sketch follows this list)
- Because of that mismatch, the public release should be treated as useful but not fully clean replication material.
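A short sketch of how that count check can be made mechanical, under the same schema assumptions as the earlier transcript sketch (a numeric participant ID in each filename and a condition field in each JSON file):

```python
# Tally public transcripts per condition and flag gaps in participant numbering.
# The "condition" key and a numeric participant ID in each filename are assumptions.
import json
import re
from pathlib import Path

def audit_transcripts(transcript_dir):
    """Return per-condition file counts and any missing participant numbers."""
    counts, ids = {}, set()
    for path in Path(transcript_dir).glob("*.json"):
        data = json.loads(path.read_text())
        counts[data["condition"]] = counts.get(data["condition"], 0) + 1
        match = re.search(r"(\d+)", path.stem)
        if match:
            ids.add(int(match.group(1)))
    missing = sorted(set(range(1, max(ids) + 1)) - ids) if ids else []
    return {"per_condition": counts, "missing_participant_ids": missing}

# Usage (hypothetical local checkout path):
# print(audit_transcripts("how-ai-impacts-skill-formation/transcripts"))
```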
Relevance to AI enablement
This source is highly relevant to the project because it directly supports several working assumptions in the Project Charter.
- It supports a comprehension-first rollout stance: using AI to understand code, concepts, and errors is safer than defaulting to code delegation.
- It supports capability-aware adoption: less experienced engineers may need explanation-first patterns and deliberate guardrails rather than unrestricted acceleration workflows.
- It supports the emphasis on paired engineering with AI rather than blind delegation.
- It supports measuring debugging and code-reading ability, not just speed, when evaluating adoption success.
- It supports the idea that enablement leaders should teach workflows, verification habits, and learning-preserving usage patterns rather than only granting tool access.
What not to overclaim
- This source does not prove that all AI coding assistance harms learning.
- This source does not justify banning AI for software engineering.
- This source does not establish long-term skill erosion.
- This source does not cleanly generalize to senior engineers, agentic coding systems, autocomplete-heavy tools, or familiar tasks.
Practical extraction for this project
- Prefer AI workflows that force or invite explanation, reflection, and independent reasoning.
- Treat debugging skill as a critical safeguard metric in AI rollout.
- Avoid presenting adoption success purely as speed gain.
- Use AI-free comprehension checks or review exercises when the goal is learning a new tool, framework, or internal platform.
- Teach engineers to ask "why", "how", and "what would happen if" questions instead of only requesting implementations.
Follow-up questions
- Can the authors release participant-level quiz scores or at least an anonymized aggregate score table?
- Is the pre-registration public, and does it include the grading rubric?
- Why does the public transcript release not line up cleanly with the participant counts reported in the paper and appendix?
- How much of the observed effect persists after days or weeks rather than minutes?