AI-Assisted Feedback
Augmenting Teaching Assistants with AI for CS1 Programming Feedback
Should we replace TAs with AI, or give TAs better tools? We ran a 185-student trial to find out.
About This Project
There is a growing temptation to simply replace human Teaching Assistants with LLM-based chatbots. We wanted to test a different idea: what if we used AI to augment human TAs instead of replacing them?
We ran one of the first large-scale randomised trials of this kind with 185 CS1 undergraduates at IIT Kanpur during a live, graded programming lab. Students worked on C programming exercises involving pointers and linked lists, where misconceptions are common and feedback quality really matters. We built an automated assistant powered by GPT-4 Turbo that could generate line-level feedback for buggy student code, and integrated it into the students’ existing programming environment (Prutor) so they would not notice any change in their workflow.
Students were split into six experimental groups. Some received feedback directly from the AI (in either a default or Socratic style). Others received feedback from human TAs who had access to AI-drafted hints they could review, edit, or ignore. A third set worked with TAs who had no AI assistance at all.
The results were surprising. Students perceived the AI-augmented feedback as more accurate and helpful, but when we looked at actual performance, students who were helped by unassisted TAs completed the harder question faster and in fewer attempts. It turns out human TAs without AI tended to give shorter, more direct feedback (averaging 13 words vs. the AI’s 129), sometimes just pointing students at the immediate next step. This brevity, while less “impressive” to students, was more effective at getting them unstuck.
More concerning was a pattern of TA complacency. In 83% of cases, TAs added their own comments on top of the AI output, but not a single TA ever deleted AI-generated content, even though expert evaluation found hallucinations in roughly 9% of it. Only one TA across the entire experiment edited the AI text before sending it. The human-in-the-loop, in practice, was not catching the errors we expected it to filter out.
We also found that students could not reliably identify hallucinated feedback, and rated it just as positively as correct feedback. Expert reviewers, by contrast, spotted these issues easily and rated them lower. This gap between student satisfaction and actual feedback quality is a real risk for any institution considering AI-generated feedback at scale.
Research Questions
- Does giving TAs access to AI-generated feedback drafts actually make them more effective, or does it introduce new failure modes?
- How does the style of AI feedback (direct hints vs. Socratic prompts) affect student performance and TA efficiency?
- Can a human-in-the-loop reliably catch hallucinations and errors in LLM-generated programming feedback?
Methods
- Randomised Controlled Trial
- User Study (185 students, live graded lab)
- Qualitative Feedback Analysis
- Large Language Models (GPT-4 Turbo)
Framework
- Human-in-the-Loop AI Tutoring
Key Findings
- Students helped by unassisted TAs outperformed those receiving AI-augmented feedback on the harder question. Human TAs without AI gave shorter, more focused feedback (avg. 13 words vs. AI’s 129 words) that got students unstuck faster, sometimes by just pointing at the immediate next step.
- AI feedback style had no significant effect on outcomes: students receiving default-style and Socratic-style AI feedback had comparable completion rates (p = 0.85), though students preferred the more direct style.
- TA complacency was pervasive. In 83% of responses, TAs added comments on top of the AI output, but no TA ever deleted AI-generated content, and only one TA edited it, despite a 9% hallucination rate in the AI feedback.
- Students cannot reliably detect hallucinations: they rated hallucinated feedback just as positively as correct feedback. Expert reviewers easily identified these issues, revealing a dangerous gap between student satisfaction and actual feedback quality.
- Socratic-style AI feedback imposed a cognitive burden on TAs, causing them to take significantly longer to respond (p = 0.03) than TAs working without any AI support.
- Our AI agent achieved 87% precision with a hallucination rate of 8.7%, surprisingly comparable to the human TAs' 88% precision, which included one instance of a human TA hallucinating at the student's request!
Publications: Feasibility Study of Augmenting Teaching Assistants with AI for CS1 Programming Feedback
Abstract:
With the increasing adoption of Large Language Models (LLMs), there are proposals to replace human Teaching Assistants (TAs) with LLM-based AI agents for providing feedback to students. In this paper, we explore a new hybrid model where human TAs receive AI-generated feedback for CS1 programming exercises, which they can then review and modify as needed. We conducted a large-scale randomized intervention with 185 CS1 undergraduate students, comparing the efficacy of this hybrid approach against manual feedback and direct AI-generated feedback. Our initial hypothesis predicted that AI-augmented feedback would improve TA efficiency and increase the accuracy of guidance to students. However, our findings revealed mixed results. Although students perceived improvements in feedback quality, the hybrid model did not consistently translate to better student performance. We also observed complacency among some TAs who over-relied on LLM-generated feedback and failed to identify and correct inaccuracies. These results suggest that augmenting human tutors with AI may not always result in improved teaching outcomes, and further research is needed to ensure it is truly effective.

Fig. 2b from the paper. User interface for providing hints to struggling CS1 programming students. Depending on their group, feedback is routed through AI, TA, or hybrid (TA + AI)

Fig. 6 from the paper. Experts identified hallucinations and inaccuracies in feedback that students missed. Socratic-style feedback, by its nature, revealed the least but was also more cognitively demanding for TAs

Fig. 3b from the paper. Time taken and number of attempts across groups. TA-Manual students completed the harder question faster with fewer attempts than AI-augmented groups
Datasets
SIGCSE 2025 User Study Artifacts:
Experimental data from the 185-student randomised trial: anonymised data, feedback logs, student ratings, expert annotations, and performance metrics.
Code & Tools
SIGCSE 2025 User Study Artifacts:
Analysis scripts for the randomised controlled trial.
Team Members: Umair Z. Ahmed, Shubham Sahai
Collaborator: Amey Karkare (IIT Kanpur)
Lead PI: Ben Leong Wing Lup
Published: 2025
Conference: SIGCSE TS 2025

