- Authors: Shubham Sahai, Umair Ahmed, and Ben Leong
- Published: December 2023
- Source: Proceedings of the NeurIPS’23 Workshop on Generative AI for Education (GAIED)
- Location: New Orleans, USA
- Document Link: https://neurips.cc/virtual/2023/workshop/66547
This research demonstrates that while GPT-4 outperforms traditional Automated Program Repair (APR) tools in generating code fixes, it still misses a significant fraction of bugs and hallucinates issues that do not exist. The authors propose a new architecture that combines LLMs with an evaluation oracle to achieve near-perfect bug detection and high-quality feedback.
LLMs vs. APR: On a benchmark of 366 incorrect Python submissions from high-school students, the study finds that GPT-4 is more effective at repairing code than traditional symbolic APR techniques when guided by a proper evaluation oracle.
The Shortcoming Gap: Despite its power, a direct call to GPT-4 fails to detect approximately 16% of bugs, provides invalid feedback 8% of the time, and hallucinates in 5% of cases.
Proposed Architecture: To mitigate these errors, the researchers developed a system that uses an iterative prompt loop. If the first fix fails, the system feeds the test case results back to the LLM until a valid repair is found.
Superior Results: With this execution-feedback loop, the proposed system raises bug detection coverage to nearly 100%, ensuring that students do not receive misleading or incorrect guidance.
Key Contribution: The study shows that for AI to be a reliable tutor in programming education, it must be paired with rigorous execution-based verification (test cases) rather than relying on the LLM’s “intuition” alone.
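The loop described above can be sketched in a few lines: an execution-based oracle runs the candidate code against test cases, and any failures are fed back into the next LLM prompt until a repair passes or an iteration budget is exhausted. This is only an illustrative sketch under assumed interfaces; the names `run_oracle`, `repair_with_oracle`, and `query_llm` are hypothetical, not from the paper, and the stubbed "LLM" below stands in for a real GPT-4 call.

```python
# Sketch of an iterative repair loop with an execution-based oracle.
# All function names here are illustrative assumptions, not the paper's API.

def run_oracle(code, tests):
    """Execute candidate code and return descriptions of failing tests."""
    failures = []
    namespace = {}
    try:
        exec(code, namespace)  # define the student's function(s)
    except Exception as e:
        return [f"code raised {e!r}"]
    for expr, expected in tests:
        try:
            got = eval(expr, namespace)
            if got != expected:
                failures.append(f"{expr} returned {got!r}, expected {expected!r}")
        except Exception as e:
            failures.append(f"{expr} raised {e!r}")
    return failures

def repair_with_oracle(code, tests, query_llm, max_iters=3):
    """Alternate between oracle validation and LLM repair attempts."""
    for _ in range(max_iters):
        failures = run_oracle(code, tests)
        if not failures:
            return code  # oracle-verified fix: safe to show the student
        code = query_llm(code, failures)  # prompt includes the failing cases
    return None  # no verified repair: better to abstain than to mislead

# Demo with a stubbed "LLM" that returns a corrected version when
# shown the failing tests (a real system would call GPT-4 here).
buggy = "def double(x):\n    return x + x + 1\n"
fixed = "def double(x):\n    return x + x\n"

def stub_llm(code, failures):
    return fixed  # placeholder for an actual LLM repair call

result = repair_with_oracle(buggy, [("double(2)", 4), ("double(0)", 0)], stub_llm)
print(result == fixed)  # True: only the oracle-verified repair is returned
```

The key design point mirrors the paper's claim: the LLM's output is never trusted directly; a repair reaches the student only after it passes the test-case oracle, and a system that cannot produce a verified fix stays silent rather than hallucinating feedback.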