- Authors: Shubham Sahai, Umair Ahmed, and Ben Leong
- Published: December 2023
- Source: Proceedings of the NeurIPS’23 Workshop on Generative AI for Education (GAIED)
- Location: New Orleans, USA
- Document Link: https://neurips.cc/virtual/2023/workshop/66547
This research demonstrates that while GPT-4 outperforms traditional Automated Program Repair (APR) tools in generating code fixes, it still misses a significant fraction of bugs and hallucinates issues that do not exist. The authors propose a new architecture that combines LLMs with an evaluation oracle to achieve near-perfect bug detection and high-quality feedback.
LLMs vs. APR: On a benchmark of 366 incorrect Python submissions from high-school students, the study finds that GPT-4 is more effective at repairing code than traditional symbolic APR techniques when guided by a proper evaluation oracle.
The Shortcoming Gap: Despite its power, a direct call to GPT-4 fails to detect approximately 16% of bugs, provides invalid feedback 8% of the time, and hallucinates in 5% of cases.
Proposed Architecture: To mitigate these errors, the researchers developed a system that uses an iterative prompt loop. If the first fix fails, the system feeds the test case results back to the LLM until a valid repair is found.
Superior Results: With this execution-feedback loop, the proposed system raises bug detection coverage to nearly 100%, ensuring that students do not receive misleading or incorrect guidance.
Key Contribution: The study shows that for AI to be a reliable tutor in programming education, it must be paired with rigorous execution-based verification (test cases) rather than relying on the LLM’s “intuition” alone.
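The loop described above can be sketched in a few lines: an execution-based oracle runs the candidate code against test cases, and any failures are fed back into the next LLM prompt until a repair passes or an iteration budget is exhausted. This is only an illustrative sketch under assumed interfaces; the names `run_oracle`, `repair_with_oracle`, and `query_llm` are hypothetical, not from the paper, and the stubbed "LLM" below stands in for a real GPT-4 call.

```python
# Sketch of an iterative repair loop with an execution-based oracle.
# All function names here are illustrative assumptions, not the paper's API.

def run_oracle(code, tests):
    """Execute candidate code and return descriptions of failing tests."""
    failures = []
    namespace = {}
    try:
        exec(code, namespace)  # define the student's function(s)
    except Exception as e:
        return [f"code raised {e!r}"]
    for expr, expected in tests:
        try:
            got = eval(expr, namespace)
            if got != expected:
                failures.append(f"{expr} returned {got!r}, expected {expected!r}")
        except Exception as e:
            failures.append(f"{expr} raised {e!r}")
    return failures

def repair_with_oracle(code, tests, query_llm, max_iters=3):
    """Alternate between oracle validation and LLM repair attempts."""
    for _ in range(max_iters):
        failures = run_oracle(code, tests)
        if not failures:
            return code  # oracle-verified fix: safe to show the student
        code = query_llm(code, failures)  # prompt includes the failing cases
    return None  # no verified repair: better to abstain than to mislead

# Demo with a stubbed "LLM" that returns a corrected version when
# shown the failing tests (a real system would call GPT-4 here).
buggy = "def double(x):\n    return x + x + 1\n"
fixed = "def double(x):\n    return x + x\n"

def stub_llm(code, failures):
    return fixed  # placeholder for an actual LLM repair call

result = repair_with_oracle(buggy, [("double(2)", 4), ("double(0)", 0)], stub_llm)
print(result == fixed)  # True: only the oracle-verified repair is returned
```

The key design point mirrors the paper's claim: the LLM's output is never trusted directly; a repair reaches the student only after it passes the test-case oracle, and a system that cannot produce a verified fix stays silent rather than hallucinating feedback.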