A new framework named P4IR addresses the issue of hallucinated rules in large language model-based automated code compliance systems. This two-stage approach first employs supervised fine-tuning to instill domain knowledge into the model. It then utilizes Group Relative Policy Optimization to improve the accuracy of generated high-level code skeletons. The method achieved reductions of up to 23.8% in tree edit distance and 38.6% in token-level Levenshtein distance compared to supervised fine-tuning baselines. Comparative analysis shows that P4IR outperforms leading models like Claude Opus, GPT-5.2, and Qwen-3-Max in zero-shot settings. Additionally, the reinforcement learning stage produced a statistically significant reduction in false positives. This combination of techniques offers a path toward more reliable automated code compliance.