NL2Scratch: Executable Benchmark for NL-to-Scratch Generation

NL2Scratch introduces an executable benchmark of 311,648 parser-valid NL-program pairs derived from real Scratch projects. It proposes Semantic Alignment Consistency (SAC) to measure semantic agreement between a description and its program, validates 23,594 examples, and builds an 800-example slot-balanced diagnostic benchmark. Experiments show a substantial gap between lexical similarity and semantic alignment: models that achieve high token-level F1 often fail to reach perfect SAC, especially on longer examples.
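
To make the lexical side of that gap concrete, here is a minimal sketch of a token-level F1 computation in Python. It assumes plain whitespace tokenization and a made-up textual rendering of Scratch blocks. Neither is taken from NL2Scratch, whose tokenizer and program format are not described in this summary, and SAC itself is not implemented here.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between two strings (whitespace tokenization is an assumption)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings match perfectly; one empty string matches nothing.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical textual form of two Scratch scripts that differ in a single token.
reference = "when flag clicked move 10 steps turn right 15 degrees"
prediction = "when flag clicked move 10 steps turn left 15 degrees"

score = token_f1(prediction, reference)
print(round(score, 2))  # prints 0.9
```

The two scripts share nine of ten tokens, so F1 is 0.9, yet the sprite turns the opposite way when the program runs. A token-overlap score cannot register that behavioral difference, which is the kind of mismatch an execution-aware measure is meant to expose.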