A new framework enables online reward-punishment learning without environment rewards, using only fixed-channel perceptual packets. Both value inference and policy optimization are accurate in the tested tasks: B_xi attains a balanced reward-sign accuracy of 0.952, and the overall policy reaches an optimal-action accuracy of 0.979. Both results outperform controls such as zero reward and shuffled targets.
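The two reported metrics can be illustrated with a minimal sketch. The source does not state how they are computed, so this assumes the standard definitions: balanced accuracy is the mean of per-class recall over the reward-sign classes, and optimal-action accuracy is the fraction of steps on which the chosen action matches the optimal one. The function names and the toy data are hypothetical and are not taken from the framework.

```python
from collections import defaultdict


def balanced_accuracy(true_signs, predicted_signs):
    """Mean of per-class recall over the classes present in true_signs.

    Assumed (standard) definition; the source does not spell it out.
    """
    if len(true_signs) != len(predicted_signs):
        raise ValueError("inputs must have the same length")
    if not true_signs:
        raise ValueError("inputs must be non-empty")

    totals = defaultdict(int)  # number of samples per true class
    hits = defaultdict(int)    # correctly predicted samples per true class
    for true, pred in zip(true_signs, predicted_signs):
        totals[true] += 1
        if true == pred:
            hits[true] += 1

    recalls = [hits[cls] / totals[cls] for cls in totals]
    return sum(recalls) / len(recalls)


def optimal_action_accuracy(chosen_actions, optimal_actions):
    """Fraction of steps where the chosen action equals the optimal action."""
    if len(chosen_actions) != len(optimal_actions):
        raise ValueError("inputs must have the same length")
    if not chosen_actions:
        raise ValueError("inputs must be non-empty")
    matches = sum(c == o for c, o in zip(chosen_actions, optimal_actions))
    return matches / len(chosen_actions)


# Hypothetical toy data: +1 marks reward, -1 marks punishment.
true_signs = [+1, +1, +1, +1, -1, -1]
predicted_signs = [+1, +1, +1, -1, -1, -1]
sign_score = balanced_accuracy(true_signs, predicted_signs)

chosen = [0, 1, 2, 1]
optimal = [0, 1, 2, 2]
action_score = optimal_action_accuracy(chosen, optimal)

print(f"balanced reward-sign accuracy: {sign_score:.3f}")
print(f"optimal-action accuracy: {action_score:.3f}")
```

Balanced accuracy is the natural choice when reward and punishment events are unequally frequent, because a predictor that always outputs the majority sign scores only 0.5 on two classes regardless of the class ratio.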