This article introduces a cascaded multi-granularity pruning framework for deploying large language models on Industrial Internet of Things (IIoT) edge devices. The framework removes layers, attention heads, and feed-forward channels in coarse-to-fine order, and uses lightweight low-rank recovery between stages so that component importance can be re-estimated before the next stage. This targets the collapse of existing structured pruning methods at high compression ratios.
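The coarse-to-fine cascade described above can be illustrated with a minimal sketch. Only the stage order (layers, then heads, then channels) and the idea of recovery plus importance re-estimation between stages come from the summary. The function `cascaded_prune`, the callbacks `score_fn` and `recover_fn`, the keep ratios, and the toy scores are hypothetical stand-ins, since the summary does not say how importance is computed or how the low-rank recovery is implemented.

```python
from typing import Callable, Dict, List

# Coarse-to-fine stage order, as stated in the summary.
STAGES = ["layers", "heads", "channels"]

Kept = Dict[str, List[int]]


def cascaded_prune(
    components: Kept,
    score_fn: Callable[[str, Kept], Dict[int, float]],
    keep_ratio: Dict[str, float],
    recover_fn: Callable[[str, Kept], None],
) -> Kept:
    """Prune one granularity at a time, re-scoring before each stage.

    score_fn and recover_fn are hypothetical hooks: the first returns an
    importance score per remaining component, the second stands in for the
    lightweight low-rank recovery step run after each stage.
    """
    kept = {stage: list(ids) for stage, ids in components.items()}
    for stage in STAGES:
        # Importance is computed on the current (already pruned) state.
        scores = score_fn(stage, kept)
        n_keep = max(1, round(len(kept[stage]) * keep_ratio[stage]))
        ranked = sorted(kept[stage], key=lambda c: scores[c], reverse=True)
        kept[stage] = sorted(ranked[:n_keep])
        # Placeholder for recovery before the next, finer stage.
        recover_fn(stage, kept)
    return kept


# Toy usage: component ids double as importance scores (illustrative only).
components = {
    "layers": [0, 1, 2, 3],
    "heads": [0, 1, 2, 3, 4, 5, 6, 7],
    "channels": list(range(16)),
}
recovery_log: List[str] = []


def toy_score(stage: str, kept: Kept) -> Dict[int, float]:
    return {c: float(c) for c in kept[stage]}


def toy_recover(stage: str, kept: Kept) -> None:
    recovery_log.append(stage)


kept = cascaded_prune(
    components,
    toy_score,
    {"layers": 0.5, "heads": 0.5, "channels": 0.25},
    toy_recover,
)
print(kept)
print(recovery_log)
```

In a real setting the scoring hook would evaluate the partially pruned model rather than return fixed values; the sketch only shows where re-estimation and recovery sit in the loop.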

  • The framework extends achievable compression to 13.8× on Multi-Head Attention (MHA)+GELU architectures, where it reaches 83.82% accuracy, 3.70 percentage points above the strongest baseline.
  • An information-theoretic analysis formalizes the Structural Independence Assumption (SIA), revealing that MHA+GELU designs satisfy this condition while Grouped Query Attention (GQA)+SwiGLU designs violate it.
  • Models that violate the SIA suffer an accuracy collapse of approximately 74 percentage points, which shows that pruning reliability depends on architectural compatibility.
  • Deployment on an industrial slewing bearing fault diagnosis platform with NVIDIA DGX Spark reduced inference latency by up to 67.2% and peak memory by 62.5%.
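
To put the reported figures on a common scale, the short calculation below converts them into ratios. It assumes that the compression figure is the ratio of original to compressed size and that the latency and memory reductions are relative to the uncompressed model; the summary does not spell out either definition, and the helper names are illustrative.

```python
def retained_fraction(compression_ratio: float) -> float:
    """Fraction of the original size left after compression (assumed definition)."""
    return 1.0 / compression_ratio


def factor_from_reduction(reduction_pct: float) -> float:
    """Convert a percentage reduction into a 'times smaller/faster' factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


retained = retained_fraction(13.8)            # about 0.0725, i.e. roughly 7.2% retained
latency_factor = factor_from_reduction(67.2)  # about 3.05x lower latency
memory_factor = factor_from_reduction(62.5)   # about 2.67x lower peak memory
baseline_accuracy = round(83.82 - 3.70, 2)    # 80.12% for the strongest baseline

print(f"retained fraction: {retained:.4f}")
print(f"latency factor:    {latency_factor:.2f}x")
print(f"memory factor:     {memory_factor:.2f}x")
print(f"baseline accuracy: {baseline_accuracy:.2f}%")
```

The latency and memory factors are upper bounds, since the summary reports them as "up to" figures.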

The authors consider this significant because it demonstrates that compressed models are viable for IIoT edge inference and provides a checkable condition for predicting whether pruning will be reliable for a given architecture.