This article introduces a cascaded multi-granularity pruning framework for deploying large language models on Industrial Internet of Things (IIoT) edge devices. The framework removes layers, attention heads, and feed-forward channels in coarse-to-fine order, and uses lightweight low-rank recovery between stages to re-estimate component importance. It targets the collapse of existing structured pruning methods at high compression ratios.
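The coarse-to-fine cascade described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the unit naming scheme (`L0`, `L0.h1`, `L0.c2`), the `score_fn` and `recover_fn` callbacks, and the per-stage `keep_ratios` are all hypothetical, and `recover_fn` merely stands in for the low-rank recovery step.

```python
from typing import Callable, Dict, List

# Hypothetical model layout: granularity name -> ids of surviving units.
Model = Dict[str, List[str]]
STAGES = ("layers", "heads", "channels")  # coarse-to-fine order


def prune_stage(units: List[str], scores: Dict[str, float], keep_ratio: float) -> List[str]:
    """Keep the highest-scoring fraction of units, preserving their original order."""
    n_keep = max(1, int(len(units) * keep_ratio))
    kept = set(sorted(units, key=lambda u: scores[u], reverse=True)[:n_keep])
    return [u for u in units if u in kept]


def drop_orphans(model: Model) -> Model:
    """Remove heads and channels whose parent layer has been pruned."""
    alive = set(model["layers"])
    out = dict(model)
    for g in ("heads", "channels"):
        out[g] = [u for u in model[g] if u.split(".")[0] in alive]
    return out


def cascaded_prune(
    model: Model,
    keep_ratios: Dict[str, float],
    score_fn: Callable[[Model, str], Dict[str, float]],
    recover_fn: Callable[[Model], Model],
) -> Model:
    """Prune one granularity at a time, re-scoring on the current model each stage."""
    model = {g: list(units) for g, units in model.items()}  # leave the input untouched
    for g in STAGES:
        scores = score_fn(model, g)  # importance is re-estimated at every stage
        model[g] = prune_stage(model[g], scores, keep_ratios[g])
        model = drop_orphans(model)
        model = recover_fn(model)  # placeholder for lightweight low-rank recovery
    return model
```

The point of the structure is that `score_fn` is called on the already pruned and recovered model at each stage, so scores for finer units reflect what the coarser stage removed rather than the original network.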
- The framework extends achievable compression to 13.8× on Multi-Head Attention (MHA)+GELU architectures while reaching 83.82% accuracy, 3.70 percentage points above the strongest baseline.
- An information-theoretic analysis formalizes the Structural Independence Assumption (SIA), revealing that MHA+GELU designs satisfy this condition while Grouped Query Attention (GQA)+SwiGLU designs violate it.
- Models that violate the SIA suffer an accuracy collapse of approximately 74 percentage points, highlighting the importance of architectural compatibility for pruning reliability.
- Deployment on an industrial slewing bearing fault diagnosis platform with NVIDIA DGX Spark reduced inference latency by up to 67.2% and peak memory by 62.5%.
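To put the reported figures on a common scale, the short calculation below converts them into equivalent factors. It is only arithmetic on the numbers quoted above; the helper name `factor_from_reduction` is hypothetical, the latency figure is an upper bound ("up to"), and treating 13.8× as a ratio of original to compressed size is an assumption.

```python
def factor_from_reduction(reduction_pct: float) -> float:
    """Convert 'reduced by X%' into an 'N times smaller' factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Up to 67.2% lower latency corresponds to at most about a 3.05x speedup.
latency_speedup = factor_from_reduction(67.2)

# 62.5% lower peak memory corresponds to about 2.67x less memory.
memory_factor = factor_from_reduction(62.5)

# 13.8x compression leaves roughly 7.2% of the original size (assumed size ratio).
remaining_fraction = 1.0 / 13.8

# A 3.70-point lead at 83.82% implies the strongest baseline sits near 80.12%.
baseline_accuracy = round(83.82 - 3.70, 2)

print(f"latency speedup   ~ {latency_speedup:.2f}x")
print(f"memory factor     ~ {memory_factor:.2f}x")
print(f"remaining size    ~ {remaining_fraction:.1%}")
print(f"baseline accuracy ~ {baseline_accuracy:.2f}%")
```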
The authors consider this significant because it demonstrates the viability of compressed models for IIoT edge inference, providing a checkable condition to predict pruning reliability based on specific architectural designs.