Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT

This article introduces a cascaded multi-granularity pruning framework for deploying large language models on Industrial Internet of Things (IIoT) edge devices. It removes layers, attention heads, and feed-forward channels in a coarse-to-fine order. Between stages, a lightweight low-rank recovery step is applied and component importance is re-estimated, which addresses the collapse of existing structured pruning methods at high compression ratios.
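The coarse-to-fine cascade described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `cascaded_prune`, `score`, `recover`, and `keep_ratio` are hypothetical, the importance scores are fixed toy values, and the recovery step is a placeholder callback rather than actual low-rank fine-tuning.

```python
from typing import Callable, Dict, List

# Coarse-to-fine order named in the text: layers, then heads, then channels.
STAGES = ["layer", "head", "channel"]

Kept = Dict[str, List[str]]


def cascaded_prune(
    components: Kept,
    score: Callable[[str, Kept], Dict[str, float]],
    keep_ratio: Dict[str, float],
    recover: Callable[[Kept], None],
) -> Kept:
    """Prune one granularity at a time, recovering between stages."""
    kept = {gran: list(ids) for gran, ids in components.items()}
    for i, gran in enumerate(STAGES):
        # Importance is estimated on the current, already-pruned model.
        scores = score(gran, kept)
        n_keep = max(1, int(len(kept[gran]) * keep_ratio[gran]))
        ranked = sorted(kept[gran], key=lambda c: scores[c], reverse=True)
        survivors = set(ranked[:n_keep])
        kept[gran] = [c for c in kept[gran] if c in survivors]
        if i < len(STAGES) - 1:
            # Stand-in for the lightweight low-rank recovery step.
            recover(kept)
    return kept


# Toy data (hypothetical): fixed scores instead of a real importance metric.
TOY_SCORES = {
    "L0": 0.9, "L1": 0.1, "L2": 0.5, "L3": 0.2,
    "H0": 0.3, "H1": 0.8, "H2": 0.7, "H3": 0.1,
    "C0": 0.4, "C1": 0.9, "C2": 0.2, "C3": 0.6,
    "C4": 0.1, "C5": 0.8, "C6": 0.3, "C7": 0.5,
}
recovery_calls: List[int] = []


def toy_score(gran: str, kept: Kept) -> Dict[str, float]:
    # A real scorer would recompute importance from the pruned model.
    return {c: TOY_SCORES[c] for c in kept[gran]}


def toy_recover(kept: Kept) -> None:
    # Record how many components remain when recovery would run.
    recovery_calls.append(sum(len(ids) for ids in kept.values()))


result = cascaded_prune(
    components={
        "layer": ["L0", "L1", "L2", "L3"],
        "head": ["H0", "H1", "H2", "H3"],
        "channel": ["C0", "C1", "C2", "C3", "C4", "C5", "C6", "C7"],
    },
    score=toy_score,
    keep_ratio={"layer": 0.5, "head": 0.5, "channel": 0.25},
    recover=toy_recover,
)
print(result)
print(recovery_calls)
```

The point of the structure is that scoring happens inside the loop, after the previous stage's removal and recovery, so finer-grained decisions are made against the model as it currently stands rather than against the original dense model.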

The framework extends achievable compression to 13.8× on Multi-Head Attention (MHA)+GELU architectures while reaching 83.82% accuracy, 3.70 percentage points above the strongest baseline.
An information-theoretic analysis formalizes the Structural Independence Assumption (SIA), revealing that MHA+GELU designs satisfy this condition while Grouped Query Attention (GQA)+SwiGLU designs violate it.
Models that violate the SIA suffer an accuracy collapse of approximately 74 percentage points, which highlights the importance of architectural compatibility for pruning reliability.
Deployment on an industrial slewing bearing fault diagnosis platform with NVIDIA DGX Spark reduced inference latency by up to 67.2% and peak memory by 62.5%.
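
The reported figures above can be restated in more familiar terms with some simple arithmetic. The sketch below assumes that a percentage reduction r corresponds to an improvement factor of 1 / (1 - r), and that the 13.8× compression ratio means original size divided by compressed size; the source does not spell out either convention, and the helper name `reduction_to_factor` is hypothetical. Because the latency figure is stated as "up to", the derived speedup is an upper bound.

```python
def reduction_to_factor(reduction_pct: float) -> float:
    """Convert a percentage reduction into an x-fold improvement factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Reported: latency reduced by up to 67.2%, peak memory by 62.5%.
latency_speedup = reduction_to_factor(67.2)  # about 3.05x at best
memory_factor = reduction_to_factor(62.5)    # about 2.67x smaller peak memory

# Reported: 83.82% accuracy, 3.70 points above the strongest baseline.
baseline_accuracy = round(83.82 - 3.70, 2)   # implied baseline: 80.12%

# Assumption: 13.8x compression = original size / compressed size.
remaining_fraction = 1.0 / 13.8              # about 7.2% of the original

print(f"latency speedup: up to {latency_speedup:.2f}x")
print(f"peak memory factor: {memory_factor:.2f}x")
print(f"implied baseline accuracy: {baseline_accuracy:.2f}%")
print(f"remaining size: {remaining_fraction:.1%}")
```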

The authors consider the work significant because it demonstrates the viability of compressed models for IIoT edge inference and provides a checkable condition for predicting, from a model's architectural design, whether pruning will be reliable.