Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment

This study benchmarks traditional machine learning methods against lightweight transformer architectures for binary fault detection across three public datasets, evaluating tradeoffs between accuracy, model size, and latency. The research assesses classification performance using F1-score and AUC, while also testing INT8 dynamic quantization and a two-stage adaptive inference pipeline to optimize deployment on resource-constrained hardware.
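To make the INT8 step concrete, the following is a minimal sketch of the arithmetic behind 8-bit weight quantization, assuming symmetric per-tensor scaling. It is not the study's implementation: the function names `quantize_int8` and `dequantize` and the example weights are hypothetical. In the usual meaning of dynamic quantization, weights are converted ahead of time in this way while activation scales are computed at inference time; only the weight side is shown here.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] with one shared scale.

    Hypothetical helper for illustration; symmetric per-tensor scaling is an
    assumption, not a detail taken from the study.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would work.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the INT8 codes."""
    return [q * scale for q in quantized]


# Made-up example weights, not values from any benchmarked model.
example_weights = [0.5, -1.27, 0.0, 1.27]
codes, scale = quantize_int8(example_weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(example_weights, restored))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale. The overall size reduction of a real model depends on which layers are quantized.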

Lightweight transformers matched traditional ML at 87.8% F1 on the C-MAPSS dataset, but at 100x the model size and 9000x the latency.
TinyBERT-4L was the most deployment-friendly transformer, at 55 MB and 18 ms CPU latency.
INT8 quantization reduced model size by 25% while maintaining an F1-score of 86.9%.
An adaptive inference pipeline achieved 87.6% F1 at 19.5 ms average latency by routing 97.9% of predictions through a quantized triage model.
Both traditional and transformer methods performed poorly on the severely imbalanced SECOM and UCI-PM datasets, so neither approach handles extreme class imbalance well.
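The two-stage adaptive inference idea in the findings above can be sketched as a confidence-gated cascade: a cheap triage model answers when it is confident, and only uncertain inputs are escalated to a larger model. This is an illustration under stated assumptions, not the study's pipeline. The names `cascade_predict` and `expected_latency_ms`, the `margin` threshold, the toy models, and every number in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A model here is any callable returning P(fault) for one input (assumption).
Model = Callable[[Sequence[float]], float]


@dataclass
class CascadeResult:
    label: int            # 1 = fault, 0 = normal
    used_fallback: bool   # True if the larger model was consulted


def cascade_predict(x: Sequence[float], triage: Model, full: Model,
                    margin: float = 0.3) -> CascadeResult:
    """Accept the triage prediction only when it is far enough from 0.5."""
    p = triage(x)
    if abs(p - 0.5) >= margin:
        return CascadeResult(label=int(p >= 0.5), used_fallback=False)
    # Triage is uncertain: pay the extra cost of the larger model.
    return CascadeResult(label=int(full(x) >= 0.5), used_fallback=True)


def expected_latency_ms(triage_ms: float, full_ms: float,
                        escalation_rate: float) -> float:
    """Average latency when every input runs triage and some also run full."""
    return triage_ms + escalation_rate * full_ms


# Toy stand-ins: each "model" just reads a precomputed probability.
toy_triage: Model = lambda x: x[0]
toy_full: Model = lambda x: x[1]

samples = [[0.95, 0.1], [0.55, 0.2], [0.1, 0.9]]
results = [cascade_predict(s, toy_triage, toy_full) for s in samples]
average_ms = expected_latency_ms(triage_ms=2.0, full_ms=20.0,
                                 escalation_rate=0.25)
```

In a cascade of this shape, average latency is governed by the escalation rate, so the confidence threshold trades accuracy on hard inputs against the cost of invoking the larger model more often.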

The findings inform the deployment of fault detection systems on edge devices: lightweight transformers can match the accuracy of traditional methods, but they need substantial optimization to offset their high resource costs.