Skipping transformer blocks at runtime with llama.cpp

A fork of llama.cpp adds a --skip-layers flag that omits entire transformer blocks when the model is loaded, as an alternative or complement to quantization for fitting a model onto limited hardware.

The flag prunes the model at run time: the specified layers are never instantiated.
Because the choice of which blocks to skip strongly affects how well the model performs, the fork includes a selector mechanism for choosing them.
This lets users run models that would otherwise not fit in their device's memory.
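
The idea of never instantiating the skipped layers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the fork's code: the spec format ("3,7-9"), the helper names parse_skip_spec and plan_load, and the renumbering of the surviving blocks are all hypothetical, since the source does not describe the flag's syntax or internals. The "blk.N." tensor prefix follows the usual GGUF naming convention.

```python
import re

# GGUF-style per-block tensor names look like "blk.12.attn_q.weight".
BLOCK_RE = re.compile(r"^blk\.(\d+)\.(.+)$")


def parse_skip_spec(spec: str) -> set[int]:
    """Parse a hypothetical spec such as "3,7-9" into {3, 7, 8, 9}."""
    skipped: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"bad range: {part!r}")
            skipped.update(range(lo, hi + 1))
        else:
            skipped.add(int(part))
    return skipped


def plan_load(tensor_names: list[str], skipped: set[int]) -> dict[str, str]:
    """Map each tensor that should be loaded to its name in the shortened model.

    Tensors of skipped blocks are absent from the result, so a loader that
    follows this plan never allocates memory for them. Surviving blocks are
    renumbered so the stack stays contiguous (an assumption of this sketch).
    """
    present = {int(m.group(1)) for n in tensor_names if (m := BLOCK_RE.match(n))}
    kept_blocks = sorted(present - skipped)
    new_index = {old: new for new, old in enumerate(kept_blocks)}

    plan: dict[str, str] = {}
    for name in tensor_names:
        m = BLOCK_RE.match(name)
        if m is None:
            plan[name] = name  # not part of a block: embeddings, output head, ...
        elif int(m.group(1)) in new_index:
            plan[name] = f"blk.{new_index[int(m.group(1))]}.{m.group(2)}"
    return plan
```

With four blocks and the spec "1-2", the plan keeps blocks 0 and 3 and renames block 3 to block 1, while tensors outside the block stack pass through unchanged.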

The technique is a practical way to deploy larger language models on constrained hardware: it reduces memory requirements without retraining the model.
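
As a rough guide to the savings, weight memory shrinks roughly linearly with the number of blocks skipped. The sketch below assumes every block has the same size and counts weights only; the function name and all figures are made up for illustration and do not describe any particular model.

```python
def estimate_weight_bytes(
    n_blocks: int, bytes_per_block: int, other_bytes: int, n_skipped: int
) -> int:
    """Estimate weight memory after skipping n_skipped equally sized blocks.

    other_bytes covers everything outside the block stack (embeddings,
    output head). The KV cache and activations are not modelled here.
    """
    if not 0 <= n_skipped <= n_blocks:
        raise ValueError("n_skipped must be between 0 and n_blocks")
    return other_bytes + (n_blocks - n_skipped) * bytes_per_block


MIB = 1024 * 1024

# Hypothetical model: 32 blocks of 200 MiB each plus 500 MiB of other tensors.
full = estimate_weight_bytes(32, 200 * MIB, 500 * MIB, n_skipped=0)
pruned = estimate_weight_bytes(32, 200 * MIB, 500 * MIB, n_skipped=8)
print(full // MIB, pruned // MIB)  # prints "6900 5300"
```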