Skipping transformer blocks at runtime with llama.cpp
A fork of llama.cpp adds a --skip-layers flag that lets users omit entire transformer blocks at load time, offering an alternative or complement to quantization for fitting models onto memory-limited hardware.
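To make the idea concrete, here is a minimal toy sketch in Python of omitting blocks at load time. It is not the fork's code, and it does not reproduce the flag's argument syntax, which is not described here. The names make_block, load_blocks, forward and the checkpoint dictionary are hypothetical. The sketch relies on one general property of transformer blocks: each block adds its output back onto its input, so the block's output has the same shape as its input and dropping a block leaves the remaining stack wired up correctly.

```python
from typing import Callable, Dict, Iterable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_block(scale: float) -> Block:
    """Toy residual block: returns x + scale * x.

    A stand-in for a real attention + MLP block. The only property that
    matters here is that input and output have the same shape.
    """
    def block(x: Vector) -> Vector:
        return [v + scale * v for v in x]
    return block


def load_blocks(checkpoint: Dict[int, float], skip: Iterable[int]) -> List[Block]:
    """Build only the blocks whose index is not in `skip`.

    Skipped blocks are never constructed, so in this toy model their
    parameters are never materialized.
    """
    skipped = set(skip)
    return [make_block(checkpoint[i]) for i in sorted(checkpoint) if i not in skipped]


def forward(blocks: List[Block], x: Vector) -> Vector:
    """Run the input through whatever blocks were loaded, in order."""
    for block in blocks:
        x = block(x)
    return x


# Hypothetical 4-block "checkpoint": block index -> toy parameter.
checkpoint = {0: 0.5, 1: 1.0, 2: 0.25, 3: 0.0}

full = load_blocks(checkpoint, skip=[])
pruned = load_blocks(checkpoint, skip=[1, 2])

print(len(full), forward(full, [1.0, 2.0]))
print(len(pruned), forward(pruned, [1.0, 2.0]))
```

The pruned stack still runs end to end but produces a different output than the full stack. That is the trade-off: fewer blocks to hold in memory, at some cost in output quality that depends on which blocks are dropped.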