A Reddit user plans to deploy a multi-GPU machine to serve coding and Hermes models and is looking for a tool that can swap between model configurations without manual intervention.

  • The user wants to switch among three configurations as needs change: two smaller models for less-intensive tasks, one large model split across multiple GPUs, or a larger coding-focused model.
  • They have evaluated llamaswap, LiteLLM, llamactl, and GPUStack, but found each lacking in some way: limited flexibility, an enterprise focus, or heavy tuning requirements.
  • The hardware is a Threadripper 3945WX with up to four RTX 3090s and ~128 GB of DDR4 RAM.
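
To make the swapping requirement concrete, here is a minimal sketch of how the three configurations in the list above could be written down as named profiles, with a helper that works out which models to stop and start when moving from one profile to another. Everything in it is an assumption for illustration: the profile names, the model labels, the GPU assignments, and the `plan_switch` helper are hypothetical and do not reflect the API of llamaswap, LiteLLM, llamactl, or GPUStack. It only plans the switch; actually launching or stopping a server is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelSlot:
    name: str               # hypothetical model label, not a real model ID
    gpus: tuple[int, ...]   # GPU indices this model would occupy


# Hypothetical profiles for a four-GPU box; the GPU split is an assumption.
PROFILES = {
    "dual_small": (ModelSlot("small-a", (0, 1)), ModelSlot("small-b", (2, 3))),
    "single_large": (ModelSlot("large", (0, 1, 2, 3)),),
    "coding": (ModelSlot("coder-large", (0, 1, 2, 3)),),
}


def validate(profile: tuple[ModelSlot, ...], gpu_count: int = 4) -> None:
    """Reject profiles that use a missing GPU or assign one GPU twice."""
    used: set[int] = set()
    for slot in profile:
        for gpu in slot.gpus:
            if not 0 <= gpu < gpu_count:
                raise ValueError(f"{slot.name}: GPU {gpu} does not exist")
            if gpu in used:
                raise ValueError(f"{slot.name}: GPU {gpu} is already assigned")
            used.add(gpu)


def plan_switch(current: Optional[str], target: str) -> tuple[list[str], list[str]]:
    """Return (models to stop, models to start) for a profile change."""
    validate(PROFILES[target])
    running = set(PROFILES[current]) if current else set()
    wanted = set(PROFILES[target])
    to_stop = sorted(slot.name for slot in running - wanted)
    to_start = sorted(slot.name for slot in wanted - running)
    return to_stop, to_start
```

For example, `plan_switch("dual_small", "coding")` reports that both small models must be stopped before the coding model is started, because every GPU changes hands. A tool that exposes this kind of switch through a single call is what would let an agent change profiles without a person stepping in.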

The user is asking the community to recommend tools that minimize manual intervention and let Hermes orchestrate the configuration swaps on its own.