A Reddit user is seeking advice on upgrading their local large language model setup, weighing inference speed against the model's general knowledge.

  • The user currently runs Qwen3.6 35B as their primary assistant and coding agent on a Strix Halo device.
  • They report achieving approximately 30-40 tokens per second with a 131k context window.
  • The user feels the current model lacks basic general knowledge and behaves more like a task executor than an assistant.
  • To address this, they are considering switching to the larger Qwen3.5 122B model while trying to maintain acceptable speed.