media r/LocalLLaMA · 4d ago · open_models

GLM 5.2 Local Inference Speeds Report

Users report that GLM 5.2, run locally with llama.cpp on 6x RTX 3090, 128GB DDR5, and an i7-13700K, generates 7.8 tokens/sec at a 90K context with Q8_0 quantization. Prompt processing runs at approximately 40 tokens/sec.
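
To put those rates in wall-clock terms, here is a rough back-of-the-envelope sketch. It assumes "90K" means 90,000 tokens, that the whole context is prompt, and that both rates stay constant throughout (in practice they usually vary with context length). The 500-token reply length and the helper name `seconds_to_process` are hypothetical, chosen only for illustration.

```python
def seconds_to_process(tokens: int, tokens_per_sec: float) -> float:
    """Seconds needed to handle `tokens` at a constant rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return tokens / tokens_per_sec


# Figures from the report above.
PROMPT_RATE = 40.0   # prompt processing, tokens/sec (approximate)
GEN_RATE = 7.8       # generation, tokens/sec

# Assumptions made for this sketch, not taken from the report.
PROMPT_TOKENS = 90_000  # reading "90K context" as a full 90,000-token prompt
REPLY_TOKENS = 500      # hypothetical reply length

prefill_s = seconds_to_process(PROMPT_TOKENS, PROMPT_RATE)
decode_s = seconds_to_process(REPLY_TOKENS, GEN_RATE)

print(f"prompt processing: {prefill_s / 60:.1f} min")  # 37.5 min
print(f"generation:        {decode_s:.0f} s")          # 64 s
```

Under these assumptions, ingesting a full 90K-token prompt takes far longer than generating the reply, so prompt processing rather than generation speed would be the bottleneck for long-context use on this setup.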

Importance 1/3 · r/LocalLLaMA · Zhipu AI · Inference efficiency
