Benchmark results — korshunov.ai

Benchmarking Agentic Review Systems for AI-Assisted Research

A study evaluates four AI review systems across six language models, finding OpenAIReview with GPT-5.5 achieves 83.0% accuracy in matching paper quality to external signals and detects 71.6% of injected errors. Real user feedback shows positive sentiment, with a 1.44-to-1 vote ratio, though false positives and minor nitpicks remain common.
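A vote ratio is easier to read as a share of votes. The following is a minimal sketch, assuming the 1.44-to-1 figure compares positive to negative votes and that no other vote types are counted; the function name is illustrative and does not come from the study.

```python
def positive_share(ratio: float) -> float:
    """Convert a positive-to-negative vote ratio into the fraction of positive votes."""
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    # A ratio of r-to-1 means r positive votes for every 1 negative vote.
    return ratio / (ratio + 1.0)


share = positive_share(1.44)
print(f"{share:.1%}")  # 59.0%
```

Under that assumption, roughly six in ten votes were positive, which fits the summary's note that false positives and nitpicks remain common.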

arxiv arXiv cs.CL · 6d ago

JAMER: Project-Level Code Framework Dataset and Benchmark

JAMER introduces JamSet and JamBench, the first project-level game code dataset and benchmark on a professional game engine. Built from 8,133 verified Game Jam projects, it enables deterministic evaluation and reveals a capability cliff in AI models as project scale increases, with runtime pass rates dropping from 80.4% to 5.7%.

arxiv arXiv cs.CL · 6d ago

REDACT: Multilingual PII Benchmark with Systematic Control

REDACT introduces a systematically controlled multilingual benchmark for personally identifiable information detection, featuring 51 entity types, 4,127 surface-form patterns, and 25 languages. It evaluates five detectors across 1,000 records, revealing that rule-based models fail on high-stakes data while LLMs perform better, especially in high-sensitivity categories. A reference-free LLM assessment confirms sensitivity-tier assignment as the most challenging evaluation axis.

media r/LocalLLaMA · 6d ago

GLM-5.2 Outperforms GPT-5.5 in AA-Briefcase Evaluation

Artificial Analysis' new agentic knowledge work evaluation, AA-Briefcase, shows GLM-5.2 surpassing GPT-5.5 in performance. The benchmark assesses real-world task execution and reasoning capabilities in knowledge work scenarios.

arxiv arXiv cs.LG · 7d ago

Diffusion-Proof: First Framework for Diffusion LLMs in Formal Theorem Proving

Diffusion-Proof is the first framework to train and apply diffusion language models for formal theorem proving. It introduces dLLM-Prover-7B for whole-proof writing with long-range coherence and dLLM-Corrector-7B for local proof correction using bidirectional information. The framework outperforms auto-regressive LLM baselines by 1.61% on ProofNet-Test and 6.14% on MiniF2F-Test, and solves an IMO problem beyond the capability of DeepSeek-Prover-V2-7B.

arxiv arXiv cs.CL · 7d ago

Frustrated Synchronization Network Outperforms Transformers

The Frustrated Synchronization Network (FSN) achieves lower validation loss than a RoPE-SwiGLU transformer at every epoch on character-level text and code tasks. At one million parameters, FSN converges to a validation loss of 1.5953 ± 0.0014, outperforming the transformer's converged loss of 1.611. This advantage persists up to four million parameters, with ongoing evaluations beyond that scale.
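To put the one-million-parameter gap in perspective, here is a small sketch of the arithmetic. It assumes the ± 0.0014 is a one-standard-deviation spread across runs and notes that no spread is reported for the transformer, so the last figure is only a rough indication and not a significance test.

```python
# Values quoted in the summary above.
fsn_loss, fsn_std = 1.5953, 0.0014
transformer_loss = 1.611

gap = transformer_loss - fsn_loss       # absolute difference in validation loss
relative_gap = gap / transformer_loss   # difference as a fraction of the transformer loss
gap_in_std = gap / fsn_std              # difference in units of the FSN spread (assumed std)

print(f"absolute gap: {gap:.4f}")               # absolute gap: 0.0157
print(f"relative gap: {relative_gap:.2%}")      # relative gap: 0.97%
print(f"gap in FSN std units: {gap_in_std:.1f}")  # gap in FSN std units: 11.2
```

The difference is small in relative terms (under 1%) but large compared with the reported run-to-run variation of FSN.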

arxiv arXiv cs.CL · 7d ago

SenFlow: Advanced AI-Generated Text Detection in Hybrid Documents

SenFlow introduces a novel method for detecting AI-generated text in hybrid documents by modeling inter-sentence dependencies. It achieves state-of-the-art performance on MOSAIC, a benchmark of 16,000 documents from PubMed and XSum, with a 4.15 pp Macro-F1 gain on cross-domain transfer. SenFlow reveals that AI-generated content still exhibits generator-dependent sentence-length patterns, which sentence-level detectors can exploit despite perplexity filtering.

media r/LocalLLaMA · 8d ago

GLM-5.2 crosses 80% on Terminal-Bench

GLM-5.2 is the first open-weights model to achieve 80% accuracy on Terminal-Bench and outperforms all other available open models. It also surpasses Gemini, positioning it as a frontier-level model at a significantly lower cost.

media Don't Worry About the Vase · 2d ago

GLM-5.2 Is the New Best Open Model

GLM-5.2 achieves benchmark scores near frontier levels, matching Opus 4.7 in text-only tasks and ranking among the top open models on multiple tests. It is the strongest open model currently available, outperforming predecessors and rivals like GPT-5.5 and Fable, though it falls short on specialized benchmarks like anti-sycophancy and has limited vision capabilities.

media r/LocalLLaMA · 5d ago

GLM-5.2 is the new leading open weights model on the Artificial Analysis Intelligence Index

GLM-5.2 is now the top-ranked open-weights model on the Artificial Analysis Intelligence Index, reflecting its performance relative to other openly released models.

media r/LocalLLaMA · 5d ago

New Agentic Benchmark Released

Artificial Analysis has introduced a new agentic benchmark that evaluates large language models' ability to plan and execute tasks. Claude Fable and GLM-5.2 achieved top positions within their respective cohorts, demonstrating strong performance on this unsaturated benchmark.

media Latent Space · 5d ago

GLM-5.2 Passes Vibe Check, Outperforms GPT-5.5

GLM-5.2 has passed a 'vibe check' as a frontier open model, receiving praise from Jeremy Howard and outperforming GPT-5.5 in Artificial Analysis' new knowledge work benchmark. It also gained validation from the r/LocalLLaMA community, indicating strong real-world utility and performance.

arxiv arXiv cs.AI · 6d ago

QMFOL: Benchmarking LLM Reasoning with Controllable Logical Complexity

QMFOL is an automated framework that generates monadic first-order logic reasoning tasks with quantifiable complexity. It produces 2,880 benchmark instances across 960 configurations, evaluating six large reasoning models and two LLMs, showing performance degradation and increased computational cost as logical complexity rises.

arxiv arXiv cs.CL · 6d ago

CombEval: Benchmark for Combinatorial Counting in LLMs

CombEval is a dynamic benchmark that generates natural-language counting problems with verified answers using typed Cofola specifications. It evaluates 11 large language models and reveals persistent failures in handling ordered objects, indistinguishable elements, positional constraints, and nested dependencies, with errors rooted in constraint interpretation and counting principles.
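As an illustration of what a counting problem with a verified answer can look like, the sketch below brute-forces a small case that combines indistinguishable elements with a positional constraint, then checks it against a closed-form count. The problem, the function name, and the verification approach are assumptions for illustration; this is not CombEval's generator and does not use Cofola.

```python
from itertools import permutations
from math import factorial


def count_arrangements(items: str, forbidden_first: str) -> int:
    """Count distinct orderings of `items` that do not start with `forbidden_first`."""
    # A set removes duplicates caused by indistinguishable letters.
    distinct = set(permutations(items))
    return sum(1 for arrangement in distinct if arrangement[0] != forbidden_first)


# Problem: how many distinct arrangements of A, A, B, B, C do not start with C?
brute_force = count_arrangements("AABBC", "C")

# Closed form: all distinct arrangements minus those with C fixed in first place.
all_arrangements = factorial(5) // (factorial(2) * factorial(2))  # 30
c_first = factorial(4) // (factorial(2) * factorial(2))           # 6
closed_form = all_arrangements - c_first

assert brute_force == closed_form
print(brute_force)  # 24
```

Treating the two A's or the two B's as distinct would give 96 instead of 24, the kind of constraint-interpretation error the summary describes.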

arxiv arXiv cs.AI · 7d ago

Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images

A new benchmark evaluates AI-generated text-rich images across six domains, including commercial posters and receipts. It reveals significant domain-dependent performance and sensitivity to JPEG compression, highlighting the need for text- and layout-aware detection methods.

arxiv arXiv cs.CL · 7d ago

ForecastBench-Sim: Simulated World Forecasting Benchmark

ForecastBench-Sim is a simulated-world forecasting benchmark using Freeciv game rollouts. It enables continuous or binary forecasts at arbitrary horizons, with intervention worlds for causal questions and rare outcomes, and provides immediate, resolvable feedback for evaluating probabilistic reasoning in dynamic environments.
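Binary probabilistic forecasts that resolve immediately can be scored with a proper scoring rule. The sketch below uses the Brier score as one common choice; the summary does not say which scoring rule ForecastBench-Sim uses, so the metric, the function name, and the sample numbers are all assumptions for illustration.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical forecasts for three binary questions and their resolved outcomes.
forecasts = [0.9, 0.2, 0.5]
outcomes = [1, 0, 1]

score = brier_score(forecasts, outcomes)
print(f"{score:.3f}")  # 0.100
```

For reference, always forecasting 0.5 scores 0.25 under this rule, so lower values indicate forecasts that beat an uninformative baseline.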

arxiv arXiv cs.AI · 7d ago

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench introduces a benchmark to evaluate AI4Science safety by assessing models across 7 disciplines, 31 subdisciplines, and 10 risk dimensions. It evaluates both mainstream and science-oriented LLMs to identify specific gaps in risk recognition and avoidance within high-stakes scientific contexts.

media r/LocalLLaMA · 7d ago

SIQ-1 Qwen3.6 Achieves Strong Performance in Autoresearch and Benchmarking

The SIQ-1 model, trained using PPO with verifiable reward, outperforms GLM-5.2 and Qwen-350B on parameter-golf tasks, with outputs resembling Opus 4.8. It also beats NEX and GPT-5.5 on the bullshit-bench test. The model and GGUF version are available on Hugging Face, along with a ZeroGPU-compatible agent demo.

arxiv arXiv cs.LG · 8d ago

SCBoost: Reducing Learner Redundancy via Residual Orthogonalization

SCBoost introduces residual orthogonalization to eliminate learner redundancy in boosting. It uses Spectral Residual Projection and Covariance-Regularized Weighting to ensure each learner captures novel error components and reduces ensemble correlations. Theoretical analysis and experiments show improved accuracy and F1 scores on ten benchmark datasets.

media r/LocalLLaMA · 8d ago

GLM-5.2 Now First on Design Arena

GLM-5.2 now ranks first on Design Arena, ahead of the previously available Claude Fable 5. Claude Fable 5 is no longer available, which contributed to GLM-5.2 taking the top position.