Code generation
arxiv arXiv cs.CL · 1d ago

Match Task to Objective Framework for Encoder-Decoder Models

This study introduces the Match Task to Objective (MTO) framework to align pre-training and fine-tuning objectives with specific tasks. The framework enables automated, unsupervised data adaptation and delivers performance gains of over 120% in few-shot settings, outperforming baselines in both few-shot and full-dataset scenarios. It also enhances prompt-tuning by providing effective soft prompt engineering guidance.

arxiv arXiv cs.CL · 1d ago

Metis: Bridging Text and Code Memory for Self-Evolving Agents

Metis introduces a hierarchical dual-representation memory that combines text and code memory to improve self-evolving agents. It organizes experience into execution plans, facts, and pitfalls, crystallizing reusable plans into validated tools only when justified. Evaluated on AppWorld, Metis achieves up to 20.6% higher task accuracy and 22.8% lower execution cost than ReAct, with better overall balance across accuracy, efficiency, and memory cost.
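
The memory layout described above can be pictured with a toy sketch. This is not Metis's implementation: the class names, the `min_successes` threshold, and the `validate` callback are all hypothetical, chosen only to illustrate keeping text entries (plans, facts, pitfalls) next to code entries (tools), and promoting a plan to a tool only when reuse and validation justify it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Plan:
    """A text-form execution plan plus a reuse counter (hypothetical structure)."""
    name: str
    steps: List[str]
    successes: int = 0


@dataclass
class Memory:
    """Toy dual memory: text entries (plans, facts, pitfalls) and code entries (tools)."""
    plans: Dict[str, Plan] = field(default_factory=dict)
    facts: List[str] = field(default_factory=list)
    pitfalls: List[str] = field(default_factory=list)
    tools: Dict[str, Callable[..., object]] = field(default_factory=dict)

    def record_success(self, plan: Plan) -> None:
        # Store the plan on first use, then count every successful run.
        stored = self.plans.setdefault(plan.name, plan)
        stored.successes += 1

    def maybe_crystallize(
        self,
        name: str,
        tool: Callable[..., object],
        validate: Callable[[Callable[..., object]], bool],
        min_successes: int = 3,  # assumed threshold, not taken from the paper
    ) -> bool:
        """Promote a plan to a code tool only if it was reused enough and passes validation."""
        plan = self.plans.get(name)
        if plan is None or plan.successes < min_successes:
            return False
        if not validate(tool):
            return False
        self.tools[name] = tool
        return True
```

In this sketch a plan stays in text form until it has succeeded `min_successes` times, and the candidate tool is stored only if the caller-supplied `validate` check passes.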

media r/LocalLLaMA · 1d ago

650+ Apache-2.0 biomedical NER/de-ID models run 30-40x faster on Apple Silicon

A new open-source project offers 650+ Apache-2.0-licensed biomedical NER and de-identification models that run on-device via MLX. On a 3-year-old MacBook Pro with an M3 Max, clinical NER models run 30-40x faster than on PyTorch-CPU while producing identical fp32 outputs and entity results, a gain the project attributes to architectural efficiency on Apple Silicon. The models, including 434M biomedical NER and PII de-ID models, are publicly available on Hugging Face and GitHub, with code and methodology provided for full reproducibility.

arxiv arXiv cs.AI · 1d ago

Benchmark Evaluation of Small Language Models for Arabic NLP

A benchmark of 240 Arabic test items across eight domains and ten skills assesses twelve small language models in zero-shot settings. Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic, with performance linked more to Arabic alignment and instruction-following than model size. Common failure modes include prompt leakage, hallucination, and weak task adherence.

media r/LocalLLaMA · 1d ago

Tmax-27B Terminal Agent for Small GPUs with DPPO Training

Tmax-27B is a terminal agent based on Qwen3.6-27B and trained with DPPO (RL). It scores 43% on Terminal Bench 2.0 and 69% on TB Lite. To run on consumer GPUs, it is quantized to importance-matrix-calibrated GGUF files ranging from 2 to 5 bits per weight, with a grafted MTP head that enables speculative decoding. The IQ2_XS variant, at 8.5 GiB, achieves a 70% pass rate on agentic coding tasks, outperforming plain quantization and showing stable tool-call generation.
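
The relationship between file size and bits per weight can be checked with a little arithmetic. The sketch below assumes exactly 27e9 parameters (the "27B" in the name, rounded) and ignores file metadata, so the results are rough estimates rather than figures from the post; both function names are made up for illustration.

```python
GIB = 2**30  # bytes per GiB


def bits_per_weight(file_size_gib: float, n_params: float) -> float:
    """Average number of bits stored per parameter for a file of the given size."""
    return file_size_gib * GIB * 8 / n_params


def file_size_gib(bpw: float, n_params: float) -> float:
    """Approximate file size in GiB at a given average bits per weight."""
    return bpw * n_params / 8 / GIB


N_PARAMS = 27e9  # assumption: parameter count rounded from the model name

if __name__ == "__main__":
    print(f"8.5 GiB file: {bits_per_weight(8.5, N_PARAMS):.2f} bits per weight")
    for bpw in (2, 3, 4, 5):
        print(f"{bpw} bpw: about {file_size_gib(bpw, N_PARAMS):.1f} GiB")
```

Under these assumptions the 8.5 GiB file averages about 2.7 bits per weight, and the 2 to 5 bit range corresponds to roughly 6.3 to 15.7 GiB.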

media r/LocalLLaMA · 2d ago

New Qwen-27B IQ4_KS and IQ4_KS_KT Quantizations for ik_llama.cpp

Two new GGUF quantizations of Qwen3.6-27B have been released for ik_llama.cpp, targeting NVIDIA GPUs with 16 GB of VRAM. The first, Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf, improves logical reasoning at the cost of general knowledge and has a perplexity of 7.4131. The second, Qwen3.6-27B.i1-IQ4_KS_KT-attn_qkv-IQ4_KS.gguf, applies Trellis quantization (iq4_kt) selectively to tensors with near-Gaussian distributions and reaches a perplexity of 7.4091, indicating minimal performance degradation.
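
For readers comparing the two figures, perplexity is the exponential of the mean negative log-probability per token. The sketch below shows that standard definition and the relative gap between the two reported values. The evaluation text and context length behind the reported numbers are not stated in the post, and the helper names here are illustrative only.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_gap(a: float, b: float) -> float:
    """Signed difference of a from b, expressed as a fraction of b."""
    return (a - b) / b


if __name__ == "__main__":
    # A model that gives every token probability 0.25 has perplexity of about 4.
    print(perplexity([math.log(0.25)] * 8))
    # Gap between the two reported perplexities, as a percentage.
    print(f"{relative_gap(7.4131, 7.4091) * 100:.3f}%")
```

By this measure the two reported perplexities differ by roughly 0.05%.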