All articles
arxiv arXiv cs.LG · 12h ago

Small Language Models Outperform Frontier LLMs in Relation Extraction

A fine-tuned 0.5B-parameter Qwen2.5 model achieves 0.83 micro-F1 in general-domain relation extraction, surpassing zero-shot GPT-5.4 and Claude Sonnet 4.6. On literary benchmarks, it reaches 0.92 on the Biographical dataset, outperforming GPT-5.4 and exceeding frontier models in accuracy, demonstrating that task-adapted small models can deliver high performance with minimal hardware and privacy overhead.

media r/LocalLLaMA · 12h ago

I reverse engineered Windows Copilot into a free OpenAI-compatible API

A user has created a local API that replicates OpenAI-compatible GPT-4 functionality using Microsoft's free Copilot service. The tool logs into a Microsoft account once, runs locally on a Windows device, and exposes a server at http://localhost:8000/v1 that supports streaming and multi-turn conversations without requiring an API key or billing. It is designed for personal and educational use, and available via GitHub at https://github.com/sums001/Windows-Copilot-API.

arxiv arXiv cs.AI · 13h ago

BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

BabelJudge introduces an open-source framework to measure four key bias modes in LLM judges across languages and agent trajectories. It reveals a significant reliability drop from Hindi to Swahili—0.714 to 0's 0.550—highlighting cross-lingual degradation invisible to raw accuracy. The framework enables bias-aware evaluation without human labels, using controlled perturbations to create known gold labels, and extends to agentic workflows with new metrics on tool accuracy and hallucination detection.

arxiv arXiv cs.AI · 13h ago

SAFER: Reliable Test-Time Adaptation under Adversarial Streams

SAFER is a training-free framework that enhances robustness of test-time adaptation by using reliability-guided augmentation. It generates stochastic augmentations, pools predictions via correlation-weighted aggregation with outlier detection, and includes adaptive mixing to preserve clean performance under adversarial attacks. Evaluations on PACS, VLCS, and OfficeHome show improved resilience without sacrificing clean accuracy.

arxiv arXiv cs.AI · 13h ago

Sparsity-Storage-Accuracy Tradeoff in Parsimoniously Activated Dictionary Learning

Parsimoniously activated dictionary learning (PADL) establishes a structured generative model with auxiliary latent variables, enabling maximum a posteriori estimation. This framework provides generalization guarantees and an analytical characterization of the tradeoff between sparsity, storage cost, and reconstruction accuracy, allowing data-driven hyperparameter estimation. The resulting algorithm achieves better reconstruction performance and accelerates inference in vision-language models.