Research paper
arxiv arXiv cs.CL · 1d ago

Agon: Autonomous Research System via Prompt Economy

Agon is an autonomous research system that uses prompt economy to validate checkable claims in workflows, leaving judgment to human scientists. It operates across 444 iterations with minimal prompts and no human-written code, revealing a taxonomy of failures by severity, fixability, visibility, and capability locus. The system demonstrates scalability and advances research toward a paradigm where machines handle scale and humans guide judgment.

arxiv arXiv cs.CL · 1d ago

Decoherence as Defence in Quantum Neural Networks for Intrusion Detection

A rigorous N-qubit theory proves that depolarising noise in stochastic quantum neural networks contracts Pauli read-outs exponentially, enabling robust anomaly detection. On the NSL-KDD dataset, such noise achieves significant adversarial resilience without catastrophic collapse, outperforming noiseless models and classical detectors under FGSM and PGD attacks, with reduced robustness variance and a train-test gap reduction of approximately 0.01.

arxiv arXiv cs.CL · 1d ago

SURGELLM: Task-Aware Feature Gating with Class-Balanced Normalization

SURGELLM introduces a unified transformer framework with surgical feature gating, task-conditioned prefix tokens, and Instance-Weighted Normalization to address inductive bias mismatches, class imbalance, and lack of lexical knowledge integration. The IWN variant achieves macro-F1 of 0.940 across four tasks, outperforming baselines by 0.036 overall and 0.130 on authorship detection, with gains confirmed as lexical rather than parametric.

arxiv arXiv cs.CL · 1d ago

Transformer Models: Architectures, Applications, and Critical Assessment

This review presents a taxonomy of transformer-based language models across domain verticals, covering encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants. It evaluates post-2023 advancements like instruction tuning and mixture-of-experts scaling, and assesses model deployments in healthcare, finance, legal, education, customer service, creative writing, and scientific work, linking each to specific capabilities. The paper critically analyzes model architectures on four key deployment axes, quantifies parameter count versus energy cost, and examines how alignment methods, data provenance, and benchmark saturation define 'state of the art'.

arxiv arXiv cs.CL · 1d ago

UOL@IDEM Submits L1-Aware Vocabulary Prediction Model

UOL@IDEM presents a closed-track submission to BEA 2026, modeling vocabulary difficulty prediction as regression for Spanish, German, and Chinese. The system integrates multilingual contextual embeddings with engineered features like frequency and cognate similarity, achieving lower RMSE scores than baselines, with feature analysis highlighting frequency as the most stable predictor and contextual predictability as a key L1-sensitive signal.

arxiv arXiv cs.CL · 1d ago

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation

AdversaBench introduces an end-to-end red-teaming pipeline that generates adversarial prompts via five structured operators, evaluates target models, and confirms failures through a three-judge panel with meta-judge tiebreaker. Experiments on 45 seed prompts across reasoning, instruction-following, and tool use show every seed produces a confirmed failure, with operator effectiveness, failure iteration counts, judge agreement, and cross-model transferability revealing key patterns in LLM vulnerability.

arxiv arXiv cs.AI · 1d ago

MedLayXPlain: Benchmarking Expert-Lay Gap in Medical Vision-Language Models

MedLayXPlain introduces the first large-scale benchmark for medical lay language generation, featuring 122,789 region-grounded samples across eight imaging modalities. It evaluates medical vision-language models on expert-lay alignment using a hierarchical ontology system and a lightweight evaluator, revealing a systematic gap: expert-level performance in captioning coexists with significant degradation in lay language, while general-purpose models lack clinical precision.

arxiv arXiv cs.AI · 1d ago

QBioFusion-QSAR: Quantum Kernel Learning for Small-Data Ligand Classification

QBioFusion-QSAR integrates a quantum fidelity kernel with Morgan/Tanimoto fingerprints to improve ligand classification. On the PsychLight-A benchmark, QMKL increased accuracy and MCC compared to Morgan/Tanimoto alone, with improvements attributed to better predictions of molecules with activity cliffs, such as N-Me-5-HT and N-Me-tryptamine. Auditable analysis confirms localized quantum-kernel contributions in small-data settings.