Evaluation & benchmarks
arXiv cs.CL · 1d ago

BehaviorBench Launches Benchmark for Behavioral AI Models

BehaviorBench introduces a comprehensive benchmark that evaluates foundation models on four behavioral science capabilities: behavior prediction, strategic decision-making, subject-trait inference, and knowledge application. It assesses models at both the individual and the distributional level. Behavioral foundation models such as Be.FM-1.5 achieve stronger distributional alignment than general-purpose models, which underscores the need for distributional evaluation in behavioral AI.
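To illustrate why individual-level and distributional-level scores can diverge, here is a minimal Python sketch. It is not BehaviorBench's scoring code: the function names, the toy data, and the choice of total variation distance as the distributional metric are all assumptions made for illustration.

```python
from collections import Counter


def individual_accuracy(predicted, observed):
    """Fraction of subjects whose predicted choice equals their observed choice."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def total_variation(predicted, observed):
    """Total variation distance between the two empirical choice distributions.

    0.0 means the population-level distributions match exactly; 1.0 means
    they share no mass. This metric is an illustrative assumption.
    """
    p_counts, o_counts = Counter(predicted), Counter(observed)
    n_p, n_o = len(predicted), len(observed)
    options = set(p_counts) | set(o_counts)
    return 0.5 * sum(abs(p_counts[k] / n_p - o_counts[k] / n_o) for k in options)


# Hypothetical toy data: every individual prediction is wrong, yet the
# predicted population split (half A, half B) matches the observed split.
observed = ["A", "B", "A", "B"]
predicted = ["B", "A", "B", "A"]

acc = individual_accuracy(predicted, observed)
tv = total_variation(predicted, observed)
print(f"individual accuracy = {acc}, total variation = {tv}")
```

In this toy case the two views disagree completely, which is the kind of gap a benchmark reporting only one level would miss.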

arXiv cs.CL · 1d ago

Dialogue to Discovery: Attribute-Aware Preference Elicitation

Dialogue to Discovery (D2D) is an attribute-oriented framework that improves conversational product search by dynamically guiding user interactions. It adapts query priorities and recommendation timing, achieving 22.2-29.9% higher target-finding accuracy, 6.6-16.1% lower abandonment, and 27.5% shorter conversations compared to existing methods, with user studies confirming improved satisfaction and efficiency.

arXiv cs.CL · 1d ago

Decoherence as Defence in Quantum Neural Networks for Intrusion Detection

A rigorous N-qubit theory proves that depolarising noise in stochastic quantum neural networks contracts Pauli read-outs exponentially, enabling robust anomaly detection. On the NSL-KDD dataset, models trained with such noise show significant adversarial resilience without catastrophic collapse: they outperform noiseless models and classical detectors under FGSM and PGD attacks, with lower robustness variance and a train-test gap reduced by approximately 0.01.
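The exponential contraction can be checked numerically in the simplest textbook setting. The sketch below is not the paper's N-qubit result: it assumes a single qubit, the common parametrisation rho -> (1 - p) * rho + p * I / 2 of the depolarising channel, and a Pauli-Z read-out. Under those assumptions, each application of the channel scales the expectation value by (1 - p), so d applications scale it by (1 - p)^d.

```python
def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def trace(a):
    """Trace of a 2x2 matrix."""
    return a[0][0] + a[1][1]


def depolarise(rho, p):
    """Single-qubit depolarising channel: rho -> (1 - p) * rho + p * I / 2."""
    return [[(1 - p) * rho[i][j] + (p / 2 if i == j else 0.0) for j in range(2)]
            for i in range(2)]


PAULI_Z = [[1, 0], [0, -1]]


def expect_z(rho):
    """Pauli-Z read-out Tr(Z rho)."""
    return trace(matmul(PAULI_Z, rho))


# Start in |0><0|, where the noiseless read-out is +1.
rho = [[1.0, 0.0], [0.0, 0.0]]
p = 0.1  # assumed noise strength, chosen only for illustration

for depth in range(1, 6):
    rho = depolarise(rho, p)
    print(depth, expect_z(rho), (1 - p) ** depth)
```

The two printed columns agree up to floating-point error, because the Pauli operators are traceless and the identity component of the channel output therefore contributes nothing to the read-out.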

arXiv cs.CL · 1d ago

SURGELLM: Task-Aware Feature Gating with Class-Balanced Normalization

SURGELLM introduces a unified transformer framework with surgical feature gating, task-conditioned prefix tokens, and Instance-Weighted Normalization (IWN) to address inductive bias mismatches, class imbalance, and the lack of lexical knowledge integration. The IWN variant achieves a macro-F1 of 0.940 across four tasks, outperforming baselines by 0.036 overall and by 0.130 on authorship detection, with the gains confirmed as lexical rather than parametric.
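For readers unfamiliar with the metric, macro-F1 averages per-class F1 scores with equal weight, so minority classes count as much as majority ones. The sketch below is a plain-Python illustration on hypothetical labels; it is not SURGELLM's evaluation code, and how the paper aggregates across its four tasks is not stated in this summary.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Convention assumed here: a class with no true or predicted
        # positives scores 0.0 rather than raising an error.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced data: a classifier that always predicts "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

Here accuracy is 0.8 while macro-F1 is 4/9, because the ignored minority class contributes an F1 of zero to the average.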

arXiv cs.CL · 1d ago

AVOC: Retrieval-Inspired Token Compression for Long-Form Audio-Video Understanding

AVOC enhances long-form audio-video understanding in omni-modal LLMs by introducing a learnable token compression module. It reframes token selection as a top-K retrieval problem, using relevance, importance, and diversity criteria to select compact, informative tokens, achieving state-of-the-art results on OmniVideoBench and LVOmniBench, and maintaining strong performance on one-hour audio-video needle-in-a-haystack tasks.
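The idea of selecting tokens by relevance, importance, and diversity can be sketched with a greedy top-K loop. This is a hand-written stand-in, not AVOC's module, which the summary describes as learnable: the cosine-based relevance, the externally supplied `importance` scores, the redundancy penalty, and the `diversity_weight` parameter are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_tokens(tokens, query, importance, k, diversity_weight=1.0):
    """Greedily pick k token indices.

    Each step scores a candidate as
        relevance to the query + importance - diversity_weight * redundancy,
    where redundancy is its highest similarity to any already-selected token.
    Returns the chosen indices in their original order.
    """
    selected = []
    candidates = list(range(len(tokens)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            relevance = cosine(tokens[i], query)
            return relevance + importance[i] - diversity_weight * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)


# Hypothetical 2-D token embeddings: tokens 0 and 1 are near-duplicates.
tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
query = [1.0, 0.2]
importance = [0.1, 0.1, 0.1]

print(select_tokens(tokens, query, importance, k=2, diversity_weight=0.0))
print(select_tokens(tokens, query, importance, k=2, diversity_weight=1.0))
```

With the diversity term switched off, the two near-duplicate tokens are both kept; with it on, the second slot goes to the dissimilar token instead, which is the trade-off a compression budget forces.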

arXiv cs.CL · 1d ago

Transformer Models: Architectures, Applications, and Critical Assessment

This review presents a taxonomy of transformer-based language models across domain verticals, covering encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants. It evaluates post-2023 advancements like instruction tuning and mixture-of-experts scaling, and assesses model deployments in healthcare, finance, legal, education, customer service, creative writing, and scientific work, linking each to specific capabilities. The paper critically analyzes model architectures on four key deployment axes, quantifies parameter count versus energy cost, and examines how alignment methods, data provenance, and benchmark saturation define 'state of the art'.

arXiv cs.CL · 1d ago

ComputeFHE: A Privacy-Preserving General-Purpose Computation Library

ComputeFHE is an open-source C++ library that enables privacy-preserving computation using the TFHE cryptosystem. It offers encrypted integer and fixed-point data types with arithmetic and logical operations, supporting both standard and optimized FHE-friendly ALU architectures. Experimental results show up to 3.9x performance improvements and reduced bootstrapping operations, with a simulation mode for testing and complexity analysis without cryptographic execution.

arXiv cs.CL · 1d ago

Age of LLM: Benchmark for LLM Reasoning and Diplomacy

Age of LLM introduces a turn-based 1v1 benchmark in which two LLMs compete on a 13x7 grid under fog of war, full diplomacy, and strict JSON reliability rules. The findings show that the nuclear rush dominates, that diplomacy is prolific but rarely succeeds, and that illegal actions reveal belief-tracking errors; the link between reliability and victory is weak. The corpus is small and unbalanced, so the results offer only a preliminary view of LLM reasoning under adversarial uncertainty.