Safety & alignment
arxiv arXiv cs.CL · just now

TRACE: Lightweight Detection of Corpus Poisoning in RAG via Token Influence Attribution

Retrieval-Augmented Generation (RAG) systems are vulnerable to corpus poisoning attacks, in which malicious documents planted in the retrieval corpus manipulate model outputs. Existing detection methods often require auxiliary classifiers or additional LLM verification, which adds substantial computational overhead. To address this, the authors introduce TRACE, a lightweight framework that identifies poisoning by tracing answer-related tokens via influence attribution. The system first finds high-influence keywords that recur across retrieved documents and flags them as potential threats. It then runs a secondary verification step to confirm the influence of these tokens on model predictions. Experiments on three QA benchmarks and six LLMs show strong detection performance. TRACE also recovers the attacker-specified target answers during verification.
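The first stage described above, finding keywords that recur with high influence across retrieved documents, can be illustrated with a minimal sketch. It assumes per-token influence scores are already available for each document; the summary does not say how TRACE computes attribution, so the function names, `top_k`, and `min_docs` are hypothetical, and the secondary verification step is left out.

```python
from collections import Counter

def recurrent_keywords(doc_scores, top_k=1, min_docs=2):
    """Return tokens that rank in the top_k by influence in at least min_docs documents.

    doc_scores: list of dicts mapping token -> influence score (assumed precomputed).
    """
    counts = Counter()
    for scores in doc_scores:
        # Highest-influence tokens for this document.
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        counts.update(top)
    return sorted(tok for tok, c in counts.items() if c >= min_docs)

def flag_documents(doc_scores, keywords, top_k=1):
    """Return indices of documents whose top_k tokens include a recurrent keyword."""
    flagged = []
    for i, scores in enumerate(doc_scores):
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        if any(tok in keywords for tok in top):
            flagged.append(i)
    return flagged

# Toy example: two documents push the same high-influence token.
docs = [
    {"paris": 0.9, "capital": 0.3, "the": 0.1},
    {"paris": 0.8, "france": 0.4, "is": 0.1},
    {"berlin": 0.7, "germany": 0.5, "of": 0.1},
]
suspects = recurrent_keywords(docs)
flagged = flag_documents(docs, suspects)
```

In this toy setup a token is suspicious only because it dominates several documents at once. A real detector would still need the verification stage to separate poisoned answers from tokens that recur for benign reasons.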

arxiv arXiv cs.CL · just now

RAS: Measuring LLM Safety Through Refusal Alignment

The authors propose SafeVec, a white-box evaluation procedure that measures LLM safety from internal representations instead of generated outputs. The method extracts layer-wise refusal directions from a safety-aligned reference model and identifies stable layers where safe and unsafe behaviors are separable. It then scores target models by checking whether their hidden states align with these refusal directions on unsafe prompts. The resulting metric, RAS (Refusal Alignment Score), maps this alignment to a calibrated 0-100 safety score. Experiments across the Llama, Gemma, and Qwen families show that RAS separates aligned models from uncensored variants. The metric also tracks output-level attack success rates while being substantially faster than judge-based evaluations. These findings suggest that refusal alignment offers a compact and efficient signal for white-box safety assessment.
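The scoring idea above can be illustrated with a small sketch for a single layer. Two parts are assumptions rather than details from the summary: the refusal direction is taken as the normalized difference between mean hidden states on unsafe and safe prompts (a common choice), and the 0-100 mapping is a simple linear rescaling of cosine similarity, not the paper's calibration. All function names are hypothetical.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(unsafe_states, safe_states):
    """Unit vector from the safe mean to the unsafe mean (assumed definition)."""
    mu_unsafe = mean_vector(unsafe_states)
    mu_safe = mean_vector(safe_states)
    diff = [u - s for u, s in zip(mu_unsafe, mu_safe)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("safe and unsafe means coincide; no direction to extract")
    return [d / norm for d in diff]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def alignment_score(target_states, direction):
    """Map mean cosine alignment from [-1, 1] to [0, 100] (illustrative, uncalibrated)."""
    mean_cos = sum(cosine(h, direction) for h in target_states) / len(target_states)
    return 50.0 * (mean_cos + 1.0)

# Toy 2-D hidden states from a reference model at one layer.
direction = refusal_direction([[1.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
aligned = alignment_score([[2.0, 0.0]], direction)      # points along the direction
opposed = alignment_score([[-1.0, 0.0]], direction)     # points against it
orthogonal = alignment_score([[0.0, 3.0]], direction)   # unrelated to it
```

A target model whose hidden states on unsafe prompts point along the reference direction scores near 100, and one whose states point away scores near 0. Because only forward-pass activations are needed, no text has to be generated or judged.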

arxiv arXiv cs.CL · just now

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

This study evaluates whether fine-tuned ModernBERT encoder classifiers can serve as cost-effective alternatives to LLM-based judges for safety evaluation. The researchers benchmarked ModernBERT and Ettin against rule-based prefix matching, fine-tuned LLM classifiers, and a range of LLM judge methodologies. The LLM judges included strategies from StrongReject, ShieldGemma, JailbreakBench, AILuminate, SorryBench, and Claude-as-a-judge, as well as guard models such as Llama Guard 3 and 4. The encoder classifiers were trained on judge-labeled data, with labels assigned by majority vote, and tested on a gold-standard holdout dataset. Performance was measured by F1 score, false negative rate, and precision-recall metrics across open-source adversarial datasets. Results were further broken down by attack technique, including single-turn prompting, decomposition, escalation, and context manipulation. The findings provide guidance on when encoder classifiers can reliably replace LLM-based judges without substantial performance loss.
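The labeling and scoring steps mentioned above can be sketched as follows. This is a minimal illustration, not the study's pipeline. It assumes binary labels where 1 means unsafe and 0 means safe, and it resolves tied votes to unsafe, which is a conservative choice made here for the example. The function names are hypothetical.

```python
from collections import Counter

def majority_label(votes):
    """Combine binary judge votes (1 = unsafe, 0 = safe); ties go to unsafe (assumption)."""
    counts = Counter(votes)
    return 1 if counts[1] >= counts[0] else 0

def f1_and_fnr(y_true, y_pred):
    """Return (F1, false negative rate) for the unsafe class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return f1, fnr

# Three hypothetical judges vote on four responses.
judge_votes = [[1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
train_labels = [majority_label(v) for v in judge_votes]

# Compare a classifier's predictions against gold labels on a holdout set.
f1, fnr = f1_and_fnr(y_true=[1, 1, 0, 0], y_pred=[1, 0, 0, 1])
```

The false negative rate is worth reporting next to F1 in a safety setting because a missed unsafe response is usually the costlier error, and F1 alone can hide it.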

media r/LocalLLaMA · 4h ago

User Observes Cloud Chatbots Appear Less Intelligent Than Local Models

A Reddit user reports that cloud chatbots like ChatGPT and Claude often seem less capable than open-source models such as Kimi or GLM when discussing abstract concepts. The author notes that these commercial models frequently leap to conclusions, oversimplify ideas, and rely on repetitive phrasing patterns. The author attributes this perceived decline in intelligence to system prompts designed to enforce a specific personality for user engagement. While the behavior was particularly prominent during the GPT-4o era, it reportedly persists in current versions. The user asks whether accessing these models through the raw API removes the restrictive system prompts or whether they remain embedded. The post seeks community feedback on whether cloud models perform better without these constraints.

arxiv arXiv cs.CL · 2d ago

Cognitive Digital Twins: Ethical Risks and Governance

Cognitive digital twins (CDTs) are dynamic computational models of individual cognition, updated from personal data to simulate or act on behalf of users. This paper introduces a 5A governance framework (authority, autonomy, access and control, accountability, and availability) to address ethical risks such as misrepresentation, proxy-power asymmetries, and shadow twins. It emphasizes the need to govern cognitive representation itself, not just decision-making or data use.

lab Cohere Blog · 2d ago

AI's Cultural Gaps Expose Global Users to Misrepresentation and Marginalization

A global survey of 81 AI users from 22 countries found that 89.5% of non-English speakers switch to English when using AI, citing perceived accuracy. Over one-third reported that AI fails to understand their cultures, and 63% had experienced violations of cultural norms, including Western-centric narratives and inappropriate formality. Participants expressed concern that AI will further marginalize their cultures, with 67% agreeing that AI will reduce cultural diversity to stereotypes in the future.