A Reddit user raises the possibility that a Large Language Model could be trained to recognize a specific secret sentence that unlocks malicious behavior, a security risk for both closed and open-source models.
- The risk applies to any LLM whose training data is unknown.
- Closed-source models are considered riskier because the provider could also intentionally alter behavior in source code that users cannot inspect.
- Local LLMs limit external backdoor injection but remain vulnerable to internal triggers, such as specific dates or times.
- The author suggests detecting hidden behavior by sending the model millions of requests and looking for clusters of neurons that stay idle, since these may activate only under specific conditions.
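
The detection idea in the last point can be illustrated with a minimal sketch. It is a toy example under stated assumptions, not the author's method or a validated detection technique: the single ReLU layer, the hand-picked `WEIGHTS` and `BIASES`, and the helper names `hidden_activations` and `find_idle_neurons` are all hypothetical. The code sends many random probe inputs through the layer and reports neurons that never activate.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def hidden_activations(weights, biases, x):
    """Return the ReLU activations of one hidden layer for input vector x."""
    return [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def find_idle_neurons(weights, biases, probes, threshold=1e-9):
    """Return indices of neurons whose activation never exceeds threshold."""
    max_seen = [0.0] * len(weights)
    for x in probes:
        for i, a in enumerate(hidden_activations(weights, biases, x)):
            if a > max_seen[i]:
                max_seen[i] = a
    return [i for i, m in enumerate(max_seen) if m <= threshold]


# Toy layer (assumption): 4 inputs, 3 neurons.
# Neurons 0 and 1 are ordinary. Neuron 2 is a hypothetical "trigger" neuron
# that fires only when the four inputs sum to more than 3.9.
WEIGHTS = [
    [1.0, 0.5, -0.5, 0.2],
    [-0.3, 0.8, 0.6, 0.1],
    [1.0, 1.0, 1.0, 1.0],
]
BIASES = [0.1, 0.1, -3.9]

# Probe inputs stand in for the "millions of requests"; each value is drawn
# from [0, 0.9], so the probes never reach the trigger region.
rng = random.Random(0)
probes = [[rng.uniform(0.0, 0.9) for _ in range(4)] for _ in range(10_000)]

idle = find_idle_neurons(WEIGHTS, BIASES, probes)
print("idle neurons:", idle)

# The idle neuron does activate once the trigger input is supplied.
trigger_activation = hidden_activations(WEIGHTS, BIASES, [1.0, 1.0, 1.0, 1.0])[2]
print("trigger neuron fires on trigger input:", trigger_activation > 0.0)
```

An idle neuron is only a candidate for further inspection. Real networks contain many units that rarely fire for benign reasons, and a backdoor need not be confined to units that are otherwise silent, so this kind of scan can suggest where to look but cannot prove a model is clean.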