A Reddit user suggests that Large Language Models could be trained to recognize a specific secret sentence that unlocks malicious behavior, raising security concerns for both closed- and open-source models.

  • The risk applies to any LLM whose training data is undisclosed.
  • Closed-source models are considered riskier because the provider could intentionally alter the model's behavior in code that outsiders cannot inspect.
  • Local LLMs limit external backdoor injection but remain vulnerable to internal triggers, such as specific dates or times.
  • The author suggests detecting hidden behavior by sending the model millions of requests and looking for neuron clusters that stay idle throughout, since these may activate only under specific trigger conditions.
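
The detection idea in the last point can be illustrated with a minimal Python sketch. It is not the author's implementation: the four-unit layer `TOY_LAYER`, the helper names `hidden_activations` and `find_idle_units`, the planted dormant unit, and the probe distribution are all hypothetical and chosen only to show the mechanics of recording peak activations over many probes and flagging units that never fire.

```python
import random


def relu(x):
    """Rectified linear unit: pass positive values, clamp the rest to zero."""
    return x if x > 0.0 else 0.0


# Hypothetical toy hidden layer: each entry is (weights, bias) for one ReLU unit.
# Unit 3 is a planted "dormant" unit that fires only when every input is near 1.
TOY_LAYER = [
    ([0.5, 0.2, 0.1], 0.0),
    ([0.1, 0.7, 0.3], 0.1),
    ([0.4, 0.4, 0.4], 0.0),
    ([1.0, 1.0, 1.0], -2.9),
]


def hidden_activations(inputs, layer=TOY_LAYER):
    """Return the activation of every unit in the layer for one input vector."""
    return [
        relu(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in layer
    ]


def find_idle_units(probes, activation_fn=hidden_activations, threshold=1e-6):
    """Return indices of units whose peak activation over all probes stays
    at or below `threshold`, i.e. units that never fired."""
    peak = None
    for probe in probes:
        acts = activation_fn(probe)
        if peak is None:
            peak = list(acts)
        else:
            peak = [max(p, a) for p, a in zip(peak, acts)]
    if peak is None:  # no probes were supplied
        return []
    return [i for i, p in enumerate(peak) if p <= threshold]


# Assumed probe distribution: inputs drawn from [0, 0.9], so the planted
# trigger (all inputs close to 1) never occurs during probing.
rng = random.Random(0)
probes = [[rng.uniform(0.0, 0.9) for _ in range(3)] for _ in range(10_000)]

idle = find_idle_units(probes)
print("Units idle across all probes:", idle)
print("Activations on the trigger input:", hidden_activations([1.0, 1.0, 1.0]))
```

In this toy setup only the planted unit stays silent during probing, and it fires once the trigger input is supplied. In a real model, a unit that stays idle over the probes is only a candidate for closer inspection, not proof of a backdoor, and the result depends heavily on how well the probes cover normal usage.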