Large language models learn causal structure through a difference-making logic akin to the experimental method. Drawing on the vast amounts of text seen during training, this logic identifies which word sequences make a difference to an outcome and which do not. Architectural features such as token embeddings and self-attention support this inductive process by detecting patterns of variation and indifference in language.
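
The difference-making logic can be illustrated with a toy comparison. The sketch below is not how a language model is trained, and it makes no claim about what embeddings or self-attention compute internally. It only shows the underlying inference pattern under simplifying assumptions: if two contexts differ in exactly one token and the continuation changes, that position is counted as a difference-maker; if the continuation stays the same, the position is counted as indifferent. The observations and the function name `classify_positions` are hypothetical and chosen purely for illustration.

```python
from itertools import combinations

# Hypothetical toy data: (context tokens, observed next word).
OBSERVATIONS = [
    (("the", "glass", "fell", "and"), "broke"),
    (("the", "glass", "stayed", "and"), "shone"),
    (("a", "glass", "fell", "and"), "broke"),
]


def classify_positions(observations):
    """Label context positions by comparing contexts that differ in one token.

    A position is a "difference-maker" if varying only that token changes
    the outcome in at least one comparison, and "indifferent" if every such
    comparison leaves the outcome unchanged. Positions that are never
    varied in isolation receive no label.
    """
    verdicts = {}
    for (ctx_a, out_a), (ctx_b, out_b) in combinations(observations, 2):
        if len(ctx_a) != len(ctx_b):
            continue  # only equal-length contexts are comparable here
        differing = [i for i, (x, y) in enumerate(zip(ctx_a, ctx_b)) if x != y]
        if len(differing) != 1:
            continue  # need exactly one varied token, all else held fixed
        pos = differing[0]
        if out_a != out_b:
            verdicts[pos] = "difference-maker"  # overrides "indifferent"
        else:
            verdicts.setdefault(pos, "indifferent")
    return verdicts


if __name__ == "__main__":
    print(classify_positions(OBSERVATIONS))
```

In this toy data, swapping "fell" for "stayed" changes the continuation, while swapping "the" for "a" does not. The requirement that all other tokens be held fixed mirrors the controlled comparison of the experimental method; a real model would have to approximate such comparisons statistically over far noisier data.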