Proposing a unified open dataset instead of decentralized LLM training

The author argues that the open-source community should prioritize building a massive, high-quality pre-training dataset rather than attempting to coordinate decentralized LLM training across home GPUs. This shift is presented as a more practical and immediate response to recent government bans on commercial frontier models and a scarcity of small-to-medium open-weight releases.

The author dismisses distributed training on consumer hardware as infeasible in the near term, because training algorithms that tolerate high-latency networks still require fundamental research.
The proposed alternative is a set of BitTorrent-style clients that scrape, clean, and host data from the internet.
The goal is a global database containing trillions of tokens that is openly available and hosted across individual computers.
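The scrape, clean, and host steps above can be sketched roughly as follows. This is a minimal illustration and not the author's design: the function names (`clean_text`, `build_shard`, `verify_shard`), the `min_chars` threshold, the crude tag-stripping filter, and the JSON-lines shard layout are all assumptions made for the example. The one idea borrowed from BitTorrent is content addressing: a shard is identified by the hash of its bytes, so any peer can check a downloaded copy without trusting the host.

```python
import hashlib
import json
import re


def clean_text(raw: str) -> str:
    """Strip HTML-like tags and collapse whitespace (a deliberately crude filter)."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def build_shard(raw_docs, min_chars=20):
    """Clean, deduplicate, and pack documents into one content-addressed shard.

    Returns (shard_id, payload), where shard_id is the SHA-256 hex digest
    of the payload bytes. min_chars is a hypothetical quality threshold.
    """
    seen = set()
    docs = []
    for raw in raw_docs:
        text = clean_text(raw)
        if len(text) < min_chars:
            continue  # drop fragments too short to be useful
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already in the shard
        seen.add(digest)
        docs.append(text)
    # One JSON-encoded string per line (assumed layout, chosen for simplicity).
    payload = "\n".join(json.dumps(d, ensure_ascii=False) for d in docs).encode("utf-8")
    shard_id = hashlib.sha256(payload).hexdigest()
    return shard_id, payload


def verify_shard(shard_id: str, payload: bytes) -> bool:
    """Check that bytes received from a peer match the advertised shard id."""
    return hashlib.sha256(payload).hexdigest() == shard_id
```

A client would call `build_shard` on a batch of scraped pages, advertise the resulting `shard_id`, and serve the payload to peers, who run `verify_shard` before storing it. Real pipelines would need far stronger filtering and near-duplicate detection than this exact-match sketch provides.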

Such a dataset would serve as a significant statement against large corporations hoarding data and VRAM, and it would also accelerate future distributed training efforts.