It’s not easy. LLMs take so much training data that at this point, their training data is basically every publicly available book, every blog on the internet, pretty much all of Tumblr, Reddit, Stack Overflow, and every forum you can think of. Even then, some LLMs need more data still, so companies have started outright stealing it - pirating books, downloading dumps from Anna’s Archive, and so on.
So no, no billion-dollar company can make its own training data. Even if you plug in every email ever sent on Gmail, Google still won’t have enough data to train a good LLM. So they go with the cheaper option: training data that has already been collected, sorted, cleaned, and labeled.
In one sense, that’s again taking others’ hard work - rather than cleaning their own data, they use public datasets. In another sense, even that isn’t enough.

Yeah, they really need to start building RAG-supported models. That way they can actually show where they’re getting their data, and even pay the sources fairly. Imagine a RAG pipeline or MCP server connecting to Wikipedia, one to encyclopedia.com, and one to Stack Overflow.
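
To make that concrete, here’s a minimal sketch of the retrieval half of that idea in Python, assuming the `requests` library and Wikipedia’s public search API (which is real); the prompt-building and the source-attribution bookkeeping are illustrative, not any actual product’s implementation:

```python
# Minimal RAG-with-attribution sketch: retrieve passages from Wikipedia's
# public search API and build a prompt that forces the model to cite them.
import requests

WIKI_API = "https://en.wikipedia.org/w/api.php"

def retrieve(query: str, n: int = 3) -> list[dict]:
    """Fetch the top-n Wikipedia search hits; each result keeps its source URL."""
    resp = requests.get(WIKI_API, params={
        "action": "query", "list": "search",
        "srsearch": query, "srlimit": n, "format": "json",
    })
    resp.raise_for_status()
    hits = resp.json()["query"]["search"]
    return [
        {
            "title": h["title"],
            "snippet": h["snippet"],  # HTML-highlighted excerpt from the article
            "url": "https://en.wikipedia.org/wiki/" + h["title"].replace(" ", "_"),
        }
        for h in hits
    ]

def build_prompt(question: str, passages: list[dict]) -> str:
    """Assemble a prompt whose numbered passages double as an audit trail."""
    context = "\n".join(
        f"[{i + 1}] {p['snippet']} (source: {p['url']})"
        for i, p in enumerate(passages)
    )
    return (
        "Answer using ONLY the numbered passages below, citing them like [1].\n\n"
        f"{context}\n\nQuestion: {question}"
    )

if __name__ == "__main__":
    passages = retrieve("large language model training data")
    print(build_prompt("Where does LLM training data come from?", passages))
    # Because every passage carries its URL, you know exactly which sources
    # the answer drew on - which is what would let you compensate them.
```

The point of the design is that attribution falls out for free: retrieval happens at query time against a known source, so there’s a record of exactly what was used, instead of the source being dissolved into model weights during training.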