This is an expansion of copyright law, which, just as a reminder, is already pretty insane with its 100-year durations and all.
People will readily sink the boat the AI companies are in without realizing they're in the same boat too.
If copyright were significantly shorter, then I could see the case for adding more restrictions.
> and liable legally—when they breach consumer privacy, collecting, monetizing or sharing personal information without express consent
This part is even more important. Personal data is being used to train models. It's all very dystopian, with a cyberpunk flavor.
We failed to stop Microsoft and Facebook from using our private data and WhatsApp messages to train their algorithms. Now we need to learn from the mess they created and stop Microsoft and OpenAI from using our conversations with AI to train their models, build LLM versions of ourselves, and sell them to banks, recruiters, or anyone willing to pay good money to get inside our minds.
Imagine if we stole all the documents stored on Google's private servers and all their proprietary code, research, and everything they've built, and used it to create a new company called Poogle that competes directly with them.
And just like that, after 24 hours of stealing all their IP, we launch:
- Poogle Maps
- Poogle Search
- Poogle Docs
- Poogle AI
- Poogle Phone
- Poogle Browser
And here's the funny part: we claim the theft of their data is "fair use" because we changed the company's name and rewrote their code in another language.
Doesn't sound right, does it? So why are Microsoft (OpenAI, Anthropic) and Google financing the biggest act of IP theft in the history of the internet and telling people and businesses that stealing their private data and content to build competing products is somehow "fair use"?
Just like accountants log every single transaction, companies should log every book, article, photo, or video used to train their models, and compensate the copyright holders every time that content is used to generate something new.
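A minimal sketch of what that kind of ledger could look like, assuming a simple append-only JSONL file (every name here is hypothetical, invented for illustration; no vendor exposes anything like this today):

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class TrainingRecord:
    """One entry in a hypothetical training-data ledger."""
    content_hash: str   # fingerprint of the ingested work
    source: str         # where it came from
    rights_holder: str  # who would be compensated
    license: str        # terms it was obtained under
    ingested_at: str    # when it entered the training set

def log_training_item(ledger_path: str, raw_bytes: bytes, source: str,
                      rights_holder: str, license: str) -> TrainingRecord:
    """Append an auditable record for a work used in training."""
    record = TrainingRecord(
        content_hash=hashlib.sha256(raw_bytes).hexdigest(),
        source=source,
        rights_holder=rights_holder,
        license=license,
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )
    with open(ledger_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return record
```

Logging ingestion like this is the easy half; attributing each generated output back to specific training works, so that per-generation compensation is even computable, is the genuinely hard part.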
The whole "our machines are black boxes, they’re so intelligent we don't even know what they're doing" excuse doesn't cut it anymore.
Stop with the nonsense. It's software, not voodoo.
Also, did OpenAI make its API publicly available to generate revenue, or to share responsibility and distribute the ethical risk with developers, startups, and enterprise customers, hoping that widespread use would eventually influence legal systems over time?
Let's be honest: the US government and defense sector have massive budgets for AI, and OpenAI could have taken that route, just like SpaceX did. Especially after claiming they're in a tech war with China. But they didn't, which feels contradictory and raises some red flags.
I bet the OpenAI employees are struggling to answer this one. Double standards?
With all that stolen stuff, I could also write a book, "How Google Works," about what kinds of processes Google has, how they feed into different products, and how Googlers feel about them.
I think that actually would be fair use. I could similarly have an LLM trained on all that data help me write that book. It would still be fair use.
Clamping down on fair use by restricting LLM training is stealing from the public to give to the copyright holders. Copyright holders already have recourse when somebody publishes unlicensed copies of their works: takedowns and the courts.
No, just because something benefits others doesn't mean it's morally or legally right.
Poor analogy. Also, AI companies do hobble their models so they can't, e.g., draw Mickey Mouse.
So are you saying the theft is selective and intentional, and they don't target Disney because Disney has a global army of top lawyers? You've just reinforced my point.
The fact that they hardcoded rules in their logic to prevent companies with top lawyers from taking them to court is a testament to how well they know that what they're doing is illegal.
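For illustration only, the "hardcoded rules" being described could be as crude as a blocklist check in front of the model. This is a hypothetical sketch, not how any vendor actually implements refusals:

```python
# Hypothetical refusal guard: block prompts naming heavily litigated
# characters before they ever reach the model. Purely illustrative.
BLOCKED_TERMS = {"mickey mouse", "darth vader"}

def generate(prompt: str) -> str:
    # Stand-in for the actual model call.
    raise NotImplementedError

def guarded_generate(prompt: str) -> str:
    if any(term in prompt.lower() for term in BLOCKED_TERMS):
        return "Sorry, I can't draw or describe that character."
    return generate(prompt)
```

Real systems reportedly combine filters like this with training-time tuning, but the external effect the parent comment describes is the same: certain names trigger a refusal.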
This strawman is so terrible it's hard to figure out where to start.
> we stole all the documents stored on Google's private servers and all their proprietary code, research, and everything they've built
This would mostly be covered by trade secret law—not copyright. In the interest of continuing, I will, however, pretend that none of that is considered trade secrets.
> used it to create a new company called Poogle that competes directly with them.
Yes, you can create stuff based on documentation. You can copy something one-for-one in functionality as long as the implementation is different.
> we claim the theft of their data is "fair use" because we changed the name of the company
Yes, avoiding trademark infringement is important.
> rewrote their code in another language.
This is probably fine as long as the new code isn't substantially similar to (i.e., a mechanical translation of) the old code.
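To make "substantially similar" concrete, here is a hedged illustration: the first function mirrors a hypothetical original line for line (the mechanical translation that likely stays too close), while the second independently expresses the same behavior:

```python
# Hypothetical original (say, Java):
#   int total = 0;
#   for (int i = 0; i < items.length; i++) {
#       if (items[i].price > limit) { total += items[i].price; }
#   }

# Mechanical translation: same structure, same names, arguably still
# substantially similar to the original expression.
def sum_over_limit_mechanical(items, limit):
    total = 0
    for i in range(len(items)):
        if items[i].price > limit:
            total += items[i].price
    return total

# Independent reimplementation: identical observable behavior, but a
# different expression of the idea.
def sum_over_limit(items, limit):
    return sum(item.price for item in items if item.price > limit)
```

Copyright protects the expression rather than the functionality, which is the distinction the comment is drawing.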
It's not clear what your opinion is on this topic. Do you even have one?
Finally glad to see big-name politicians rally around this. That this is a bipartisan effort was extremely surprising to see.
This would basically grant Facebook and Google a monopoly on AI -- they'll put training on your material into their TOS and then be the only players with enough market power to get adequate amounts of training material.
They _already have_ a monopoly on it, by design.
The data is definitely a critical piece, but they are the only companies with the cash, hardware and talent to train frontier AI models from scratch. (The models that are fine-tuned by everyone else, to be clear.)
I don't see that changing either; there is no incentive to make training cheaper and more accessible.
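As a rough sketch of that asymmetry: "everyone else" typically adapts a pretrained base with a parameter-efficient method like LoRA, which fits on commodity hardware. This assumes the Hugging Face transformers and peft libraries, and the model id is illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Start from someone else's pretrained weights (illustrative model id);
# the from-scratch pretraining is the part that needs frontier-scale cash.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA trains small adapter matrices instead of all base parameters.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% trainable
```

Pretraining the base model itself means trillions of tokens across thousands of accelerators, which is exactly the moat the parent comment is describing.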
I was hoping that this bill would make it possible to seek legal action _retroactively_ over copyrighted data in training data sets, but, yeah, as noted here, this will amount to a clause in an optional-but-not-optional EULA giving them "permission" to do what they were already doing, perhaps even more flagrantly.
It would grant China an even bigger victory since China's models do not have to abide by any US copyrights.
Google and Microsoft are using proxy companies to steal all the copyrighted content ever produced, and you're blaming China, or suggesting it'd be worse if they did it? Right.
I never blamed China. Without copyrighted material to train on, China will be the AI winner, leaving American AI in the dust due to an insufficiency of training data.
Indeed, and perhaps not just China.
The US started off not acknowledging foreign copyrights for a long time -- until it had a large enough base of material it wanted reciprocally protected.
If not adopting these rules grants you the ability to produce SOTA AIs while most of the US can't, we can expect non-adoption to be widespread.
This actually gives me a little hope -- the US cutting its own throat this way versus other countries would be better than granting Google and Facebook a monopoly.
This is an AI-killer bill that would hand China a decisive victory.
This narrative is nonsense. You are not Oppenheimer, and China is not building an AI bomb.
I never said China is doing something evil. The point is that American AI will be left in the dust without copyrighted training data to train on, whereas China will have no such restriction, and so the Chinese AI will win.