OpenAI strikes Reddit deal to train its AI on your posts

return2ozma@lemmy.world · 9 months ago

RizzRustbolt@lemmy.world · 9 months ago

Those poor silicon atoms…

AutoTL;DR@lemmings.world · 9 months ago

This is the best summary I could come up with:

OpenAI has signed a deal for access to real-time content from Reddit’s data API, which means it can surface discussions from the site within ChatGPT and other new products.

It’s an agreement similar to the one Reddit signed with Google earlier this year that was reportedly worth $60 million.

The deal will also “enable Reddit to bring new AI-powered features to Redditors and mods” and use OpenAI’s large language models to build applications.

Recently, following news of a partnership between OpenAI and the programming messaging board Stack Overflow, people were suspended after trying to delete their posts.

No financial terms were revealed in the blog post announcing the arrangement, and neither company mentioned training data, either.

That last detail is different from the deal with Google, where Reddit explicitly stated it would give Google “more efficient ways to train models.” There is, however, a disclosure mentioning that OpenAI CEO Sam Altman is also a shareholder in Reddit but that “This partnership was led by OpenAI’s COO and approved by its independent Board of Directors.”

The original article contains 334 words, the summary contains 174 words. Saved 48%. I’m a bot and I’m open source!

villainy@lemmy.world · 9 months ago

“Strikes” made me think they were cancelling the deal. Like strike-through, crossed it out, etc. Too bad.

Dr. Moose@lemmy.world · 9 months ago

This form of propaganda is my pet peeve. It’s not “your posts”. As soon as you put something out in public, you don’t get to have your cake and eat it too. It’s out there, you shared it. Don’t share it if you don’t want humanity to ingest and use it.

Azzu@lemm.ee · edit-2 9 months ago

It’s not about it being used to train AI. It’s about the AI either not being open source/I don’t get access to it (i.e. not benefitting me) or reddit being paid for my comments (i.e. also not benefitting me).

If this AI training would get me or the public access to the AI, or I would be paid for my comments instead of Reddit, I’d be fine with it.

Dr. Moose@lemmy.world · edit-2 9 months ago

Yeah, but you don’t get to choose that. You give away that right as soon as you participate in public discourse. It’s a zero sum game: either it’s public for everyone or for no one.

Don’t get me wrong, Reddit is a bitch but I think people want to cut their noses off to spite their faces here. It’s much more important to have free information flow than to fuck reddit.

My fear is that people will vote in some really dumb rules to spite AI and restrict free information flow accidentally.

Azzu@lemm.ee · edit-2 9 months ago

That’s how it is currently and maybe also your opinion. But that doesn’t mean it has to be like that in a society. It’s your opinion that everything public can go private at any time (training proprietary private AI), but we can decide as a society that’s not how we want to do things. We can require stuff that used public data to be public as well.

And yeah, I kinda get to choose that. As a democratic society, anything that the public (i.e. including me) decides, goes. Of course, if there are people like you that don’t want stuff trained on public data to be required to be public, democracy will also work in the sense that we don’t get that, as it is currently.

Dataprolet@lemmy.dbzer0.com · 9 months ago

You’re technically right, but nobody anticipated, and therefore nobody agreed to, their posts being used for training LLMs.

FJT@lemmy.world · 9 months ago

So it’s going to be a libtarded libtard AI that doesn’t represent the majority of the people, got it.

Seasoned_Greetings@lemm.ee · 9 months ago

The beauty of being here on lemmy is that I genuinely can’t tell whether you said this because you’re far right or because you’re far left

Stupid opinion either way. That AI is going to catch its share of r/conservative idiots and be a nice blend of ignorance

db2@lemmy.world · 9 months ago

Not my posts. Go ahead, look at what remains. The rest was edited and then deleted.

Fuck you, Steve. Right in the ass.

yeehaw@lemmy.ca · 9 months ago

If only snapshots and backups were a thing…

Todgerdickinson@lemmy.world · 9 months ago

Yeah, that’s the problem, isn’t it? I had a great idea involving bullshit-ifying my comments by editing them slowly and repeatedly over months with an LLM via a long-running script.

I realised that they probably don’t delete the original text on edit anyway, which, as you say, is probably buried in a backup someplace.

Ace! _SL/S@ani.social · 9 months ago

I don’t think it is in backups only. My guess is they store your full edit history for each comment/post/whatever. Newest one will be shown on the frontend, rest is for data vampires

yeehaw@lemmy.ca · 9 months ago

This is it exactly. Edits, to us, are “changed”. To the back end, it’s just an iteration while the rest still exist.
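To make the guess above concrete, here is a minimal sketch of an append-only edit history, where editing adds a revision and the front end shows only the newest one. Nothing in the thread confirms that Reddit stores comments this way; the `Comment` class and its method names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class Comment:
    """Hypothetical comment model: edits are appended, never overwritten."""
    revisions: List[Tuple[datetime, str]] = field(default_factory=list)

    def edit(self, text: str, when: Optional[datetime] = None) -> None:
        # Each edit is an insert, not an update, so older text stays in storage.
        self.revisions.append((when or datetime.now(timezone.utc), text))

    def visible_text(self) -> str:
        # The front end only ever shows the newest revision.
        return self.revisions[-1][1]


comment = Comment()
comment.edit("original insight")
comment.edit("[redacted in protest]")
```

In this model, `comment.visible_text()` returns the protest text while `comment.revisions[0]` still holds the original, which is the scenario the commenters are worried about.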

CeeBee@lemmy.world · 9 months ago

It’s theoretically possible, but the issue that anyone trying to do that would run into is consistency.

How do you restore the snapshots of a database to recover deleted comments but also preserve other comments newer than the snapshot date?

The answer is that it’s nearly impossible. Not impossible, but not worth the massive monumental effort when you can just focus on existing comments which greatly outweigh any deleted ones.

yeehaw@lemmy.ca · 9 months ago

It’s a piece of cake. Some code along the lines of:

if ($user.modifyCommentRecentlyCount > 50) {
    print "user is nuking comments";
    $comment = $previousComment;
}

Or some shit. It can be done quite easily, trust me.

CeeBee@lemmy.world · 9 months ago

It can be done quite easily, trust me.

The words of every junior dev right before I have to spend a weekend undoing their crap.

I’ve been there too many times.

There are always edge cases you need to account for, and you can’t account for them until you run tests and then verify the results.

And you’d be parsing billions upon billions of records. Not a trivial thing to do when running multiple tests to verify. And ultimately for what is a trivial payoff.

You don’t screw around with infinitely invaluable prod data of your business without exhausting every single possibility of data modification.

It’s a piece of cake.

It hurts how often I’ve heard this and how often it’s followed by a massive screw up.

yeehaw@lemmy.ca · 9 months ago

The words of every junior dev right before I have to spend a weekend undoing their crap.

There are so many ways this can be done that I think you are not thinking of.

Say a user goes to “shreddit” their comments (or uses some other similar app). They likely have thousands. On every comment edit, it’s quite easy to check the last time the user edited one of their comments. All they need is some check, like whether the last 10 consecutive comments were edited hours apart or milliseconds/seconds apart. After that, reddit could easily just tell the user it’s editing their comments when it’s not. Like a shadowban kind of method.

Another way would be at the data structure level. We don’t know what their databases and hardware are like, but I can speculate. What if each user-edited comment is not an update query on a database, but an add/insert? Then all you need to do is update the live comments to the version dated before the malicious date, where username=$username. Not to mention, when you start talking Nimble storage and stuff like that, the storage is extremely quick to respond. Hell, I would wager it didn’t even hit storage yet; it’s probably still on some all-flash cache or in memory.

Another way could be at the filesystem level. Ever heard of zfs? What if each user had their own dataset or something? It’s extremely easy and quick to roll back a snapshot, or to clone the previous snapshot. There are so many ways.

At the end of the day a user is triggering this action, so we don’t necessarily need to parse “billions” of records. Just the records for a single user.
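The "add/insert instead of update" idea can be sketched roughly as follows, assuming a hypothetical append-only table of `(username, comment_id, edited_at, text)` rows. The row layout, the function name `live_text_before`, and the sample data are all made up for illustration; nobody in the thread knows Reddit's actual schema.

```python
from datetime import datetime

# Hypothetical append-only rows: (username, comment_id, edited_at, text)
rows = [
    ("alice", 1, datetime(2023, 1, 1), "original"),
    ("alice", 1, datetime(2023, 6, 12), "overwritten by script"),
    ("alice", 2, datetime(2023, 2, 1), "another comment"),
    ("bob", 3, datetime(2023, 3, 1), "unrelated"),
]


def live_text_before(rows, username, cutoff):
    """For one user, pick the newest revision of each comment older than `cutoff`."""
    latest = {}
    for user, comment_id, edited_at, text in rows:
        # Only this user's rows matter, and only revisions before the cutoff.
        if user != username or edited_at >= cutoff:
            continue
        previous = latest.get(comment_id)
        if previous is None or edited_at > previous[0]:
            latest[comment_id] = (edited_at, text)
    return {comment_id: text for comment_id, (_, text) in latest.items()}


restored = live_text_before(rows, "alice", datetime(2023, 6, 1))
```

The sketch loops over every row for simplicity. The point being argued is that a real store would filter on the username first, so only one user's records are examined.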

CeeBee@lemmy.world · edit-2 9 months ago

There are so many ways this can be done that I think you are not thinking of.

No, I can think of countless ways to do this. I do this kind of thing every single day.

What I’m saying is that you need to account for every possibility. You need to isolate all the deleted comments that fit the criteria of the “Reddit Exodus”.

How do you do that? Do you narrow it down to a timeframe?

The easiest way to do this is identify all deleted accounts, find the backup with the most recent version of their profile with non-deleted comments, and insert that user back into the main database (not the prod db).

Now you need to parse billions upon billions upon billions of records. And yes, it’s billions because you need the system to search through all the records to know which record fits the parameters. And you need to do that across multiple backups for each deleted profile/comment.

It’s a lot of work. And what’s the payoff? A few good comments and a ton of “yes this ^” comments.

I sincerely doubt it’s worth the effort.

Edit: formatting

yeehaw@lemmy.ca · 9 months ago

How do you do that? Do you narrow it down to a timeframe?

When a user edits a comment, they submit a response. When they submit a response, they trigger an action. An action can do validation steps and call methods, just like I said above, for example. When the edit action is triggered, check the timestamp against the previously edited comment’s timestamp. If the previous edit, or the previous 5, happened within a given timeframe, flag it. “Shadowban” the user. Make it look to them like they’ve updated their comments, but in reality they’re the same.
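A rough sketch of the timestamp check described above might look like this. The function name `looks_like_bulk_edit` and the thresholds (5 edits within 10 seconds) are assumptions picked for illustration, not anything Reddit is known to use.

```python
from datetime import datetime, timedelta


def looks_like_bulk_edit(edit_times, window=timedelta(seconds=10), min_edits=5):
    """Return True if the most recent `min_edits` edits all fall within `window`."""
    if len(edit_times) < min_edits:
        return False
    recent = sorted(edit_times)[-min_edits:]
    # A script edits many comments in seconds; a person takes far longer.
    return recent[-1] - recent[0] <= window


start = datetime(2023, 6, 12, 12, 0, 0)
script_edits = [start + timedelta(seconds=i) for i in range(5)]
human_edits = [start + timedelta(hours=i) for i in range(5)]
```

Here `looks_like_bulk_edit(script_edits)` flags the burst, while `looks_like_bulk_edit(human_edits)` does not. A tool that paces its edits slowly would slip past a check this simple.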

We’ve had detection methods for this sort of thing for a long time. Think about how spam filtering works. If you’re using some tool to scramble your data, it likely leaves patterns. To think reddit doesn’t have some means to protect itself against this is naive. It’s their whole business. All these user submitted comments are worth money.

Now you need to parse billions upon billions upon billions of records. And yes, it’s billions because you need the system to search through all the records to know which record fits the parameters. And you need to do that across multiple backups for each deleted profile/comment.

This makes me think you don’t understand my meaning. I think you’re talking about reddit one day deciding to search for and restore obfuscated and deleted comments. Yes, that would be a large undertaking. This is not what I’m suggesting at all. Stop it while it’s happening, not later. Patterns and trends can easily identify when a user is doing something like shreddit or the like, then the code can act on it.

It’s a lot of work. And what’s the payoff? A few good comments and a ton of “yes this ^” comments.

this

Blackmist@feddit.uk · 9 months ago

They always were.

Only now they’ve agreed to pay Reddit for it. This is what their third party lockdown was really all about.

They’re helping themselves to your Lemmy comments for free, as that’s just how it’s designed. If you post anything publicly anywhere, it’s getting slurped up by a bot somewhere.

Chadus_Maximus@lemm.ee · edit-2 9 months ago

What if I say the word gasp fuck?

Blackmist@feddit.uk · 9 months ago

Well, they’ve probably got filters that remove all that before it teaches their AI to swear. So you need to be more subtle for 𝑓ucks sake.

14th_cylon@lemm.ee · 9 months ago

These fuckers see it as well. Fuckity fuckity fuck.

just another dev@lemmy.my-box.dev · 9 months ago

I’m not a lawyer. But isn’t the reason they had to go to reddit to get permission that users hand over ownership to reddit the moment you post? And since there’s no such clause on Lemmy, they’d have to ask the actual authors of the comments for permission instead?

Mind you, I understand there’s no technical limitation that prevents bots from harvesting the data, I’m talking about the legality. After all, public does not equate to public domain.

Blackmist@feddit.uk · 9 months ago

Well the legality seems to be something you can ignore when you have billions of dollars in VC money to fritter around.

It certainly didn’t stop them hoovering up music and movies, and the owners of those have a lot more power than any of us do.

Tech is fast, the law is slow, and you can make many times the cost of lawyers and fines by the time anybody gets around to telling you to stop it.

GamingChairModel@lemmy.world · 9 months ago

users hand over ownership to reddit the moment you post

Not ownership. Just permission to copy and distribute freely. Which basically is necessary to run a service like this, where user-submitted content is displayed.

And since there’s no such clause on Lemmy, they’d have to ask the actual authors of the comments for permission instead?

It’s more of a fuzzy area, but simply by posting on a federated service you’re agreeing to let that service copy and display your comments, and sync with other servers/instances to copy and display your comments to their users. It’s baked into the protocol, that your content will be copied automatically all over the internet.

Does that imply a license to let software be run on that text? Does it matter what the software does with it, like display the content in a third party Mobile app? What about when it engages in text to speech or braille conversion for accessibility? Or index the page for a search engine? Does AI training make any difference at that point?

The fact is, these services have APIs, and the APIs allow for the efficient copying and ingest of the user-created information, with metadata about it, at scale. From a technical perspective obviously scraping is easy. But from a copyright perspective submitting your content into that technical reality is implicit permission to copy, maybe even for things like AI training.

just another dev@lemmy.my-box.dev · 9 months ago

Thanks for that clarification. I was afraid it would be that murky.

Alimentar@lemm.ee · 9 months ago

Well even if it was a legal argument, they wouldn’t care. Like Facebook and all the rest. They say they don’t share your data but we all know that’s a lie

interdimensionalmeme@lemmy.ml · 9 months ago

They are public communication platforms, how could they not share your data publicly?

Everythingispenguins@lemmy.world · 9 months ago

Some day historians will be able to look back at this moment and be able to determine it was what caused ChatGPT to become horny and weird.

assassin_aragorn@lemmy.world · 9 months ago

Only an idiot would decide to mindlessly trawl Reddit to train an LLM. They’ll be confused when their model is suddenly confidently wrong about everything, and they’ll have no clue why.

Everythingispenguins@lemmy.world · 9 months ago

You are a hundred percent right, but how many idiots are there out there?

assassin_aragorn@lemmy.world · 9 months ago

Uncountably many

frickineh@lemmy.world · 9 months ago

My comment history was like 50% shitposting about the beauty industry and 50% hating on Christian fundamentalists. There’s honestly no way it won’t make AI at least a little bit worse, and I’m not mad about it.

Flying Squid@lemmy.world · 9 months ago

That AI is going to be super anti-Christian fundamentalist (or possibly just anti-Christian), so maybe there is an upside.

leftzero@lemmynsfw.com · 9 months ago

Meh, good luck with that.

All my Reddit comments have just said “Comment redacted in protest against Reddit’s deranged attacks against third party apps, the community, and common sense. See you’ll in Lemmy or Kbin once this embarrassment of a site is done enshittifying itself out of existence. Monetize this, u/spez, you greedy little pigboy. 🖕” since I edited them before moving here. 🤷‍♂️

hessenjunge@discuss.tchncs.de · 9 months ago

You better double check. I just found out that only my comments with few upvotes are still that way, the others have been restored.

A script replacing them with random words might do the trick.

ManOMorphos@lemmy.world · 9 months ago

I replaced all my comments with the same phrase before deleting them with PowerDeleteSuite. The comments were fully restored and visible through a google search (but not visible through the user page). My posts were not restored, AFAIK.

This was during the whole 3rd party API thing. Maybe it was just something done during that time, but they certainly got around the edit replacement trick before.

FierySpectre@lemmy.world · 9 months ago

That’s assuming the old comments are actually overwritten instead of just marked as ‘old’

filister@lemmy.world · edit-2 9 months ago

What makes you think that they are not scraping Lemmy too? The only reason they might not be is probably how niche Lemmy and the fediverse are, but I am sure there have been people already doing it.

AlexWIWA@lemmy.ml · 9 months ago

Lemmy is even easier to scrape. Just set up your own instance, then read the database after ActivityPub pushes everything to you.

kia@lemmy.ca · 9 months ago

I’m sure they are, but Reddit probably provides these companies with lots of personalized metadata they collect just for them which they may not get from Lemmy.

Dr. Moose@lemmy.world · 9 months ago

Fediverse is designed to do exactly that. It’s free flow of information which is a good thing. Don’t let corporations hijack this beautiful concept. We all want information to be free.

olympicyes@lemmy.world · 9 months ago

I’m not mad about the scraping. The LinkedIn scraping case pretty much cemented that there was nothing that could be done to stop it. I’m just mad that I can no longer use the app of my choice. No such problem with Lemmy.

Patrick@lemmy.today · 9 months ago

Everyone wants a piece of the AI pie…

jabathekek@sopuli.xyz · 9 months ago

Little did they know, the pie was just a hallucination.

FaizalR@kbin.social · 9 months ago

But Reddit is full of NSFW content.

just another dev@lemmy.my-box.dev · 9 months ago

Not through the API.

Possibly linux@lemmy.zip · 9 months ago

And the problem is?

Dark_Dragon@lemmy.dbzer0.com · edit-2 9 months ago

Reddit banned me through my IP address or something. Whatever new account I create gets banned within 24 hours, even if I don’t upvote a single post or comment. I tried 10 new accounts, all with new email addresses, and all were banned. So I gave up and randomly changed all my good comments. Shifted permanently to Lemmy. I miss some of the most niche communities, but not enough to return to Reddit.

Edit: I didn’t even violate any rules. I just took too long to switch away from a modded Reddit app, and I only logged in once. That doesn’t justify blocking me from ever using Reddit.

Nougat@fedia.io · 9 months ago

Then they will learn that Spez doesn’t get to profit from me anymore.

FaceDeer@fedia.io · 9 months ago

You think they don’t have the originals archived?

Mastengwe@lemm.ee · 9 months ago

Isn’t this news like every month?
