After OpenAI released their Whisper API to convert speech to text, it was obvious that they would use it to gather more text-based data from spoken contexts such as videos. More data, and by more I mean every possible amount of data, seems to be the driving force behind the latest push towards large language models like ChatGPT. But what happens if someone comes along and says: actually, I’d like my data removed from your training set, assuming they can prove it was used in the first place?

Let’s briefly tackle the latter first: from a legal perspective, it is very difficult to prove that OpenAI has used your blog post, video, etc. to train their models. It is very likely, since they ingest huge amounts of data and the GPT family of models is able to answer many questions without needing to access the internet. Even with access to the internet, the retrieved context is only used to bias the generated answer towards the right direction, without any guarantees. Recently, an Australian mayor decided to sue OpenAI for defamation after ChatGPT generated false content claiming he had gone to prison.

This raises the question: how can you address GDPR concerns such as the right to erasure? And if things go wrong, who is responsible?

Throwing huge amounts of data into a very large neural network and then trying to remove specific information is like trying to find a needle in a black hole.

The rise of GPT-based models could be another classic example of Silicon Valley’s motto: cause as much disruption as possible and worry about the fallout later. Consider the disruption Uber caused, and how it had to deal with its drivers being classified as employees. In this case, OpenAI, a seemingly open, later capped-profit, later proprietary research organisation turned tech company, is doing the same. I would support the idea that the companies behind the models should be responsible, as the models are their creation. They have the full behind-the-scenes story concerning the training data and the algorithms used. To make things more transparent, I also propose a simple initial step:

  • Let scraping for machine learning training respect the robots.txt file, add a directive stating whether the content is allowed to be used for machine learning, and assume no consent in its absence (see the sketch below).
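
To make this concrete, here is a minimal sketch of what such a check could look like on the crawler side. The `ML-Training` directive and the example site are hypothetical, purely illustrative names, not part of any existing robots.txt standard; the sketch simply assumes a crawler looks for an explicit opt-in flag and skips the site when it is missing.

```python
import urllib.request


def ml_training_allowed(site: str) -> bool:
    """Check a hypothetical 'ML-Training: allow' directive in robots.txt.

    Absence of the directive is treated as *no consent* for ML training,
    which is the default proposed above.
    """
    try:
        with urllib.request.urlopen(f"{site}/robots.txt", timeout=10) as resp:
            robots = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # No robots.txt or unreachable site: assume no consent.
        return False

    for line in robots.splitlines():
        # Strip comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = (part.strip().lower() for part in line.split(":", 1))
        if field == "ml-training":  # hypothetical directive
            return value == "allow"

    return False  # directive missing: no consent


if __name__ == "__main__":
    site = "https://example.com"
    if ml_training_allowed(site):
        print(f"{site}: content may be used for ML training")
    else:
        print(f"{site}: skip, no consent for ML training")
```

Treating the absence of the directive as a “no” mirrors how consent works under the GDPR: it has to be explicit, not assumed.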

Actors with bad intentions will of course not respect it, but then at least we don’t have to pretend that companies like OpenAI are acting in good faith either. This still does not address the current situation: how can one prove that their data has been scraped? Identifying the training data behind machine learning models such as large language models is an active area of research. A 2020 paper titled Extracting Training Data from Large Language Models, with reputable authors from different institutions, including OpenAI, highlights exactly this issue. The research obviously approaches the problem from a technical, research perspective rather than a legal one. But connecting the two should not be a huge stretch for the upcoming regulation of these systems.
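
For intuition, here is a minimal sketch of one heuristic in the spirit of that paper: comparing a model’s perplexity on a candidate text against the text’s zlib compressibility, so that passages the model finds unusually predictable, but which are not simply repetitive, stand out as possibly memorized. It uses the small public GPT-2 model via Hugging Face transformers for illustration; the model choice, function name, and scoring details are my own simplifications, not the paper’s exact pipeline.

```python
import zlib

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load a small public model; the paper's attacks target much larger models.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def memorization_score(text: str) -> float:
    """Rough memorization signal: model perplexity vs. zlib compressibility.

    Text that the model finds surprisingly easy (low perplexity) but that
    compresses poorly (high zlib entropy) is a candidate for memorized
    training data. Lower scores are more suspicious.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    perplexity = torch.exp(loss).item()
    zlib_entropy = len(zlib.compress(text.encode("utf-8")))
    return perplexity / zlib_entropy


if __name__ == "__main__":
    candidate = "My blog post paragraph that may have ended up in a training set."
    print(f"score = {memorization_score(candidate):.4f}")
```

In practice one would rank many candidate texts by a score like this and manually inspect the most suspicious ones, which is roughly the workflow the paper describes; turning such a signal into legal proof is exactly the open question.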