How Much Data Are AI Models Using From You?
Key Takeaways
- AI models, especially large language models (LLMs), rely on vast datasets scraped from the internet, including social media, websites, and public forums, often without users' explicit consent.
- Personal data such as posts, comments, and profiles can be used to train AI, raising issues of transparency and of control over how individual data is used.
- The lack of clear regulations on data sourcing for AI training fuels debates about fairness, ownership, and potential misuse of personal information.
When you interact with AI models like OpenAI’s ChatGPT, Google’s Gemini, or even Midjourney, it can be easy to forget what happens to the data you provide.
But did you know that this data can be used to train future versions of these models? Every conversation, image, and text prompt you submit may be analyzed to improve these AI systems.
We live in a world where almost everyone uses social media. For a long time, that did not seem like a risky habit. With the rise of AI, however, that is starting to change.
Nowadays, anything you post online, such as a blog or a social media post, may end up in a model’s training dataset without your permission. Even if you delete the post, a model could retain that content, compromising your privacy and control.
The Meta Scandal
In 2024, Meta revealed that it had been using publicly shared posts dating back to 2007 to train its AI models. The news caused a backlash, as many users felt their privacy had been violated: they were unaware their content was being used this way.
The situation was further complicated by regional differences in data protection laws. European users, protected by strict regulations like the General Data Protection Regulation (GDPR), could opt out of having their data used for AI training. Users in other regions, such as the United States, lacked similar protections, leaving many feeling frustrated and helpless.
The Meta case is a clear violation of privacy, one of the core principles of AI ethics.
How Should Data Be Collected?
If companies use data without clear consent, they risk compromising people’s trust and well-being. Privacy issues, like those in the Meta case, can arise at multiple steps of the AI lifecycle.
Collecting user content without permission breaches privacy principles, and failing to remove sensitive or personal information before training further erodes user trust. Once this unfiltered data is used to train a model, the problems carry over to every subsequent stage of development.
During training, the model may memorize private or sensitive content and later reproduce it in its outputs, worsening the problem.
In the monitoring and maintenance phase of a deployed model, clear mechanisms must be provided to delete or update user data upon request. Failing to do so can lead to the misuse of outdated or withdrawn data, causing problems like the one Meta faced.
By embedding privacy principles into every phase of development, companies can prevent harm, build trust, and ensure their AI systems protect individuals while serving society responsibly.
How to Stop Your Data From Being Used by AI
If you want to stop your data from being used for training purposes, it is actually quite easy:
ChatGPT
On the ChatGPT website, click your account icon in the top-right corner, open Settings, and scroll down to Data Controls. From there, toggle off “Improve the model for everyone.” After that, your conversations will not be used to train future versions of ChatGPT.
Gemini
For Gemini, on either the website or the mobile app, open the settings and turn off the option that allows your activity to be used to improve the model.
Midjourney
Midjourney, an AI image generator, is a bit more involved, but the good news is that it lets you opt out of data sharing entirely. If you are on the Pro or Mega plan, you can also switch on stealth mode to keep your images private.
Final Thoughts
So, what insights can we take away from all this?
First, as users of AI technology, it is important to understand how our data is being utilized. Second, while there are usually ways to opt out of data sharing or its use in training, these options may not always be clearly visible.
As AI becomes more involved in our everyday lives, these opaque data practices increasingly affect our work, health, and safety. So, as machines continue learning to investigate, negotiate, and communicate, the companies behind them must commit to operating ethically.