How Do We Reconcile Data Minimization with Artificial Intelligence in This Age of Big AI Models?

Data has in recent times become a highly valuable resource, shaping individuals, nations, economies and the world at large. In this 21st century, Artificial Intelligence (AI) has no doubt become a significant milestone of the technology age: computers can now perform tasks that typically require human intelligence, such as learning, reasoning, problem solving and decision making. AI uses machine learning algorithms to identify patterns in large amounts of data, which enables it to automate tasks and provide a wide range of services. This has led to the emergence of large AI models, systems such as Google's Gemini, OpenAI's GPT-5 and Anthropic's Claude, which perform complex tasks within short periods of time, tasks that would normally require human intelligence. All of these models depend on the constant availability of vast quantities of data.

Data minimization, however, a cornerstone of modern privacy regulation, is the principle of collecting, processing and storing only the minimum amount of personal data necessary for a specific purpose. This principle, which aims to protect privacy and reduce the risk of data breaches, limits the amount of data available to these AI models. That limitation can constrain model performance, because the more data a model has access to, the more it can learn about the nuances and complexities of the knowledge domain in which it is designed to operate. The need for more data has therefore created a tension between data protection and AI performance: while data minimization aims to protect individual privacy, the rise of large AI models challenges this principle. Finding the balance between innovation and privacy is crucial for ethical AI development. So what exactly is data minimization, and why does it matter?

Data minimization, as established under Article 5(1)(c) of the General Data Protection Regulation (GDPR), requires that personal data be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed.¹ It is a privacy principle that means collecting, using and storing only the least amount of data necessary for a specific purpose, and it is embodied in data protection laws such as the GDPR, which forbids organizations from gathering more data than they need. For example, if an app needs just an email address to create an account, asking for a home address or phone number would violate data minimization. Data minimization matters today for reasons ranging from privacy concerns to legal and ethical obligations. On the privacy side, AI models often learn from data scraped from the internet, which might include personal or copyrighted information, meaning that personal information can end up in training sets without the owner's prior consent. Unfiltered data scraped from the web may also encode systemic biases, discrimination and misinformation; training models on such datasets risks amplifying social inequality rather than reducing it.² Without data minimization, there is a heightened risk of data leaks, bias and surveillance. Legal and ethical obligations reinforce the principle as well: laws like the GDPR and the California Consumer Privacy Act (CCPA) require data minimization, so for AI models to be lawful, developers must find ways to train them ethically while following these rules. Doing so serves to protect people's privacy.
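To make the principle concrete, the short Python sketch below shows what data minimization at the point of collection could look like for the email-only signup example above. It is only an illustration: the field names, the validation rules and the function itself are assumptions made for this example, not drawn from any particular law or product.

```python
# A minimal sketch (hypothetical field names, no real framework) of data
# minimization at the point of collection: the signup handler accepts only
# the single field it needs for its stated purpose and rejects anything more.
REQUIRED_FIELDS = {"email"}  # the stated purpose: creating an account


def create_account(form_data: dict) -> dict:
    """Store only what is necessary for account creation."""
    extra = set(form_data) - REQUIRED_FIELDS
    if extra:
        # Surplus data (e.g. phone number, home address) is refused outright.
        raise ValueError(f"Fields not needed for this purpose: {sorted(extra)}")
    missing = REQUIRED_FIELDS - set(form_data)
    if missing:
        raise ValueError(f"Missing required fields: {sorted(missing)}")
    return {"email": form_data["email"]}  # nothing else is retained


# create_account({"email": "user@example.com"})                       # accepted
# create_account({"email": "user@example.com", "phone": "555-0100"})  # rejected
```

The design choice the sketch illustrates is that minimization is enforced at collection time, before surplus data is ever stored, rather than cleaned up afterwards.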

Now let us look at large AI models and how they rely on vast amounts of data. Artificial Intelligence sits at the center of today's technological transformation, capable of reasoning, predicting and creating in ways that redefine the boundaries of technology and law. Big AI models are large-scale artificial intelligence systems like GPT-5, Claude and Gemini that are trained on massive datasets containing text, images, video or code gathered from across the internet. These models rely on big data to learn patterns, understand language and generate responses. Large AI models such as OpenAI's GPT series and Google's Gemini illustrate the scale of data consumption driving modern intelligent systems: they learn not through explicit programming, but by absorbing massive amounts of data drawn from the internet, books and human interactions. In practice, this means that data uploaded to the internet by any means, whether through e-books, e-journals, social media or other sources, can be swept into the corpora these models train on and used to carry out whatever functions are required of them. However, this reliance on unbounded data sits in tension with one of the fundamental principles of data protection: data minimization. Data minimization aims to limit the personal data that is collected, while AI models need large volumes of data to carry out tasks effectively.

So how do we reconcile data minimization with artificial intelligence in this age of big AI models? Some recent product features already point toward this reconciliation. The Temporary Chat feature in ChatGPT (GPT-5), for example, ensures that a conversation does not appear in chat history, does not use or update ChatGPT's memory, and is not used to train the models; such chats are kept for at most 30 days. Meta AI in WhatsApp similarly offers disappearing messages. These features are practical applications of data minimization that still preserve the usefulness of the models. Beyond product features, reconciling data minimization with big AI models requires, first, a more balanced interpretation of the data minimization principle, one that aligns accountability with innovation. This does not mean restricting data collection entirely; rather, data collection should be guided by intentionality, purpose, transparency and proportionality. Recent research demonstrates that smaller, high-quality datasets can produce models with comparable or superior performance to models trained on massive, uncurated data.³ By focusing on data quality instead of data quantity, efficiency and privacy can both be maintained. This shift would help bring AI development back within the ethical boundaries emphasized by privacy laws.

Technological developments can also support this reconciliation. Privacy-preserving methods such as Federated Learning, Differential Privacy and Synthetic Data Generation allow models to learn from data without direct exposure to it. Federated Learning enables decentralized training: the data remains on local devices and only model updates are shared. Differential Privacy adds calibrated statistical noise so that no individual's data can be reverse-engineered from the model's output.⁴ Synthetic Data Generation creates artificial datasets that mimic real distributions without containing personal information. Each of these innovations embeds data minimization as a design principle for responsible AI; a brief sketch of two of them follows below.
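The Python sketch below is a toy illustration, not a faithful implementation, of two of these ideas: federated averaging, in which only locally trained model weights leave each client, and the Laplace mechanism from differential privacy, in which noise scaled to a query's sensitivity hides any single individual's contribution. All names and parameter values are assumptions made for this example.

```python
# Toy illustration of two privacy-preserving techniques (NumPy only).
import numpy as np


# --- Federated learning (simplified federated averaging) -------------------
# Each client trains on its own device; raw data never leaves it. Only the
# locally trained weights (here, toy vectors) are shared and averaged.
def federated_average(client_weights):
    """Average locally trained weight vectors into a single global model."""
    return np.mean(np.stack(client_weights), axis=0)


# --- Differential privacy (Laplace mechanism) ------------------------------
# A counting query has sensitivity 1 (one person changes the count by at most
# one), so adding Laplace noise with scale 1/epsilon makes the released count
# differentially private.
def dp_count(values, threshold, epsilon=1.0):
    """Return a noisy count of how many values exceed `threshold`."""
    true_count = float(np.sum(np.asarray(values) > threshold))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise


if __name__ == "__main__":
    # Three hypothetical clients contribute only their model updates.
    updates = [np.array([0.9, 1.1]), np.array([1.0, 0.8]), np.array([1.2, 1.0])]
    print("Global model:", federated_average(updates))

    # A noisy aggregate statistic is released instead of raw personal records.
    ages = [23, 37, 41, 52, 29, 61]
    print("Noisy count over 40:", dp_count(ages, threshold=40, epsilon=0.5))
```

In a production system the shared updates would themselves be clipped and noised (as in differentially private training), but the sketch captures the core idea of both methods: the raw personal data never leaves its source.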

Thirdly, regulators and international bodies should require developers, before they release their models to the public, to disclose the nature and sources of their training data. The aim is transparency, so that users know what a model was built on before they choose to use it. Some regulators are already moving in this direction. The European Union's Artificial Intelligence Act, for instance, introduces transparency requirements for general-purpose AI systems, obligating developers to disclose the nature and sources of their training data.⁵ Although this legislation does not yet enforce strict minimization, it is a move towards accountability. Similarly, the United States Federal Trade Commission (FTC) has warned that misuse of data during AI training can amount to an unfair or deceptive practice, and has urged care in how training data is collected.⁶
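As a rough illustration of what such a disclosure might contain, the record below sketches a possible training-data summary in Python. The field names and values are invented for this example; they are not taken from the AI Act, the FTC, or any other authority.

```python
# Hypothetical training-data disclosure record (illustrative fields only).
training_data_disclosure = {
    "model_name": "example-model",  # placeholder name
    "data_sources": ["licensed e-books", "public web crawl", "opt-in user chats"],
    "contains_personal_data": True,
    "legal_basis_for_processing": "consent / legitimate interest",
    "retention_period_days": 30,
    "curation_steps": ["deduplication", "PII redaction", "bias and toxicity filtering"],
}

# A regulator or user could then inspect, for example:
# print(training_data_disclosure["data_sources"])
```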

Different parts of the world also follow different ethical and regulatory approaches to AI, prioritizing different values. The United States emphasizes innovation and competition, the European Union prioritizes human rights, while China integrates data regulation with state control. These divergent approaches produce inconsistent standards for AI ethics. Without coordinated rules, companies may exploit jurisdictional gaps by training their models in regions with weaker protections, creating an environment where the rights of individuals depend more on geography than on principle. Instead of fragmented global governance, there should be consistent standards for AI ethics across nations and economies. Such a framework can be built through transnational dialogue, accountability mechanisms, and a shared commitment to privacy and innovation across borders. Education about data rights should be promoted as well, so that citizens and the public at large understand the implications of their digital footprint; through informed consent, data minimization can become a living practice. Finally, the concept of data trusts offers a promising model: entities that collectively manage and license data on behalf of individuals. By granting communities a voice in how their data is used, such mechanisms can transform privacy from a passive right into an active form of digital citizenship.

In conclusion, history shows that legal and ethical boundaries evolve with technology. Just as environmental law emerged to constrain industrial pollution, data protection law must adapt to constrain informational pollution. Data minimization should be viewed as a principle of sustainability for the information age, one that supports rather than stifles the development and use of big AI models that already solve many real problems. When people are assured that their data is protected, through consistent standards for AI ethics, data trusts, a balanced interpretation of data minimization, federated learning, differential privacy and similar measures, they will be more willing to use these models without fear that their data rights will be infringed, and so embrace technological development. People should also be educated about the implications of their digital footprint, and developers should make their data sources transparent. Together, these steps will help reconcile data minimization with big artificial intelligence models.

Works Cited

Carvalho, T., et al. “Differentially-Private Data Synthetisation for Efficient Re-Identification Risk Control.” Machine Learning, vol. 114, 2025.

European Commission. “AI Act – Shaping Europe’s Digital Future.” 2025.

Federal Trade Commission. “Aiming for Truth, Fairness, and Equity in Your Company’s Use of AI.” 2023.

Gebru, T., et al. “Datasheets for Datasets.” 2021.

General Data Protection Regulation (EU) 2016/679, art. 5(1)(c).

Hassan, M. M., and S. Bin Hasan. “SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy.” 2024.
