There is no artificial intelligence (AI) without data. Yet policy makers around the world struggle to govern the data that underpins various types of AI (Office of the Privacy Commissioner of Canada 2023). At the national level, government officials in many countries have not yet figured out how to ensure that the large and often global data sets underpinning AI are governed in an effective, interoperable, internationally accepted and accountable manner. At the international level, policy makers have engaged in negotiations but have made little progress. As a result, despite the centrality of data to AI, data governance and AI governance are disconnected.
This essay1 examines the implications of this incoherence. Starting with an overview, the author then focuses on why data for AI is so difficult to govern. Next, the author examines the data governance challenges presented by AI and discusses why international data governance is a work in progress.
Most of the efforts to govern AI say relatively little about data, including the EU AI Act (Hunton Andrews Kurth 2024) and US President Joe Biden’s executive order on AI (The White House 2023a). Given the importance of data to economic growth, data governance is a key component of twenty-first-century governance. Moreover, how nations govern data has implications for the achievement of other important policy objectives, from protecting national security to advancing human rights (Jakubowska and Chander 2024; Aaronson 2018, 2022).
There is no internationally accepted definition of data governance. The United Nations defines data governance as “a systemic and multi-dimensional approach to setting policies and regulations, establishing leadership for institutional coordination and national strategy, nurturing an enabling data ecosystem, and streamlining data management” (Yao and Park 2020). The World Bank (2021) notes that data governance consists of four main tasks: strategic planning; developing rules and standards; creating mechanisms of compliance and enforcement; and generating the learning and evidence needed to gain insights and address emerging challenges.
Policy makers have been governing various types of data for centuries. Recent research by the Digital Trade and Data Governance Hub examined 68 countries and the European Union from 2019 to 2022. The authors found that data governance is a work in progress. Most nations protect specific types of data, such as intellectual property (IP) or personal data, but are in the early stages of creating institutions and enforcement mechanisms to ensure that governance of data is accountable, democratically determined and effective. Additionally, many developing countries struggle to implement existing data laws and regulations (LaCasse 2024). Finally, countries have few binding data governance mechanisms at the international level (Struett, Aaronson and Zable 2023). These enforcement problems and governance gaps have become more visible since the popularization of generative AI, which is built on data scraped from around the Web (global data sets). Policy makers have struggled to protect personal and proprietary data taken through Web scraping, yet they have no means of ensuring that the globally scraped data is as accurate, complete and representative as possible (Aaronson 2024a).
Why Is Data Used in AI So Difficult to Govern?
Data Is Multidimensional
Data can simultaneously be a good and a service, an import and an export, a commercial asset and a public good. There are many different types of data, and policy makers must figure out how to protect certain types of data (such as personal or proprietary data) from misuse or oversharing while simultaneously encouraging such sharing in the interests of mitigating “wicked problems” — problems that are difficult for one nation alone to address because they transcend borders and generations (Aaronson 2022). When raw data is organized, it becomes information — information that society uses to grow economies, hold governments to account and solve wicked problems. Researchers see tremendous potential in the use of AI built on data to address such problems, but only if data is shared across borders.
Data for AI Is Multinational
Large language model (LLM) applications such as the chatbot ChatGPT are built on many different sources of data. Moreover, data and algorithm production, deployment and use are distributed among a wide range of actors from many different countries and sectors of society who together produce the system’s outcomes and functionality. These LLMs thus sit at the base of a global supply chain, with numerous interdependencies among those who supply data, those who control data, and those who are data subjects or content creators (Cobbe, Veale and Singh 2023).
Data Markets Are Opaque
Researchers and policy makers have little information about the demand, supply or value of much of the data that underpins the data-driven economy. In addition, most entities collect personal and non-personal data yet reveal very little about the data they collect. Here, again, generative AI provides a good example. LLMs are generally constructed from two main pools of data (pre-filtered data sets). The first pool comprises data sets created, collected or acquired by the model developers. This pool of data can be considered proprietary because it is owned and controlled by the LLM developer. It may include many different types of data from many different sources, as well as computer-generated (synthetic) data created to augment or replace real data to improve AI models, protect sensitive data and mitigate bias (Martineau and Feris 2023). The second pool is made up of Web-scraped data, which is essentially a snapshot of a sample of the Web at a given moment in time. Although these scrapes provide a broad data sample, it is hard to determine whether the sample is accurate, complete and representative of the world’s data, a particular problem for generative AI.
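To make the two-pool structure concrete, here is a minimal Python sketch of how a training corpus might be assembled from a proprietary pool and a Web-scraped snapshot. It is an illustration only: the file paths, record format and toy quality filter are hypothetical assumptions, not any developer’s actual pipeline.

```python
# Illustrative sketch only: a toy view of how an LLM training corpus
# might combine a proprietary pool with a Web-scraped snapshot.
# All paths, field names and thresholds here are hypothetical.
import json

def load_jsonl(path):
    """Yield one record per line from a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def passes_quality_filter(record, min_chars=200):
    """Toy stand-in for the pre-filtering step: drop very short
    documents. Real pipelines also deduplicate, detect language,
    and screen for personal or copyrighted content."""
    return len(record.get("text", "")) >= min_chars

def build_corpus(proprietary_path, web_scrape_path):
    # Pool 1: data the developer created, collected or acquired.
    corpus = list(load_jsonl(proprietary_path))
    # Pool 2: a snapshot of a sample of the Web at one moment in time.
    corpus += [r for r in load_jsonl(web_scrape_path)
               if passes_quality_filter(r)]
    return corpus
```

Even this toy version illustrates the governance gap the essay describes: a filter can drop low-quality records, but nothing in the pipeline can verify that the scraped snapshot is accurate, complete or representative.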
Data Is Both Plentiful and Precious
On one hand, data is plentiful because almost every entity today — whether a government, a non-governmental organization such as Save the Children or a business such as Spotify — collects data about its stakeholders.2 These same entities often use AI to analyze the data they have collected. On the other hand, governments and firms are taking steps to make data less plentiful. For example, policy makers increasingly recognize that large pools of data can be used to make predictions about a country’s behaviour or to manipulate its citizens. As a result, countries such as Australia (Hammond-Errey 2022), Canada,3 China (Cai 2021), the United Kingdom (Geropoulos 2023) and the United States (Busch 2023) now see such pools of data as a security risk as well as a privacy risk.
Data-Driven Sectors Are Built on Information Asymmetries
Firms with more computing power are better positioned to extract and use data. They have the expertise, the finances and generally the data to utilize AI. Moreover, firms with more data are more likely to create new data-driven goods and services, which, in turn, generate more data and more market power. This phenomenon also applies across countries. Only some 20 firms possess the cloud infrastructure, computing power, access to capital and vast troves of data needed to develop and deploy tools to create LLMs (Staff in the Bureau of Competition & Office of Technology 2023). These firms are also concentrated in a few advanced economies in Asia, Europe and North America. As a result, a few companies with expertise in generative AI could hold outsized influence over a significant swath of economic activity (Staff in the Bureau of Competition & Office of Technology 2023; Hacker, Engel and Mauer 2023; Khan and Hanna 2023). Without incentives, these companies may not be motivated to ensure that their data sets are broadly representative of the people and data of the world.
How Is AI Altering Data Governance?
AI is constantly evolving and has become a key element of many goods and services (Wharton Online 2022; McKinsey & Company 2023). Many analysts now view some variants of AI as a general-purpose technology — a technology that can affect not just specific sectors, but also the economy as a whole (Crafts 2021; Hötte et al. 2023). Because of the growing importance of AI to economic growth, government officials in many countries are determined to develop policies that advantage their AI firms over those of other countries. This phenomenon, called “AI nationalism,” appears to be leading several countries to alter their data policies (Aaronson 2024b; The Economist 2024; Hogarth 2018; Spence 2019).
Only two governments, China (Gamvros, Yau and Chong 2023) and the European Union,4 have approved comprehensive AI regulation. Brazil, Canada and the United States, among others, are considering such regulation. But many of these efforts say very little about data. Some governments, such as Japan5 and Singapore (Norton Rose Fulbright 2021), are so determined to encourage AI that they have declared that copyrighted works may be scraped to train generative AI.
Generative AI is created from two types of data: proprietary data that may include personal and copyrighted information from sources collected and controlled by the AI developer; and Web-scraped data. Developers do not have direct consent to utilize some of the Web-scraped personal and proprietary data (Argento 2023). Meanwhile, governments in countries such as Canada, the United Kingdom and the United States are investigating the collection of such data for generative AI (Aaronson 2024a).
Policy makers have not yet figured out whether to encourage open-source or closed (proprietary) AI models. All model developers provide some information about their models to be considered scientifically rigorous, but developers of open-source models provide greater detail about how they trained and filtered data and then developed their LLMs. Policy makers recognize that there are benefits and costs to both approaches. Open-source models make it easier for outside researchers to utilize and improve a particular model and, consequently, may facilitate further research, while closed-source models are generally considered to be more reliable and stable (Davis 2023).6 Some governments, including France (Robertson 2023) and the United Arab Emirates (Barrington 2023; The National News 2023), tout their support of an open-source approach to AI. The US government sought public comment and suggested that open-source models generally pose marginal risks, but that it should actively monitor any risks that arise (National Telecommunications and Information Administration 2024, 2–3).

China has done more than any other country to link data governance to its governance of generative AI (O’Shaughnessy and Sheehan 2023). The country requires AI service providers to:
- use data and foundation models from lawful (legitimate) sources;
- not infringe others’ legally owned IP;
- obtain personal data with consent or under situations prescribed by the law or administrative measures;
- take effective steps to increase the quality of training data and enhance its truthfulness, accuracy, objectivity and diversity; and
- obtain consent from individuals whose personal information was processed.7
The European Union’s newly approved AI regulations will require providers of high-risk systems to disclose more information about data provenance. In October 2023, the Biden administration issued an executive order on AI (The White House 2023b). Although the executive order mentioned data 76 times, it said very little about how data should be governed, except to say that personal data and IP should be protected.
In the name of national security, governments of countries such as China, the United Kingdom and the United States are making it harder to access large pools of personal or proprietary data (Sherman et al. 2023). In Biden’s executive order, the administration promised to consider the national security implications of the use of data and data sets in the training of generative AI models and to make recommendations on how to mitigate the related risks (The White House 2023b, section 4.4 B). If this approach takes hold, AI developers will be less able to create accurate, complete and representative data sets (Aaronson 2024a).
The State of Global Data Governance and AI
The platform on which data services flow is a “commons,” but policy makers in most nations have not focused on creating shared rules. Data generally flows freely among nations, but policy makers in a growing number of countries are erecting barriers to these flows. Internationally accepted rules would provide AI developers with certainty.
US policy makers first pushed for shared rules on cross-border data flows in 1997 with the Framework for Global Electronic Commerce.8 Policy makers from the Organisation for Economic Co-operation and Development then established global principles (Thompson 2000; Organisation for Economic Co-operation and Development and the Inter-American Development Bank, chapter 13),9 which were incorporated in various bilateral and regional trade agreements, such as the Digital Economy Partnership Agreement among Chile, New Zealand and Singapore,10 and the Comprehensive and Progressive Agreement for Trans-Pacific Partnership11 among 11 Pacific-facing nations. These agreements stipulated that nations should allow the free flow of data among signatories, with long-standing exceptions to protect privacy, national security, public morals and other essential domestic policy goals.
In 2017, 71 nations began participating in the Joint Statement Initiative on e-commerce at the World Trade Organization (WTO). Today, some 90 members of the WTO are negotiating shared international provisions regarding cross-border data flows. These negotiations are being led by small open economies such as Australia and Singapore. Although the world’s two largest economies and leading AI nations, the United States and China, are participating, they are not key demandeurs of an agreement. The parties have made progress: in July 2024, participants agreed to what they called a “stabilized text.” It includes language on personal data but no binding language regarding the free flow of data. The text says nothing about AI.12
As noted above, the United States has led global efforts since 1997 to encourage rules governing the free flow of data, as well as exceptions to those rules. US policy makers argued that such rules would advance human rights, stimulate economic growth and clarify when nations could block such flows.13 However, in November 2023, the United States announced that while it would continue to negotiate such rules, it was seeking clarity and policy space to regulate the business practices of its data giants. Hence, the country could no longer support certain provisions on data flows, encryption and source code. With this new position, the United States seemed to be saying that the existing exceptions did not give it (and other nations) sufficient policy space for domestic regulation of data-driven technologies and business practices (Lawder 2023). Some argued that the United States was becoming more like China and India — nations that have long pushed for data sovereignty (Chander and Sun 2023; Mishra 2023). However, the Biden administration’s October 2023 executive order did direct the US government to work internationally to set standards for the data underpinning AI (The White House 2023b, section 11).
Under international trade rules, a country cannot ban a product or service unless it can argue that such a ban is necessary to protect public health, public morals, national security or other domestic policy objectives. In a rare move, Italy banned ChatGPT in 2023 for about a month, arguing that the AI application violated EU data protection laws. But in January 2024, the Italian data protection authority, the Garante, announced it had finished its investigation and stated that OpenAI, the chatbot’s developer, had 30 days to defend its actions (Reuters 2024).
Meanwhile, policy makers are negotiating other agreements on AI, but these agreements are not focused on data. For example, in November 2023, some 18 countries, as well as the major AI firms, reached consensus on a non-binding plan for safety testing of frontier AI models (Satter and Bartz 2023). In November 2021, members of the United Nations Educational, Scientific and Cultural Organization (2021) adopted a non-binding recommendation on AI ethics.
Conclusion
Many of the world’s people are simultaneously excited and scared by AI. They recognize that the technology could improve their quality and standard of living, but they also fear it could be misused (Kennedy 2023). Policy makers in many countries are responding to that ambivalence with policies to reduce risk, make AI safer, and ensure that AI is developed and deployed in an accountable, democratic and ethical manner. Yet policy makers do not seem to focus on data governance as a tool to accomplish these goals.
Why is data governance so disconnected from AI? This essay began by asserting several reasons: data is difficult to govern because it is multidimensional; data for AI is multinational; data markets are opaque and built on information asymmetries; and data is simultaneously plentiful and precious. The author noted that countries have different levels of expertise and will to govern data, yet because data sets are global, policy makers must find common ground on rules. This sounds great on paper — but in the real world, the most influential AI powers are not leading efforts to govern data across borders. For example, China, India and the United States want policy space to govern data, data-driven technologies and data flows. In addition, many officials appear more concerned about their countries’ competitiveness in AI than about ensuring that the tedious process of negotiating internationally accepted rules on data succeeds.
Hence, the author concludes this essay with a warning. Without such rules, it will be harder for AI developers to create accurate, complete and representative data sets. In turn, without accurate, complete and representative data sets, AI applications may continue to have significant flaws and inaccuracies. Users and policy makers may, over time, lose trust in the technology. And without trust, users and investors may turn to other methods for analyzing the world’s data. If that were to happen, the world’s people could fail to realize the full value of data.