This article is a part of a Statistics Canada and CIGI collaboration to discuss data needs for a changing world.
Innovation feeds on data, both identified personal data and de-identified data. To protect that data from increasing privacy risks, governance structures are emerging to allow the use and sharing of data necessary for innovation while addressing those risks. Two frameworks proposed to fulfill this purpose are data trusts and regulatory sandboxes.
The Government of Canada introduced the concept of “data trust” into the Canadian privacy law modernization discussion through Canada’s Digital Charter in Action: A Plan by Canadians, for Canadians, to “enable responsible innovation.” At a high level, a data trust may be defined, according to the Open Data Institute, as a legal structure that is appropriate to the data sharing it is meant to govern and that provides independent stewardship of data.
Bill C-11, known as the Digital Charter Implementation Act, 2020, and tabled on November 17, 2020, lays the groundwork for the possibility of creating data trusts for private organizations to disclose de-identified data to specific public institutions for “socially beneficial purposes.” In her recent article “Replacing Canada’s 20-Year-Old Data Protection Law,” Teresa Scassa provides a superb overview and analysis of the bill.
Another instrument for privacy protective innovation is referred to as the “regulatory sandbox.” The United Kingdom’s Information Commissioner’s Office (ICO) provides a regulatory sandbox service that encourages organizations to submit innovative initiatives without fear of enforcement action. From there, the ICO sandbox team provides advice related to privacy risks and how to embed privacy protection.
Both governance measures may hold the future of privacy and innovation, provided that we accept this premise: de-identified data may no longer be considered irrevocably anonymous and therefore should not be released unconditionally, but the risk of re-identification is so remote that the data may be released under a governance structure that mitigates the residual privacy risk.
Innovation Needs Identified Personal Data and De-identified Data
The role of data in innovation does not need to be explained. Innovation requires a full understanding of what is, to project toward what could be. The need for personal data, however, calls for far more than an explanation. Its use must be justified. Applications abound, and they may not be obvious to the layperson. Researchers and statisticians, however, underline the critical role of personal data with one word: reliability.
Processing data that can be traced, either through identifiers or through pseudonyms, allows superior machine learning, longitudinal studies and essential correlations, which provide, in turn, better data in which to ground innovation. Statistics Canada has developed a “Continuum of Microdata Access” to its databases on the premise that “researchers require access to microdata at the individual business, household or person level for research purposes. To preserve the privacy and confidentiality of respondents, and to encourage the use of microdata, Statistics Canada offers a wide range of options through a series of online channels, facilities and programs.”
Since the first national census in 1871, Canada has put data — derived from personal data collected through the census and surveys — to good use in the public and private sectors alike. Now, new privacy risks emerge, as the unprecedented volume of data collection and the power of analytics bring into question the notion that the de-identification of data — and therefore its anonymization — is irreversible.
And yet, data to inform innovation for the good of humanity cannot exclude data about humans. So, we must look to governance measures to release de-identified data for innovation in a privacy-protective manner.
Identified versus De-identified Data
It used to be so simple: identified data is data that can be traced to an identifiable individual. De-identified data and anonymous data, used as synonyms, are data that cannot be traced to an identifiable individual. Anonymous data is liberated for unrestricted use and disclosure because no privacy interests can attach to it and no harm can result from its use and disclosure.
The entire privacy protection regime rests upon that distinction.
The General Data Protection Regulation (GDPR), applicable throughout the European Union and the European Economic Area, excludes from its scope “anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” To determine whether a data subject is identifiable, “account should be taken of all the means reasonably likely to be used … either by the controller or by another person to identify the natural person, directly or indirectly.” We highlight the wording reasonably likely.
The United Kingdom’s ICO defines anonymized data as “data that does not itself identify any individual and that is unlikely to allow any individual to be identified through its combination with other data.” We highlight the phrase unlikely to allow any individual to be identified.
In the United States, even the most restrictive privacy law, the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, excludes de-identified personal health information from its protection (section 164.514(a)). It defines health information as de-identified if it is not “individually identifiable,” that is, “if it does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual.”
Currently, Canadian privacy law protects only information about an identifiable individual. It means that information so effectively severed from any identifier, direct or indirect, that it is no longer “about an identifiable individual” is considered “made anonymous” and equated to “destroyed.” Consequently, it may be used without restriction.
But while lawyers may be smug in this purportedly obvious categorization predicated upon the assumption that de-identified data is anonymous, technologists no longer accept it, arguing that no method completely protects de-identified data from re-identification. A 2019 Imperial College London study, “Estimating the Success of Re-identifications in Incomplete Datasets Using Generative Models,” took a direct shot at the legal notion of anonymous data. It found that “even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.” The “release-and-forget model” is the process where the data is de-identified, disclosed and never traced back.
If we accept that de-identification leaves a remote risk of re-identification and that innovation needs data derived from personal data, which de-identified data is, “responsible innovation” offers two options. We can either accept the risk and adopt the pragmatic privacy protection approach described in the ICO’s Anonymisation: Managing Data Protection Risk Code of Practice, or reject the risk and adopt a cautious approach, subjecting the release of de-identified data to a governance structure for stewardship that mitigates even remote privacy risks.
The United Kingdom’s ICO accepts the technological reality that complete and permanent anonymization may not be achievable but weighs the risk of re-identification against the benefits of data use for innovation. It recognizes anonymization where “the risk of identification … is remote.” In fact, on the strength of UK case law, the ICO expressly states that it does not require anonymization to be completely risk-free where preponderant social benefits are at stake.
The cautious approach is exemplified in the increasing establishment of governance frameworks to release identified, pseudonymized and de-identified data for innovation under commensurate safeguards to mitigate risk.
This is where the data trust comes in.
Lawyers participating in the Open Data Institute report Data Trusts: Legal and Governance Considerations exclude de-identified data from protection where “all reasonable means (taking into account cost and technology) that someone is likely to use to identify a person including by ‘singling out’ (i.e. identifying a person other than by name or address)” have been applied. The data trust’s protection is therefore reserved for situations in which even a theoretical risk of re-identification remains. Anonymous data, where anonymization can be achieved, is released from the shackles of privacy interests to allow its free flow to foster innovation.
Data Trusts as Instruments of Innovation
Canada’s Digital Charter in Action: A Plan by Canadians, for Canadians puts data trusts at the service of three policy objectives. The first is “putting data to use for Canadians.” Data trusts are presented as governance structures for data sharing “to drive scientific discovery, to better understand the world, and to make informed policy decisions,” particularly “in areas such as health, clean technology or agri-business.”
The second is “keeping Canadians safe and secure.” Data trusts are put forward as governance mechanisms “to adapt principles-based law to particular sectors, activities, or technologies, and to make frameworks more agile.”
The third objective engaging data trusts is “supporting innovation.” In this context, data trusts serve to allow increased sharing of de-identified information for “strengthening Canada’s innovation ecosystem and promoting collaborative and inclusive growth.”
An abundance of practical applications shows how data trusts can support innovation in the public interest. As an example of a bilateral data trust, the UK’s Alan Turing Institute uses data trusts to govern its use of specific personal data sets, provided by a public institution, in artificial intelligence research. Under data trust provisions, it uses the data sets to improve machine learning. The improved machine learning, in turn, produces better information to support public policies or private research. As an example of a multilateral data trust, the European Genome-phenome Archive (EGA) in Barcelona provides experts all over the world with access to personal data of healthy and ill individuals for research. It uses a “Data Use Ontology” to regulate access to and use of the data. Twenty categories of access apply, from “no restriction” based on broad consent, to conditions of non-publication or, on the contrary, a commitment to make the results of the research public. The EGA constitutes a remarkable engine for accelerating innovation in medical research within strict privacy safeguards.
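A small sketch can make this kind of granular, condition-based access concrete. The category names, fields and checks below are hypothetical illustrations only; they are not the EGA’s actual Data Use Ontology terms or access logic.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class UseCondition(Enum):
    """Illustrative data-use categories (hypothetical labels, not actual ontology terms)."""
    NO_RESTRICTION = auto()        # broad consent, any research use
    RESEARCH_ONLY = auto()         # non-commercial research use only
    NON_PUBLICATION = auto()       # results may not be published
    PUBLICATION_REQUIRED = auto()  # results must be made public


@dataclass
class DatasetPolicy:
    dataset_id: str
    conditions: set = field(default_factory=set)


@dataclass
class AccessRequest:
    researcher: str
    dataset_id: str
    commercial: bool
    will_publish: bool


def evaluate(request: AccessRequest, policy: DatasetPolicy) -> bool:
    """Check a researcher's request against the use conditions attached to a data set."""
    c = policy.conditions
    if UseCondition.NO_RESTRICTION in c:
        return True
    if UseCondition.RESEARCH_ONLY in c and request.commercial:
        return False
    if UseCondition.NON_PUBLICATION in c and request.will_publish:
        return False
    if UseCondition.PUBLICATION_REQUIRED in c and not request.will_publish:
        return False
    return True


# Example: a data set shared on condition that research results are made public.
policy = DatasetPolicy("study-42", {UseCondition.PUBLICATION_REQUIRED})
print(evaluate(AccessRequest("dr_lee", "study-42", commercial=False, will_publish=True)))   # True
print(evaluate(AccessRequest("dr_kim", "study-42", commercial=False, will_publish=False)))  # False
```

The point of such machine-readable conditions is that access decisions can be applied consistently and audited, rather than negotiated anew for every request.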
Other examples of data trust application include improving algorithms to detect fraud in financial services; clarifying insurance data; enhancing the precision and speed of medical diagnosis; and creating new digital health solutions, in general.
The key role of the data trust is that it allows the liberation of the data for the collective good, without harm to individual rights.
Data Trusts as Privacy Safeguards
Data trusts have been created precisely as “a way to broker privacy rights,” as Sean McDonald wrote in his 2019 article “Reclaiming Data Trusts.” The idea is to liberate data, essential to innovation, through a governance framework to ensure its protection. The Open Data Institute’s report Data Trusts: Lessons from Three Pilots explores the various modalities through which a data trust can provide privacy safeguards. For example, as with the EGA, the data trust can regulate access on the basis of granular consent options for usage. De-identified data can be further protected from the risk of re-identification through its transfer under the data trust, benefitting from both the stewardship of the data trust and the detachment from its source.
A data trust’s privacy stewardship is essentially guided by privacy law. Upon receiving the data, the trust assesses its admissibility: If it is used in its identified form, is it supported by valid consent, meaning consent given with clarity on purposes and sharing, and accompanied by a consent withdrawal mechanism? If it is presented as de-identified, does it meet the legal criteria for de-identification? Once admitted into the database, identified data is pseudonymized, severing the identifiers from the substantive data so effectively that no identification is possible without the basic matrix that reconciles pseudonyms with identity. Because pseudonymized data remains personal data, it is made accessible only according to the consent modalities attached to it, through a system of access privileges on the model of the EGA Data Use Ontology. A system of access controls through passwords and authentication, monitored through audit trails, ensures compliance. Technological support is, of course, of critical importance.
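As a rough illustration of the stewardship steps described above, the sketch below severs direct identifiers from the substantive data, keeps the reconciliation matrix apart, and gates access on the consent modalities attached to each record. All names and fields are hypothetical, and a real data trust would rely on hardened infrastructure rather than an in-memory script.

```python
import secrets

# Direct identifiers to sever from the substantive data (illustrative list).
IDENTIFIERS = {"name", "email", "health_card_no"}

pseudonym_matrix = {}   # pseudonym -> identifiers, held separately under strict control
research_store = {}     # pseudonym -> substantive data, what researchers may access
consent_registry = {}   # pseudonym -> purposes the individual consented to


def admit_record(record, consented_purposes):
    """Sever identifiers from substantive data and file each under a random pseudonym."""
    pseudonym = secrets.token_hex(8)
    pseudonym_matrix[pseudonym] = {k: v for k, v in record.items() if k in IDENTIFIERS}
    research_store[pseudonym] = {k: v for k, v in record.items() if k not in IDENTIFIERS}
    consent_registry[pseudonym] = set(consented_purposes)
    return pseudonym


def access(pseudonym, purpose):
    """Release substantive data only for a purpose covered by the attached consent."""
    if purpose in consent_registry.get(pseudonym, set()):
        return research_store.get(pseudonym)
    return None  # purpose not consented to; access denied


p = admit_record(
    {"name": "A. Tremblay", "email": "a@example.org", "age": 54, "diagnosis": "T2D"},
    consented_purposes={"diabetes_research"},
)
print(access(p, "diabetes_research"))   # {'age': 54, 'diagnosis': 'T2D'}
print(access(p, "marketing"))           # None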
The point is that the purpose of the data trust is to make the data available for innovation. If the data is identified, it can only be used with consent. The Alan Turing Institute takes the approach of “donating personal data for research.” If the data is de-identified, it may be used without consent, provided the necessary safeguards are applied to curate its access and use through an independent and binding legal structure.
Making a Data Trust Work
Data trusts are legally binding structures created by law or by contract. They take the form appropriate to a specific application. A data trust may be an entity, a governance framework or a protocol, as appropriate to its purpose, allowing the enhanced sharing of personal data, or of data derived from personal data, under enhanced privacy protection. The approach has come to be known as using “data for good.” Through the legal structure, one party authorizes another to make decisions about data access and data use on behalf of data users, the users being the wider group of stakeholders seeking to rely on the data for innovation in the public interest. While the “legal structure” is not defined and can take any form that is appropriate, it must be structured to exercise independent stewardship of the data, with decision-making powers over the participating organizations. It serves to create a safe space for using and sharing personal and derived data for innovation in the public interest, while imposing all necessary privacy safeguards.
In their 2018 article “What Is a Data Trust?” Bianca Wylie and Sean McDonald proceed cautiously when assessing the value of data trusts. They venture that “when used for governance, data trusts can steward, maintain and manage how data is used and shared — from who is allowed access to it, and under what terms, to who gets to define the terms, and how. They can involve a number of approaches to solving a range of problems, creating different structures to experiment with governance models and solutions in an agile way.”
The Open Data Institute’s report on pilot data trusts distills the conditions for their success:
- data trusts must not be created on the basis of trust law, which is ill-suited to this purpose;
- the concept of data trusts may be inspired by the legal notion of trusts used to manage common assets, but it is a metaphor, not a literal application;
- the stewardship structure of the data trusts should be validated by the privacy regulators;
- to be a trusted steward, a data trust needs power over the data users (that is, the entities or persons having access to the data held in the trust); and
- a data trust must first be built as a governance structure before the technological options to support it are chosen.
Accounts of practical experience in setting up a data trust, specifically in the private sector, reveal the following steps to take for success:
- Develop the ethical and legal framework that will govern the data sharing, integrating applicable law and relevant ethical considerations, such as fairness in access and use rights, acceptable purposes or, as applicable, algorithmic transparency to counter algorithmic bias.
- Establish common aspirations for the data sharing: What is the purpose? What do we want to accomplish? Who do we want to serve? What social value are we pursuing?
- Agree upon legally binding rules and obligations that assert the independence of the trust through decision-making powers over the members and that dictate the specific rules of access, use and sharing of data.
- Constitute a governance framework that corresponds to the minimal sharing necessary for the purposes of the trust. Where the data must be accessible broadly, on a case-by-case basis of accreditation, the data will most likely have to be centralized, with strict access controls. In this case, the data trust may take the form of a separate entity. Where the data is meant to be shared only by a reduced number of member organizations, a federated architecture can leave the data within each member organization to be accessible by the other member organizations, according to the agreed legal framework. In this case, the data trust may take the form of a protocol or a contract. In both cases, the governance framework must clearly identify accountability and provide independent decision-making powers to the trust, binding upon the members.
- Apply appropriate technology to safeguard the data, addressing the level of sensitivity of the data and corresponding to the privacy risks of data interoperability. Blockchain is most often mentioned.
- Finally, adopt transparency mechanisms through auditing and activity reports providing statistics on databases, data use, access controls, outcomes, security threats and, as applicable, breaches (a minimal sketch of such an audit trail follows this list).
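The sketch below illustrates how an audit trail of access decisions could feed the kind of activity report described in the last step. The event fields and report contents are hypothetical, chosen only to show the shape of the mechanism.

```python
from collections import Counter
from datetime import datetime, timezone

audit_log = []  # every access decision is recorded here


def log_event(user, dataset, action, granted):
    """Record an access decision so it can later be audited and reported on."""
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "action": action,
        "granted": granted,
    })


def transparency_report():
    """Aggregate the log into the statistics an activity report would publish."""
    return {
        "total_requests": len(audit_log),
        "denied_requests": sum(1 for e in audit_log if not e["granted"]),
        "requests_by_dataset": dict(Counter(e["dataset"] for e in audit_log)),
    }


log_event("dr_lee", "study-42", "read", granted=True)
log_event("dr_kim", "study-42", "read", granted=False)
print(transparency_report())
```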
As data trusts emerge around the world, supporting innovation with governance structures that mitigate the privacy risks, Canada must focus on the legal space for innovation and privacy.
The Open Data Institute suggests that the concept of data trusts be further developed through regulatory sandboxes, which leads us to the second emerging governance instrument to foster innovation.
The Other Legal Space for Innovation: Regulatory Sandboxes
Currently, the concept of the regulatory sandbox is mostly used in the fintech industry, where it is defined either broadly, as a mechanism to develop regulation that keeps up with innovation, or more specifically, as a framework set up by a regulator that allows organizations to conduct live experiments in a controlled environment without fear of enforcement action.
In Canada, the Canadian Securities Administrators (CSA) have implemented a regulatory sandbox to create that “safe space,” so to speak, for fintech. The process allows the organization to submit innovative business models to the securities regulator and to discuss the applicable securities law issues, as well as the requirements for compliance.
The United Kingdom’s ICO Regulatory Sandbox is a service that supports all organizations in developing products and services using personal data for innovation in full respect of privacy rights. The organization engages the ICO “Sandbox team” to draw upon its expertise on mitigating risks and embedding privacy safeguards. Together, the ICO Sandbox team and the organization develop a common understanding of the regulatory framework and the necessary mechanisms for accountability. Beyond the benefit of improved products and services for the public, the sandbox exercise contributes to a dynamic economy and assists the regulator in its guidance of business.
Circling back to data trusts, the regulatory sandbox model could be used to establish a data trust with guidance from privacy regulators. The question is whether Bill C-11 creates the conditions for the innovation framework.
Bill C-11 Could Take Us Closer to Fostering Innovation While Protecting Privacy, but Doesn’t
Bill C-11 constitutes great progress in Canadian privacy law, but has one significant fault: it restricts the use and disclosure of de-identified data by treating it as personal data, essentially leaving too little room for any use of data about humans to foster innovation.
Bill C-11 allows the disclosure of de-identified data without knowledge or consent only to specified public institutions, or to an organization mandated by such an institution, and for a “socially beneficial purpose.” A “socially beneficial purpose” is defined as “a purpose related to health, the provision or improvement of public amenities or infrastructure, the protection of the environment or any other prescribed purpose.” This limitation, a setback from current Canadian privacy law, raises a few questions: Why are only certain public institutions trusted with pursuing “socially beneficial purposes,” when we know that so much innovation in the public good comes from the private sector? Why are socially beneficial purposes so categorized rather than left to an aspirational principle that would encompass new purposes as needs dictate? Why, given the admission in the definition of “de-identify” that the data “could not be used in foreseeable circumstances” to identify an individual, is that not sufficient to allow its release in favour of preponderant public interests? Why would the criminalization of re-identification in Bill C-11, subject to fines of up to $25 million, not be enough to loosen the noose on de-identified data? The corollary requirement of consent raises more questions: Whose consent, since the data is de-identified? If consent was possible, how could it be valid in the face of so many new purposes that would arise through research?
Confining the use of de-identified data to such narrow parameters overstates the risk of re-identification, understates the need for de-identified data in innovation, and ignores the potential of data trusts, independent and lawfully constituted, to address the remote residual privacy risks of using de-identified data.
Bill C-11 would put an end to necessary current practices of using and disclosing de-identified data for system improvement and innovation. It would cripple innovation in Canada because it essentially locks away all data derived from humans, even where it has undergone “technical processes to ensure that the information does not identify an individual or could not be used in reasonably foreseeable circumstances, alone or in combination with other information, to identify an individual,” to use its own terms. It would also displace innovation from Canada to the European Union and to the United States, where de-identified data is liberated.
The cautious approach to responsible innovation, which accepts the view of technologists that, given the sheer volume of personal data and the power of analytics, there is no totally anonymous data, does not restrict the use of de-identified data to merely a few public institutions, as if they had a monopoly on “socially beneficial purposes.” Rather, the cautious approach should be to support the creation of governance structures that protect data, whether identified, pseudonymized or de-identified, in accordance with the reality of privacy risks, weighed against the need for innovation for good.
It is precisely the role of data trusts to enhance the sharing of personal or de-identified information through a governance structure and technological safeguards to, as the Open Data Institute puts it, “unlock the value of data while preventing harmful impact.”
To proceed with careful attention to the protection of the right to privacy, the design of data trusts could be the object of regulatory sandbox exercises where the rights of each of us are protected for the good of all of us. To foster innovation, de-identified data cannot be locked in a narrow exchange with just a few public institutions.