Last month, 70 health care, research and disease advocacy groups announced the formation of a “global alliance” to foster the responsible sharing of genomic and clinical data. The groups made their announcement with the release of a 34-page white paper, which outlines the opportunity and scope of the proposed alliance.
The global alliance aims to tackle a problem we have never faced before: millions of people will soon have their genomes sequenced, yet we have no standards for sharing genomic or clinical data. With widely adopted standards, researchers could mine integrated data sets to make novel scientific discoveries and improve human health.
To capitalize on the opportunity, the global alliance will be established as a not-for-profit association (modeled after the World Wide Web Consortium, or W3C) that will develop common, open and ethically responsible standards for sharing genomic and clinical data. Multiple prominent organizations are also behind the effort, including the Broad Institute, the American Association for Cancer Research (AACR), Cancer Research UK, the U.S. National Institutes of Health and the Wellcome Trust Sanger Institute.
At this point, each of the member organizations has signed a “letter of intent” to join the global alliance and create the not-for-profit association. However, it’s important to note that the letters are “non-binding”, meaning that no organization has actually agreed to share any data or adopt any specific protocol for sharing. There is no shared database anywhere. And, despite the considerable press (New York Times, Bloomberg, Guardian, and Nature News), and even a White House commendation, the alliance has not yet exchanged a single byte of data.
For now, the significant event is that all these organizations agree that there is an opportunity, even if they may not (yet) agree on a solution. There is also no other single body (national or international) tackling these issues, and creating a forum where ideas can be freely exchanged will be critically important.
As the alliance gets off the ground, it will need to contend with multiple conflicting challenges. The most prominent of these challenges are outlined below.
Challenge 1: Sample Size
According to the global alliance white paper, “by aggregating and analyzing large amounts of genomic and clinical data, it should be possible to discover patterns that would otherwise remain obscure… data from millions of samples will be needed”.
Another recent white paper from David Haussler et al. has proposed the creation of a 1 million cancer genome warehouse, arguing that without such a “large, aggregated database we lack statistical power.” The “Big Data” movement has also had a considerable impact on the debate regarding sample size. A paper by Google researchers on “The Unreasonable Effectiveness of Data” – which has become an oft-cited manifesto for “Big Data” – makes a convincing case that “web-scale” data sets enable new computational insights, and that “simple models and lots of data trump more elaborate models based on less data.”
Nonetheless, genomic data does not come without risks (see next section), and it is therefore reasonable to ask how many samples we need to share in order to make meaningful progress against specific human diseases.
In the realm of cancer genomics, we also have at least eight years of initial data points. For example, next year, the Cancer Genome Atlas (TCGA) project wraps up its goal of profiling 20 cancer types with at least 500 samples each. Despite the herculean efforts of TCGA, critics have already started to make the case that it and other large-scale sequencing efforts have actually yielded surprisingly few clinically actionable results. For example, Michael Yaffe of MIT has argued that large-scale sequencing has been “pretty disappointing”, and that we have “learned little regarding cancer treatment that we did not already know.” Others, including Bert Vogelstein of Johns Hopkins University, have argued that cancer genomics is “plateauing”, and that the same cancer drivers “keep being ‘rediscovered’ in different cancer types.” Even several members of the NCI’s own Board of Scientific Advisors, many of whom have been strong proponents of TCGA, have spoken of “diminishing returns” from more cancer genomes.
As discussed in a previous blog post, not everyone agrees with these assessments. But it is a critically important debate, because individual patients and society at large need to make a calculated trade-off between potential rewards and risks. One of the first main challenges of the global alliance is therefore to clearly identify areas of human health where the sharing of integrated data sets can result in meaningful medical progress. Ideally, the alliance can help prioritize specific areas of human health, quantify sample sizes where progress can be made, and work to integrate this information into the informed consent process.
Without clear priorities and quantitative constraints, the global alliance could commit itself to a large-scale dragnet approach to integrating genomes, with very little hope of medical progress but a very real risk to genomic privacy.
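One way to make the sample-size question concrete is a standard power calculation. The sketch below is illustrative only and is not drawn from the white paper or from any of the studies cited here. It assumes a simple two-group comparison of how often a variant appears, and it uses the textbook normal-approximation formula for comparing two proportions. The function name samples_per_group, the example frequencies, and the defaults of a 5% significance level and 80% power are all assumptions chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def samples_per_group(p_case: float, p_control: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate samples needed in EACH group to detect a difference
    between two proportions with a two-sided z-test.

    Hypothetical helper: uses the normal approximation and ignores
    multiple-testing correction, so treat the result as a lower bound.
    """
    if not (0 < p_case < 1 and 0 < p_control < 1) or p_case == p_control:
        raise ValueError("proportions must lie in (0, 1) and differ")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p_case + p_control) / 2               # pooled proportion

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_case * (1 - p_case) + p_control * (1 - p_control))
    ) ** 2
    return ceil(numerator / (p_case - p_control) ** 2)


if __name__ == "__main__":
    # Assumed, purely illustrative variant frequencies (cases vs. controls).
    for p_case, p_control in [(0.10, 0.05), (0.03, 0.02), (0.011, 0.010)]:
        n = samples_per_group(p_case, p_control)
        print(f"{p_case:.3f} vs {p_control:.3f}: about {n} samples per group")
```

The required sample size grows with the inverse square of the difference being detected, so rare variants with small effects demand far larger cohorts than common ones. Real genomic studies must also correct for testing many genes at once, which pushes the numbers higher still. A calculation of this kind is one way the alliance could tie a proposed data-sharing effort to a specific, quantified goal.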
Challenge 2: Privacy
Regardless of how you may feel about Edward Snowden and recent revelations regarding the NSA, the “Snowden affair” has fundamentally altered the debate regarding personal information and privacy within this country.
It has done so in two specific ways. First, it has raised the specter of people using your personal data (genomic or otherwise) in ways you never imagined, causing a segment of the public to grow increasingly suspicious of large-scale data integration efforts. Second, it has reinforced the reality that no system, even the NSA, is ever 100% secure. In fact, insiders may represent the largest security threat, and Steven Brenner of U.C. Berkeley has recently pondered how long we have until the genomics community faces its own Snowden: “an idealistic and technically literate researcher” who deliberately “releases genome and trait information publicly in the name of science.” 
Even before the Snowden affair, however, genomic privacy was already set to be the greatest challenge faced by the global alliance. The challenge arises because there is an inherent and perhaps insurmountable conflict between the sharing of genomic data and the protection of patient privacy. As one commentator in PLoS Genetics has written: “making data available to many intelligent minds maximizes the likelihood that benefits of research will rapidly be returned to society, but also maximizes opportunity for breaching the duty of privacy to research participants.”
There are also some, including, most famously, George Church of Harvard, who believe that it is futile and even dishonest to promise participants that their genomic data can be kept under lock and key. And recent efforts at patient genome re-identification back up his point.
As Steven Brenner has recently outlined, there are two extreme and logically coherent solutions to the problem of ensuring patient privacy. The first is to take the path currently being followed by the Personal Genome Project and incorporate or at least anticipate data release from the outset; the second is “lock genomes down so tightly that they are virtually impossible to steal.” 
The informed consent process pioneered by the Personal Genome Project has a certain appeal – “subjects are recruited and consented based on the expectation of full public data release”, and every participant has to pass a test to ensure that they truly understand what they are getting themselves into. But, as I outline in the next challenge, this process also has its inherent problems, and it is doubtful that it can realistically work on a very large scale. By contrast, the second option would obviously work to maximize patient privacy, but the barriers to research could be so substantial as to thwart progress.
The global alliance is therefore left to try to stake out a murky but critical middle ground. Doing so is likely to invite criticism from multiple fronts, but there is a large swath of middle-ground ideas that the alliance can and should evaluate: uniform, possibly mandated, training for all researchers regarding genomic privacy; the development of an ethical code of conduct for data scientists who wish to access shared repositories; evaluation of cloud-based options which record all activities and provide computational time, but prevent the downloading of data sets; open and uniform security procedures and audits for all institutions handling shared genomic data; and development of rapid mechanisms for notifying participants of genomic leaks.
Lastly, the breaching of genomic privacy is still largely uncharted legal territory. In the U.S., citizens are ostensibly protected by the Genetic Information Nondiscrimination Act (GINA) of 2008. Nonetheless, many experts have expressed their concern that GINA does not go far enough, and that release of genomic data “might hurt people’s employability, insurability or even personal relationships.” Of course, loss of privacy is in and of itself a harm – just imagine if all your emails were suddenly made public. Beyond data protocols, the global alliance can therefore serve as an important voice to strengthen existing genomic discrimination laws and, furthermore, help clarify laws regarding the deliberate leakage of genomic information, such that offenders can be successfully prosecuted.
Challenge 3: Equity
The final big challenge likely to face the global alliance is equity: how, specifically, to ensure that the fruits of genomic research flow to all segments of society. I readily admit that the open, informed consent pioneered by the Personal Genome Project (PGP) has a certain logical, ethical and honest appeal. According to the PGP, there is no way to ensure genomic confidentiality, and PGP participants can only enroll with this full understanding. According to the original proposal, “initial participants should be diverse, yet very familiar with research on human subjects, genetics, information technology and ELSI [Ethical, Legal, and Social Implications].” Would-be participants even need to take a test and score 100% to prove their understanding.
As logically alluring as it sounds, the Personal Genome Project is akin to the Platonic ideal: great in theory, but one in which only the “Philosopher Kings” are eligible to participate. Furthermore, as Francis Collins, now director of the NIH, has previously argued, “[projects, such as the PGP] might actually introduce bias or focus specifically on people with … a lot of resources because those are often the people who are least worried about the [genomic] discrimination issue.” The PGP has even openly admitted to its bias in subject participation, noting in a recent blog post that “we don’t have a very balanced set of participants… to put it bluntly, that means we mostly end up with young white men.”
This is not just about participation. It is also about equity of medical benefit. If we focus on genomically profiling only one segment of society, there is a real possibility that medical benefits likewise only flow back to these same people. We already have enough inequities in U.S. healthcare. Let’s not compound the problem by reserving genomics for a select few.
Much Work Ahead
As David Altshuler of the Broad Institute, one of the principal organizers of the global alliance, has already acknowledged, there are “years of work ahead.” None of these issues are easy, all are critically important, and this is just the beginning. Let the debate begin.
Global Alliance, Creating a Global Alliance to Enable Responsible Sharing of Genomic and Clinical Data.
David Haussler, David A. Patterson, Mark Diekhans, Armando Fox, Michael Jordan, Anthony D. Joseph, Singer Ma, Benedict Paten, Scott Shenker, Taylor Sittler and Ion Stoica, A Million Cancer Genome Warehouse.
Alon Halevy, Peter Norvig, and Fernando Pereira (Google), The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, Volume 24 Issue 2, March 2009, Pages 8-12.
Michael B. Yaffe, The Scientific Drunk and the Lamppost: Massive Sequencing Efforts in Cancer Discovery and Treatment. Sci. Signal., 2 April 2013.
Bert Vogelstein, Nickolas Papadopoulos, Victor E. Velculescu, Shibin Zhou, Luis A. Diaz Jr., Kenneth W. Kinzler, Cancer Genome Landscapes. Science, 29 March 2013.
Jocelyn Kaiser, Ready for More 10,000 Cancer Genomes Projects? Science Insider. 6 March 2013.
Steven E. Brenner, Be prepared for the big genome leak. Nature. 12 June 2013.
P3G Consortium, Church G, Heeney C, Hawkins N, de Vries J, et al. (2009) Public Access to Genome-Wide Data: Five Views on Balancing Research with Privacy and Protection. PLoS Genet 5(10).
For recent examples, see: Homer N, et al., Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays, PLoS Genetics, 2008 Aug 29;4(8); and Gymrek M, et al., Identifying personal genomes by surname inference. Science, 2013 Jan 18;339.
George Church, The Personal Genome Project, Molecular Systems Biology 2005.
Erika Check Hayden, Privacy protections: The genome hacker, Nature. 9 May 2013.
Francis Collins, NIH Town Hall Meeting, December 2006, as quoted in Misha Angrist, Eyes wide open: the personal genome project, citizen science and veracity in informed consent. Future Medicine, 2009.
Madeleine Price Ball, Seeking Diversity (Especially Families). Blog of the Personal Genome Project.
Jocelyn Kaiser, Q&A: David Altshuler on How to Share Millions of Human Genomes. Science Insider. 7 June 2013.