Crowdsourcing Genomes: Why Open Source is Vital for Future Genetic Research
By Siddharth Reed || January 26 2018
"The hardest problems of pure and applied science can only be solved by the open collaboration of the world-wide scientific community."
- Kenneth G. Wilson
Science has always been at its best when it is collaborative. The Human Genome Project was an international collaboration among researchers to map all of the genes in the human genome. It was meant to give scientists a better understanding of which genes exist in humans and how they affect disease. It was led by American geneticist Francis Collins, with support from the U.S. Department of Energy and the National Institutes of Health (NIH) (6). The researchers were able to map over 70% of the reference genome from a single donor, but to study DNA variation they required over 270 donors (7). One of the most important parts of the project was that all of the data was made publicly available for free over the internet, so anyone with a computer could access the whole project (8). As scientists attempt to elucidate the mysteries of the human genome, increasing amounts of sequence data will be required to make statistically valid and replicable discoveries. To meet this growing need for data, scientists and corporations need to embrace the philosophy of open source.
What is an open source philosophy? The term refers to data or software that is available to anyone. Any individual can use or modify open source software or data in whatever way they choose. For example, Mozilla Firefox is an open source web browser: anyone can view and contribute to its source code (all of the code needed to build Firefox) and modify it to suit their own needs (2). BioMart is an open source database network that makes data on DNA sequences, DNA activity and protein structure, among many other things, accessible for free to anyone (1). Bioconductor is an open source software platform for bioinformatic data and software (3). All of these tools and data sets are available for free to anyone who wants to use them, provided they have access to a computer. They all foster communities of people who not only use these projects for their informational value, but also contribute to them in whatever manner they can, either by adding data to a database or by adding features and improving the functionality of software tools.
As computational power increases and more scientists become trained in informatics technologies, the availability of good data becomes the bottleneck to scientific progress. Therefore, not only the collection but also the sharing and accessibility of data should grow on par with bioinformatics as a field. Selection bias, the bias that results from a sample that is not representative of the population being studied, becomes much more apparent and damaging when there is not enough data to avoid it. The more data that is available to researchers, the better they are able to avoid bias in their resulting analysis. Thanks to the internet, vast amounts of data can be shared almost instantaneously. People can collaborate all over the world with access to the exact same results and tools, with very little apprehension about experimental error and reproducibility. Individual labs and universities often cannot generate enough data themselves to provide insights into poorly understood genetic disorders (5). Having access to the sequences of potentially every person diagnosed would be an enormous stride in determining the root of these diseases (5). Some diseases are so rare and complex that millions of genomes are needed to even begin to understand their genetic basis (5).
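To make the idea of selection bias concrete, here is a minimal Python sketch. The two groups, their sizes and their carrier counts are invented purely for illustration and do not come from any of the cited sources. It compares the variant frequency estimated by a lab that can only sample one group with the estimate obtained when data from both groups is pooled.

```python
# Hypothetical population: two groups that carry some genetic variant at
# different rates. All numbers here are made up for illustration.
population = {
    "group_a": {"people": 500_000, "carriers": 50_000},
    "group_b": {"people": 500_000, "carriers": 150_000},
}


def pooled_frequency(groups):
    """Fraction of carriers across the given groups, weighted by group size."""
    total_people = sum(group["people"] for group in groups)
    total_carriers = sum(group["carriers"] for group in groups)
    return total_carriers / total_people


# Frequency in the whole (hypothetical) population.
true_frequency = pooled_frequency(list(population.values()))

# A single lab that only has access to group A sees a skewed picture.
single_lab_estimate = pooled_frequency([population["group_a"]])

# Pooling shared data from both groups recovers the population value.
shared_data_estimate = pooled_frequency(
    [population["group_a"], population["group_b"]]
)

print("whole population:", true_frequency)
print("group A only:    ", single_lab_estimate)
print("shared data:     ", shared_data_estimate)
```

In this toy setup the single-lab estimate is off only because of who was sampled, not because of how many people were sampled. That is why sharing data across labs and populations matters as much as collecting more of it.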
Bioinformatic research is only as good as the information that is available. There are still hurdles to cross in this global sharing of data. Genetic sequencing data is vast: a single genome can be over 80 GB (4). Transmitting sizeable samples for bioinformatic analysis is not easy, but it is more efficient than sending hard drives across the world via courier. Ensuring that patient privacy is not violated is another challenge when data is drawn from so many different sources. Patients must be informed about what sharing their genetic sequences involves, and about how many people can have access to them.
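For a rough sense of scale, the back-of-the-envelope calculation below estimates how long one 80 GB genome would take to send over a network. The link speeds are assumptions chosen for illustration, and the sketch ignores protocol overhead, compression and shared bandwidth, so the results are best-case figures.

```python
GENOME_SIZE_GB = 80  # approximate size of one genome, as quoted above (4)


def transfer_hours(size_gb, link_mbps):
    """Hours needed to move size_gb gigabytes over a link_mbps megabit/s link."""
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    seconds = size_megabits / link_mbps
    return seconds / 3600


# Assumed link speeds in megabits per second (not taken from the article).
for link_mbps in (10, 100, 1000):
    hours = transfer_hours(GENOME_SIZE_GB, link_mbps)
    print(f"{link_mbps:>5} Mbit/s: {hours:.2f} hours per genome")
```

Multiplying the per-genome figure by the thousands or millions of genomes a study may need shows why transfer and storage remain real hurdles.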
It becomes more important than ever that integrity, accuracy and privacy be maintained in the collection of biological data, something that is already the focus of many wet labs. When it comes to biological data, sharing really is caring.
References
1. https://academic.oup.com/database/article/doi/10.1093/database/bar041/467299/BioMart-Central-Portal-an-open-database-network
2. https://www.mozilla.org/en-US/mission/
3. https://www.bioconductor.org/
4. http://massgenomics.org/2014/11/brace-yourself-for-large-scale-whole-genome-sequencing.html
5. https://www.technologyreview.com/s/535016/internet-of-dna/
6. https://www.britannica.com/event/Human-Genome-Project
7. http://www.nature.com/ng/journal/v37/n7/full/ng1562.html
8. https://report.nih.gov/NIHfactsheets/ViewFactSheet.aspx?csid=45