Big Data in bioinformatics
Matematičeskaâ biologiâ i bioinformatika, Vol. 12 (2017) no. 1, pp. 102-119.

See the article record from the source Math-Net.Ru

Sequencing of the human genome began in 1994. It took 10 years of collaborative work by many teams to obtain a draft of human DNA. Modern sequencing technology makes it possible to read an individual genome in a few days. Advances in modern bioinformatics are tied to the emergence of high-throughput sequencing platforms, which not only expanded the capabilities of biology and related sciences but also gave rise to the phenomenon of big data. The paper substantiates the need for new technologies and methods for organizing the storage, management, analysis, and visualization of big data. Modern bioinformatics faces not only enormous volumes of heterogeneous data but also a huge variety of processing and presentation methods, software tools, and data formats. The paper discusses ways of meeting these challenges, in particular by drawing on achievements from other areas, such as web intelligence and business intelligence. New storage systems, other than relational ones, will help to solve the problem of archiving while keeping search query times acceptable. New programming technologies, namely generic programming and visual programming, can help to overcome the diversity of genomic data formats and let experimenters quickly create data-processing scripts.
@article{MBB_2017_12_1_a0,
     author = {N. N. Nazipova and E. A. Isaev and V. V. Kornilov and D. V. Pervukhin and A. A. Morozova and A. A. Gorbunov and M. N. Ustinin},
     title = {Big {Data} in bioinformatics},
     journal = {Matemati\v{c}eska\^a biologi\^a i bioinformatika},
     pages = {102--119},
     publisher = {mathdoc},
     volume = {12},
     number = {1},
     year = {2017},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MBB_2017_12_1_a0/}
}
N. N. Nazipova; E. A. Isaev; V. V. Kornilov; D. V. Pervukhin; A. A. Morozova; A. A. Gorbunov; M. N. Ustinin. Big Data in bioinformatics. Matematičeskaâ biologiâ i bioinformatika, Vol. 12 (2017) no. 1, pp. 102-119. http://geodesic.mathdoc.fr/item/MBB_2017_12_1_a0/
