Simulation modelling of single nucleotide genetic polymorphisms
Journal of the Belarusian State University. Mathematics and Informatics, Volume 2 (2024), pp. 104-112.

See the article record from the source Math-Net.Ru

We propose an approach to identifying single nucleotide polymorphisms (SNPs) in DNA sequences. It is based on simulation modelling of single-nucleotide sites: random events are generated from beta or normal distributions whose parameters are estimated from the available experimental data. The approach improves the accuracy of SNP identification in DNA molecules and makes it possible to assess the reliability of specific experiments and to estimate the errors in parameters determined under real experimental conditions. The simulation model and the analysis methods are verified on a set of reference human genomic DNA sequencing data provided by the Genome in a Bottle Consortium. Existing statistical SNP identification algorithms are compared with machine learning methods trained on simulated data from genomic sequencing of human DNA molecules. The best results are obtained with machine learning models, whose SNP identification accuracy is 2–5 % higher than that of classical statistical methods.
Keywords: single nucleotide polymorphism; SNP; SNP identification; simulation modelling; machine learning
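
The abstract does not spell out how the beta distribution is parametrised or how sites are generated, so the following is only a minimal sketch of the general idea, under stated assumptions: the alternative-allele fraction at a site is taken to follow a beta distribution, its two parameters are estimated from observed fractions by the method of moments, and read counts at simulated sites are then drawn at a fixed sequencing depth. The function names `fit_beta_moments` and `simulate_sites`, the fixed depth of 30 reads, and the `observed` values are all hypothetical and are not taken from the article.

```python
import random
from statistics import mean, variance


def fit_beta_moments(fractions):
    """Method-of-moments estimates (alpha, beta) for values in (0, 1)."""
    m = mean(fractions)
    v = variance(fractions)  # sample variance
    if not 0.0 < v < m * (1.0 - m):
        raise ValueError("sample moments are incompatible with a beta distribution")
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def simulate_sites(n_sites, alpha, beta, depth, rng):
    """Simulate alternative-allele read counts at n_sites positions.

    Each site draws a latent allele fraction from Beta(alpha, beta); each of
    the `depth` reads then shows the alternative allele with that probability.
    """
    counts = []
    for _ in range(n_sites):
        p = rng.betavariate(alpha, beta)
        counts.append(sum(rng.random() < p for _ in range(depth)))
    return counts


rng = random.Random(42)  # fixed seed so the run is reproducible

# Illustrative, made-up allele fractions standing in for experimental data
# at heterozygous SNP sites (NOT values from the article).
observed = [0.42, 0.55, 0.48, 0.51, 0.46, 0.58, 0.44, 0.53]

alpha_hat, beta_hat = fit_beta_moments(observed)
depth = 30  # assumed constant coverage per site
alt_counts = simulate_sites(1000, alpha_hat, beta_hat, depth, rng)
sim_fractions = [c / depth for c in alt_counts]

print("estimated alpha, beta:", round(alpha_hat, 2), round(beta_hat, 2))
print("mean simulated fraction:", round(mean(sim_fractions), 3))
```

Simulated sites of this kind, generated separately for SNP and non-SNP positions, could serve as labelled training data for a classifier, which is the role the abstract assigns to the simulated data. A normal distribution could be swapped in for the beta draw in the same way.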
@article{BGUMI_2024_2_a8,
     author = {N. N. Yatskou and V. V. Apanasovich and V. V. Grinev},
     title = {Simulation modelling of single nucleotide genetic polymorphisms},
     journal = {Journal of the Belarusian State University. Mathematics and Informatics},
     pages = {104--112},
     publisher = {mathdoc},
     volume = {2},
     year = {2024},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/BGUMI_2024_2_a8/}
}
N. N. Yatskou; V. V. Apanasovich; V. V. Grinev. Simulation modelling of single nucleotide genetic polymorphisms. Journal of the Belarusian State University. Mathematics and Informatics, Volume 2 (2024), pp. 104-112. http://geodesic.mathdoc.fr/item/BGUMI_2024_2_a8/
