Principal components of genetic sequences: correlations and significance
Matematičeskaâ biologiâ i bioinformatika, Volume 16 (2021) no. 2, pp. 299-316.

See the article record from the source Math-Net.Ru

Any numerical series can be decomposed into principal components using singular spectrum analysis. We recently proposed a new analysis method, PCA-Seq, which allows numerical principal components to be calculated for a sequence of elements of any type. In particular, the sequence may be composed of nucleotide base pairs or amino acid residues. Two questions inevitably arise: how to interpret the resulting principal components and how to assess their reliability. To interpret the principal components of a symbolic sequence, it is reasonable to evaluate their correlations with numerical characteristics of the sequence elements. When assessing the significance of correlations between sequences, one should bear in mind that standard significance criteria assume independent observations, an assumption that, as a rule, does not hold for real sequences. The article discusses the use for these purposes of the anchor bootstrap technique, which was also developed previously by the authors. This approach assumes that the objects can be represented by points of a metric space which, taken together, form a fixed structure in it, in particular a sequence. The objects are assigned the same random integer weights as in the classical bootstrap. This is sufficient to obtain the bootstrap distribution of the correlation coefficients and to assess their significance. The coding sequence of the SLC9A1 gene (synonyms APNH, NHE1, PPP1R143) was taken as an example of applying the anchor bootstrap technique to genetic sequence analysis. Significant correlations of the first principal component were found with the hydrophobicity/“transmembraneity” of the corresponding fragments of the amino acid sequence, with their phenylalanine content, and with the difference between the T- and A-content of the corresponding nucleotide fragments. A similar pattern was found earlier by other authors for other genes. It is therefore very likely to be of a more general nature.
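The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the two series being compared (for example, a principal component and a numerical characteristic of the sequence) are already computed and kept fixed in their positions, and it only recomputes a weighted Pearson correlation under random integer weights drawn as multinomial counts, which is how the classical bootstrap can be expressed in terms of weights. The function names `weighted_corr` and `bootstrap_corr`, the toy data, and the percentile interval are illustrative assumptions.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Pearson correlation of x and y with non-negative observation weights w."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    dx = x - np.average(x, weights=w)
    dy = y - np.average(y, weights=w)
    return np.sum(w * dx * dy) / np.sqrt(np.sum(w * dx**2) * np.sum(w * dy**2))

def bootstrap_corr(x, y, n_boot=2000, seed=0):
    """Bootstrap distribution of the correlation using random integer weights.

    Each replicate leaves every element in place and assigns it an integer
    weight: the number of times it would be drawn in a resample of size n.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    weights = rng.multinomial(n, np.full(n, 1.0 / n), size=n_boot)
    return np.array([weighted_corr(x, y, w) for w in weights])

# Toy data (hypothetical): two noisy series that share a common trend.
rng = np.random.default_rng(42)
trend = np.sin(np.linspace(0.0, 6.0, 200))
pc1 = trend + 0.5 * rng.normal(size=200)
feature = trend + 0.5 * rng.normal(size=200)

boot = bootstrap_corr(pc1, feature)
low, high = np.percentile(boot, [2.5, 97.5])
print(f"observed r = {weighted_corr(pc1, feature, np.ones(200)):.3f}")
print(f"95% percentile interval: [{low:.3f}, {high:.3f}]")
```

Under these assumptions, a correlation would be treated as significant when the percentile interval excludes zero. A fuller treatment would also recompute the principal components under each set of weights, which this sketch omits.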
@article{MBB_2021_16_2_a1,
     author = {V. M. Efimov and K. V. Efimov and V. Yu. Kovaleva and Yu. G. Matushkin},
     title = {Principal components of genetic sequences: correlations and significance},
     journal = {Matemati\v{c}eska\^a biologi\^a i bioinformatika},
     pages = {299--316},
     publisher = {mathdoc},
     volume = {16},
     number = {2},
     year = {2021},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MBB_2021_16_2_a1/}
}
TY  - JOUR
AU  - V. M. Efimov
AU  - K. V. Efimov
AU  - V. Yu. Kovaleva
AU  - Yu. G. Matushkin
TI  - Principal components of genetic sequences: correlations and significance
JO  - Matematičeskaâ biologiâ i bioinformatika
PY  - 2021
SP  - 299
EP  - 316
VL  - 16
IS  - 2
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/MBB_2021_16_2_a1/
LA  - ru
ID  - MBB_2021_16_2_a1
ER  - 
%0 Journal Article
%A V. M. Efimov
%A K. V. Efimov
%A V. Yu. Kovaleva
%A Yu. G. Matushkin
%T Principal components of genetic sequences: correlations and significance
%J Matematičeskaâ biologiâ i bioinformatika
%D 2021
%P 299-316
%V 16
%N 2
%I mathdoc
%U http://geodesic.mathdoc.fr/item/MBB_2021_16_2_a1/
%G ru
%F MBB_2021_16_2_a1
V. M. Efimov; K. V. Efimov; V. Yu. Kovaleva; Yu. G. Matushkin. Principal components of genetic sequences: correlations and significance. Matematičeskaâ biologiâ i bioinformatika, Volume 16 (2021) no. 2, pp. 299-316. http://geodesic.mathdoc.fr/item/MBB_2021_16_2_a1/
