Approach to the selection of significant features in solving biomedical problems of binary classification of microarray data

I. Yu. Boiko; D. S. Anisimov; L. L. Smolyakova; M. A. Ryazanov

Matematičeskaâ biologiâ i bioinformatika, Vol. 15 (2020) no. 1, pp. 4-19

This article was harvested from the Math-Net.Ru source.

Abstract

Modern biomedical research on methods for the early diagnosis of cancer makes use of microarrays that carry biological information about patients. Based on these data, each patient is assigned to one of two classes, corresponding to the presence or absence of a given diagnosis. One of the steps with a decisive influence on classification quality is the selection of significant features. This paper proposes a criterion for selecting significant features based on the ledge-coefficient of correlation, which was previously used to estimate the degree of association between numerical and binary features. For two microarray data sets, comparative examples of binary classification are presented using three feature selection algorithms, three dimensionality reduction methods, and six classification models. Feature selection with the ledge-criterion yielded a classification quality comparable to that obtained with common feature selection methods such as the $t$-test and the $U$-test. For the peptide microarray data set considered in the paper, the method of projection to latent structures had previously been shown to be effective. Combining this method with feature selection by the ledge-criterion gave a higher value of the classification quality measure.
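As a rough illustration of the filter-style baselines named in the abstract ($t$-test and $U$-test), the sketch below ranks features of a small synthetic data set by how strongly they separate two classes. It is not the authors' procedure: the ledge-coefficient is not defined in this abstract and is therefore not implemented, and the function names (`welch_t`, `mann_whitney_u`, `rank_features`), the Welch variant of the $t$ statistic, and the synthetic data (40 samples, 50 features, informative features 3 and 17) are all assumptions made for the example.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def mann_whitney_u(a, b):
    """U statistic of sample a: pairs with x > y count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def t_score(pos, neg):
    """Feature score: absolute value of the t statistic."""
    return abs(welch_t(pos, neg))


def u_score(pos, neg):
    """Feature score: distance of U from its null mean n1*n2/2, scaled to [0, 1]."""
    half = len(pos) * len(neg) / 2
    return abs(mann_whitney_u(pos, neg) - half) / half


def rank_features(X, y, score):
    """Return feature indices ordered from most to least discriminative."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(score(pos, neg))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Hypothetical synthetic data: only features 3 and 17 differ between classes.
rng = random.Random(0)
n_samples, n_features, informative = 40, 50, (3, 17)
y = [i % 2 for i in range(n_samples)]
X = [
    [rng.gauss(3.0 if (label == 1 and j in informative) else 0.0, 1.0)
     for j in range(n_features)]
    for label in y
]

top_t = rank_features(X, y, t_score)[:2]
top_u = rank_features(X, y, u_score)[:2]
print("top features by t-test score:", sorted(top_t))
print("top features by U-test score:", sorted(top_u))
```

In a pipeline of the kind the abstract describes, the selected feature indices would then be passed to a dimensionality reduction step and a classifier; a different selection criterion could be compared by supplying another scoring function to `rank_features`.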

How to cite

@article{MBB_2020_15_1_a2,
     author = {I. Yu. Boiko and D. S. Anisimov and L. L. Smolyakova and M. A. Ryazanov},
     title = {Approach to the selection of significant features in solving biomedical problems of binary classification of microarray data},
     journal = {Matemati\v{c}eska\^a biologi\^a i bioinformatika},
     pages = {4--19},
     year = {2020},
     volume = {15},
     number = {1},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MBB_2020_15_1_a2/}
}

TY  - JOUR
AU  - I. Yu. Boiko
AU  - D. S. Anisimov
AU  - L. L. Smolyakova
AU  - M. A. Ryazanov
TI  - Approach to the selection of significant features in solving biomedical problems of binary classification of microarray data
JO  - Matematičeskaâ biologiâ i bioinformatika
PY  - 2020
SP  - 4
EP  - 19
VL  - 15
IS  - 1
UR  - http://geodesic.mathdoc.fr/item/MBB_2020_15_1_a2/
LA  - ru
ID  - MBB_2020_15_1_a2
ER  -

%0 Journal Article
%A I. Yu. Boiko
%A D. S. Anisimov
%A L. L. Smolyakova
%A M. A. Ryazanov
%T Approach to the selection of significant features in solving biomedical problems of binary classification of microarray data
%J Matematičeskaâ biologiâ i bioinformatika
%D 2020
%P 4-19
%V 15
%N 1
%U http://geodesic.mathdoc.fr/item/MBB_2020_15_1_a2/
%G ru
%F MBB_2020_15_1_a2

I. Yu. Boiko; D. S. Anisimov; L. L. Smolyakova; M. A. Ryazanov. Approach to the selection of significant features in solving biomedical problems of binary classification of microarray data. Matematičeskaâ biologiâ i bioinformatika, Vol. 15 (2020) no. 1, pp. 4-19. http://geodesic.mathdoc.fr/item/MBB_2020_15_1_a2/

