Cleaning data sets with diagnostic errors in the high-dimensional feature spaces
Matematičeskaâ biologiâ i bioinformatika, Volume 14 (2019) no. 2, pp. 464-476.

See the article record from the source Math-Net.Ru

The paper proposes a new approach to data censoring that corrects diagnostic errors in data sets whose objects are described in high-dimensional feature spaces. This case is treated as a separate task because most outlier detection and data filtering methods, both statistical and metric, cease to be effective in high-dimensional spaces. At the same time, in medical diagnostics a large number of descriptive characteristics is the norm rather than the exception, given the complexity of the objects and phenomena studied. To solve this problem, the authors propose an approach that focuses on local similarity between objects of the same class and uses the function of rival similarity (FRiS function) as the similarity measure. To clean the data of misclassified objects efficiently, the approach selects the most informative and relevant low-dimensional feature subspace, in which the separability of the classes after correction is maximal. Class separability here means that objects of one class are similar to each other and dissimilar to objects of another class. Cleaning the data of class errors can consist either in correcting them or in removing the outlier objects from the data set. The method was implemented as the FRiS-LCFS algorithm (FRiS Local Censoring with Feature Selection) and tested on model and real biomedical problems, including the diagnosis of prostate cancer from DNA microarray data. The algorithm proved competitive with standard methods for filtering data in high-dimensional spaces.
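The abstract names the FRiS function but does not state its formula. As an illustration only, the sketch below uses the form commonly given in the literature on rival similarity, F = (r_rival - r_own) / (r_rival + r_own), where r_own is the distance from an object to its nearest neighbour in its own class and r_rival is the distance to its nearest neighbour in any other class. The function names `fris` and `fris_scores`, the Euclidean metric, and the toy data are assumptions made for this example; this is not the FRiS-LCFS algorithm itself, which additionally selects a feature subspace.

```python
import numpy as np

def fris(r_own: float, r_rival: float) -> float:
    """Rival similarity in [-1, 1]; values near -1 mean the object
    lies much closer to a rival class than to its own class."""
    total = r_own + r_rival
    if total == 0.0:
        return 0.0  # object coincides with both neighbours: undecided
    return (r_rival - r_own) / total

def fris_scores(X, y):
    """Leave-one-out FRiS score of every object (hypothetical helper).

    Assumes at least two classes and at least two objects per class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all objects.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # an object is not its own neighbour
    scores = np.empty(len(X))
    for i in range(len(X)):
        same = y == y[i]
        r_own = dist[i, same].min()     # nearest object of the same class
        r_rival = dist[i, ~same].min()  # nearest object of any other class
        scores[i] = fris(r_own, r_rival)
    return scores

# Toy data: the last object carries label 1 but sits inside class 0.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [0.5, 0.5]]
y = [0, 0, 0, 1, 1, 1]
scores = fris_scores(X, y)
suspect = int(np.argmin(scores))  # object with the lowest rival similarity
```

In this toy sample the object with the lowest score is the one whose label disagrees with its neighbourhood, which makes it a candidate for relabelling or removal. How such candidates are chosen and in which feature subspace is the subject of the paper and is not reproduced here.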
@article{MBB_2019_14_2_a1,
     author = {I. A. Borisova and O. A. Kutnenko},
     title = {Cleaning data sets with diagnostic errors in the high-dimensional feature spaces},
     journal = {Matemati\v{c}eska\^a biologi\^a i bioinformatika},
     pages = {464--476},
     publisher = {mathdoc},
     volume = {14},
     number = {2},
     year = {2019},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MBB_2019_14_2_a1/}
}
I. A. Borisova; O. A. Kutnenko. Cleaning data sets with diagnostic errors in the high-dimensional feature spaces. Matematičeskaâ biologiâ i bioinformatika, Volume 14 (2019) no. 2, pp. 464-476. http://geodesic.mathdoc.fr/item/MBB_2019_14_2_a1/
