The problem of correction diagnostic errors in the target attribute with the function of rival similarity
Matematičeskaâ biologiâ i bioinformatika, Volume 13 (2018) no. 1, pp. 38-49.

See the article record from the Math-Net.Ru source

Outlier detection is an important problem in data mining of biomedical datasets, particularly when the data may contain misclassified objects caused by diagnostic errors made at the data-collection stage. Such objects complicate and slow down processing of the dataset, distort and corrupt the detected regularities, and reduce classification accuracy. We propose a censoring algorithm that detects misclassified objects, after which they are either removed from the dataset or their class attribute is corrected. The correction procedure keeps the analyzed dataset as large as possible, which is especially valuable when analyzing small datasets, where every bit of information can be important. The basic concept of this work is a measure of similarity between an object and its surroundings. To evaluate the local similarity of an object to its closest neighbors, a ternary relative measure called the function of rival similarity (FRiS-function) is used. The mean of the similarity values over all objects in the dataset characterizes class separability: how close objects of the same class are to each other and how far they are from objects of other classes (with a different diagnosis) in the attribute space. Misclassified objects are assumed to be more similar to objects of rival classes than to objects of their own class, so eliminating them from the dataset, or correcting their target attribute, should increase the separability value. The procedure for filtering and correcting misclassified objects is based on observing how the separability estimate changes before and after corrections are made to the dataset. Censoring continues until the inflection point of the separability function is reached. The proposed algorithm was tested on a wide range of model problems of varying complexity, as well as on biomedical problems such as the Pima Indians Diabetes, Breast Cancer, and Parkinson data sets. On these problems the censoring algorithm showed high sensitivity to misclassified objects. The increase in accuracy and the preservation of dataset size after the censoring procedure confirm our basic assumptions and the effectiveness of the algorithm.
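The sketch below in Python illustrates the two ingredients described in the abstract under simplifying assumptions: the FRiS value of an object z with nearest same-class neighbour a and nearest rival-class neighbour b is taken as (d(z,b) - d(z,a)) / (d(z,b) + d(z,a)), ranging from -1 (looks like a rival class) to +1 (looks like its own class), and separability is taken as the mean FRiS value over the dataset. It uses Euclidean distances, greedy relabelling, and a simple "stop when separability stops growing" rule instead of the inflection-point criterion, so it is only an illustration of the idea, not the authors' exact algorithm; the names fris_values and censor are illustrative.

import numpy as np

def fris_values(X, y):
    # Pairwise Euclidean distances; the diagonal is excluded (leave-one-out).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    F = np.empty(len(X))
    for i in range(len(X)):
        same = (y == y[i])
        d_own = D[i, same].min()        # nearest neighbour of the same class
        d_rival = D[i, ~same].min()     # nearest neighbour of a rival class
        F[i] = (d_rival - d_own) / (d_rival + d_own)
    return F

def censor(X, y):
    # Greedily relabel the most "rival-like" object (most negative FRiS value)
    # as long as the mean FRiS value (separability) keeps increasing.
    y = y.copy()
    best = fris_values(X, y).mean()
    corrected = []
    while True:
        F = fris_values(X, y)
        i = int(F.argmin())
        if F[i] >= 0:                   # no object looks misclassified any more
            break
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        d[y == y[i]] = np.inf
        y_try = y.copy()
        y_try[i] = y[d.argmin()]        # tentatively assign the nearest rival's class
        new = fris_values(X, y_try).mean()
        if new <= best:                 # separability stopped improving: stop censoring
            break
        y, best = y_try, new
        corrected.append(i)
    return y, corrected

For example, censor(X, y) returns the corrected class labels together with the indices of the relabelled objects; removing those objects instead of relabelling them would correspond to the filtering variant mentioned in the abstract.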
@article{MBB_2018_13_1_a10,
     author = {I. A. Borisova and O. A. Kutnenko},
     title = {The problem of correction diagnostic errors in the target attribute with the function of rival similarity},
     journal = {Matemati\v{c}eska\^a biologi\^a i bioinformatika},
     pages = {38--49},
     publisher = {mathdoc},
     volume = {13},
     number = {1},
     year = {2018},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MBB_2018_13_1_a10/}
}
I. A. Borisova; O. A. Kutnenko. The problem of correction diagnostic errors in the target attribute with the function of rival similarity. Matematičeskaâ biologiâ i bioinformatika, Volume 13 (2018) no. 1, pp. 38-49. http://geodesic.mathdoc.fr/item/MBB_2018_13_1_a10/
