Anomaly detection in JSON structured data
Prikladnaâ diskretnaâ matematika, no. 2 (2022), pp. 83-103.

See the article record from the source Math-Net.Ru

In this paper, we address the problem of intrusion detection for modern web and mobile applications with a cloud-based server side by detecting malicious content in JSON data. JSON is currently one of the most popular formats for data serialization and exchange between the client and server parts of an application. We propose a method for building, from a given set of JSON objects, a JSON model capable of detecting structure and type anomalies. The model combines models of the basic data types found inside the JSON objects of a collection with a schema model that generalizes the structure of the objects in the collection. We performed experiments that modified object structures and inserted code injection attack vectors such as SQL injections, OS command injections, and JavaScript/HTML injections. The analysis showed a statistically significant association between the model's predictions and the presence of anomalies in data gathered from the traffic of real web applications. The quality of the model's predictions was measured using the Matthews correlation coefficient (MCC). The MCC values computed on the data were close to one, which indicates that the model is highly effective at detecting anomalies in JSON objects.
Keywords: web traffic security, anomaly detection, machine learning.
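
For readers unfamiliar with the metric, the MCC named in the abstract can be computed from the four confusion-matrix counts (true/false positives and negatives). The following is a minimal Python sketch, not the authors' code: the function names `confusion_counts` and `mcc` and the toy labels are assumptions made purely for illustration, and returning 0.0 when the denominator vanishes is a common convention rather than something stated in the source.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels, where 1 = anomalous and 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient in [-1, 1]; 1 means perfect agreement."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical toy labels (not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(mcc(*confusion_counts(y_true, y_pred)), 3))  # prints 0.467
```

A value near 1 means that predictions and true labels agree almost perfectly, a value near 0 corresponds to chance-level prediction, and a value near -1 means systematic disagreement.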
@article{PDM_2022_2_a4,
     author = {E. A. Shliakhtina and D. Yu. Gamayunov},
     title = {Anomaly detection in {JSON} structured data},
     journal = {Prikladna\^a diskretna\^a matematika},
     pages = {83--103},
     publisher = {mathdoc},
     number = {2},
     year = {2022},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/PDM_2022_2_a4/}
}
TY  - JOUR
AU  - E. A. Shliakhtina
AU  - D. Yu. Gamayunov
TI  - Anomaly detection in JSON structured data
JO  - Prikladnaâ diskretnaâ matematika
PY  - 2022
SP  - 83
EP  - 103
IS  - 2
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/PDM_2022_2_a4/
LA  - ru
ID  - PDM_2022_2_a4
ER  - 
%0 Journal Article
%A E. A. Shliakhtina
%A D. Yu. Gamayunov
%T Anomaly detection in JSON structured data
%J Prikladnaâ diskretnaâ matematika
%D 2022
%P 83-103
%N 2
%I mathdoc
%U http://geodesic.mathdoc.fr/item/PDM_2022_2_a4/
%G ru
%F PDM_2022_2_a4
E. A. Shliakhtina; D. Yu. Gamayunov. Anomaly detection in JSON structured data. Prikladnaâ diskretnaâ matematika, no. 2 (2022), pp. 83-103. http://geodesic.mathdoc.fr/item/PDM_2022_2_a4/

[1] JSON Schema, 2021, www.json-schema.org

[2] A. A. Frozza, R. dos Santos Mello, F. S. da Costa, “An approach for schema extraction of JSON and extended JSON document collections”, IEEE Intern. Conf. IRI (6–9 July, 2018), 356–363

[3] M. Klettke, U. Störl, S. Scherzinger, “Schema extraction and structural outlier detection for JSON-based NoSQL data stores”, Conf. BTW (Hamburg, Germany, 4–6 March, 2015), 425–444

[4] M. A. Baazizi, D. Colazzo, G. Ghelli et al., “Parametric schema inference for massive JSON datasets”, VLDB J., 28:4 (2019), 497–521

[5] B. N. Miller, Detection of Malicious Content in JSON Structured Data using Multiple Concurrent Anomaly Detection Methods, Dissertation, Eastern Michigan University, 2016, 125 pp.

[6] Payload Box, 2021, www.github.com/payloadbox

[7] FuzzDB Project, 2021, www.github.com/fuzzdb-project/fuzzdb

[8] SQL injection dataset, 2021, www.kaggle.com/syedsaqlainhussain/sql-injection-dataset

[9] Cross site scripting XSS dataset for Deep learning, 2021, www.kaggle.com/syedsaqlainhussain/cross-site-scripting-xss-dataset-for-deeplearning

[10] P. Baldi, S. Brunak, Y. Chauvin et al., “Assessing the accuracy of prediction algorithms for classification: an overview”, Bioinformatics, 16:5 (2000), 412–424

[11] D. Chicco, G. Jurman, “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation”, BMC Genomics, 21:1 (2020), 1–13

[12] D. Chicco, N. Totsch, G. Jurman, “The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation”, BioData Mining, 14:1 (2021), 1–22