Parallel frequent itemset mining on the Intel MIC accelerators

M. L. Tsymbler

Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Tome 8 (2019) no. 1, pp. 54-70

Cet article a éte moissonné depuis la source Math-Net.Ru

Voir la notice de l'article

Résumé

Association rule mining is one of the basic problems of data mining, which supposes finding strong correlations between itemsets in large transaction database. Association rules are generated from frequent itemsets (itemset is frequent if its items frequent occur together in transactions). The DIC (Dynamic Itemset Counting) algorithm is modification of the classical Apriori algorithm of finding frequent itemsets. DIC tries to reduce the number of passes made over the transaction database while keeping the number of itemsets counted in a pass relatively low. The paper addresses the task of accelerating DIC on the Intel MIC (Many Integrated Core) systems in the case when the transaction database fits into the main memory. The paper presents a parallel implementation of DIC based on OpenMP technology and thread-level parallelism. We exploit the bit-based internal layout for transactions and itemsets. This technique simplifies the support count via logical bitwise operation, and allows for vectorization of such a step. Experiments with large synthetic and real databases showed good performance and scalability of the proposed algorithm.

Keywords: data mining, frequent itemset counting, OpenMP, Intel Many Integrated Core.

@article{VYURV_2019_8_1_a3,
     author = {M. L. Tsymbler},
     title = {Parallel frequent itemset mining on the {Intel} {MIC} accelerators},
     journal = {Vestnik \^U\v{z}no-Uralʹskogo gosudarstvennogo universiteta. Seri\^a Vy\v{c}islitelʹna\^a matematika i informatika},
     pages = {54--70},
     year = {2019},
     volume = {8},
     number = {1},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/VYURV_2019_8_1_a3/}
}

TY  - JOUR
AU  - M. L. Tsymbler
TI  - Parallel frequent itemset mining on the Intel MIC accelerators
JO  - Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika
PY  - 2019
SP  - 54
EP  - 70
VL  - 8
IS  - 1
UR  - http://geodesic.mathdoc.fr/item/VYURV_2019_8_1_a3/
LA  - ru
ID  - VYURV_2019_8_1_a3
ER  -

%0 Journal Article
%A M. L. Tsymbler
%T Parallel frequent itemset mining on the Intel MIC accelerators
%J Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika
%D 2019
%P 54-70
%V 8
%N 1
%U http://geodesic.mathdoc.fr/item/VYURV_2019_8_1_a3/
%G ru
%F VYURV_2019_8_1_a3

M. L. Tsymbler. Parallel frequent itemset mining on the Intel MIC accelerators. Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Tome 8 (2019) no. 1, pp. 54-70. http://geodesic.mathdoc.fr/item/VYURV_2019_8_1_a3/

Bibliographie
Cité par

[1] V. V. Voevodin, Vl. V. Voevodin, Parallel Computing, BHV-Petersburg, SPb, 2002, 608 pp.

[2] R. Agrawal, R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases”, VLDB’94, Proceedings of 20th International Conference on Very Large Data Bases (September 12–15, 1994, Santiago de Chile, Chile), 1994, 487–499

[3] D. F. Bacon, S. L. Graham, O. J. Sharp, “Compiler Transformations for High-Performance Computing”, ACM Computing Surveys, 26:4 (1994), 345–420 | DOI

[4] C. Borgelt, “Frequent Item Set Mining”, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2:6 (2012), 437–456 | DOI

[5] S. Brin, R. Motwani, J. D. Ullman, S. Tsur, “Dynamic Itemset Counting and Implication Rules for Market Basket Data”, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data (May 13–15, 1997, Tucson, Arizona, USA), 1997, 255–264 | DOI

[6] D. Burdick, M. Calimlim, J. Flannick, et al., “MAFIA: A Maximal Frequent Itemset Algorithm”, IEEE Transactions on Knowledge and Data Engineering, 17:11 (2005), 1490–1504 | DOI

[7] Y. Chang, J. Chen, Y. Tsai, “Mining Subspace Clusters from DNA Microarray Data Using Large Itemset Techniques”, Journal of Computational Biology, 16:5 (2009), 745–768 | DOI

[8] D. W. Cheung, K. Hu, S. Xia, “An Adaptive Algorithm for Mining Association Rules on Shared-Memory Parallel Machines”, Distributed and Parallel Databases, 9:2 (2001), 99–132 | DOI | Zbl

[9] J. Dong, M. Han, “BitTableFI: An Efficient Mining Frequent Itemsets Algorithm”, Knowledge-Based Systems, 20:4 (2007), 329–335 | DOI

[10] A. Duran, M. Klemm, “The Intel Many Integrated Core Architecture”, 2012 International Conference on High Performance Computing and Simulation, HPCS 2012 (Madrid, Spain, July 2–6, 2012), 2012, 365–366 | DOI

[11] J. Han, J. Pei, Y. Yin, “Mining Frequent Patterns without Candidate Generation”, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (May 16–18, 2000, Dallas, Texas, USA), 1–12 | DOI

[12] IBM Quest Synthetic Data Generator } {\tt https://ibmquestdatagen.sourceforge.io/

[13] P. Kostenetskiy, A. Safonov, “SUSU Supercomputer Resources”, PCT’2016, International Scientific Conference on Parallel Computational Technologies (Arkhangelsk, Russia, March 29–31, 2016.), v. 1576, CEUR Workshop Proceedings, 561–573

[14] P. Kumar, P. Bhatt, R. Choudhury, “Bitwise Dynamic Itemset Counting Algorithm”, Proceedings of the ICCIC 2015, IEEE International Conference on Computational Intelligence and Computing Research (December 10–12, 2015, Madurai, India), 2015, 1–4 | DOI

[15] J. Li, A. W. Fu, P. Fahey, “Efficient Discovery of Risk Patterns in Medical Data”, Artificial Intelligence in Medicine, 45:1 (2009), 77–89 | DOI

[16] R. Lischner, STL Reference: Containers, Iterators, and Algorithms, O’Reilly, 2003, 120 pp.

[17] C. Ordonez, N. F. Ezquerra, C. A. Santana, “Constraining and Summarizing Association Rules in Medical Data”, Knowledge and Information Systems, 9:3 (2006), 1–2 | DOI

[18] C. Ordonez, C. A. Santana, L. de Braal, “Discovering Interesting Association Rules in Medical Data”, 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Dallas, Texas, USA, May 14, 2000, 2000, 78–85

[19] P. Paranjape-Voditel, U. Deshpande, “A DIC-based Distributed Algorithm for Frequent Itemset Generation”, Journal of Software, 6:2 (2011), 306–313 | DOI

[20] O. Pattanaprateep, M. McEvoy, J. Attia, A. Thakkinstian, “Evaluation of Rational Nonsteroidal Anti-inflammatory Drugs and Gastro-protective Agents Use; Association Rule Data Mining Using Outpatient Prescription Patterns”, BMC Medical Informatics and Decision Making, 17:1 (2017), 96:1–96:7 | DOI

[21] B. Schlegel, T. Karnagel, T. Kiefer, W. Lehner, “Scalable Frequent Itemset Mining on Many-core Processors”, Proceedings of the 9th International Workshop on Data Management on New Hardware, DaMoN 2013 (New York, NY, USA, June 24, 2013), 2013, 3 | DOI | Zbl

[22] I. Sokolinskaya, L. Sokolinsky, “Revised Pursuit Algorithm for Solving Non-stationary Linear Programming Problems on Modern Computing Clusters with Manycore Accelerators”, Supercomputing. RuSCDays 2016. Communications in Computer and Information Science, 687 (2016), 212–223 | DOI

[23] M. J. Zaki, “Scalable Algorithms for Association Mining”, IEEE Transactions on Knowledge and Data Engineering, 12:3 (2000), 372–390 | DOI

[24] M. Zymbler, “Accelerating Dynamic Itemset Counting on Intel Many-core Systems”, Proceedings of the 40th International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO’2017 (Opatija, Croatia, May 22–26, 2017), IEEE, 2017, 1575–1580 | DOI

Parcourir par

Geodesic

Parcourir par