Overview of methods for integrating data mining into DBMS
Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Volume 8 (2019) no. 2, pp. 32-62. This article was harvested from the source Math-Net.Ru.

Data mining aims to discover understandable knowledge in data that can be used for decision-making in various fields of human activity. The Big Data phenomenon is a characteristic feature of the modern information society. Cleaning and structuring Big Data produces very large databases and data warehouses. Despite the emergence of many NoSQL DBMSs, the relational DBMS remains the main database management tool. Integrating data mining into relational DBMSs is one of the promising directions in the development of relational databases. Integration both avoids the overhead of exporting the analyzed data from the repository and importing the analysis results back into it, and makes it possible to use the system services built into the DBMS architecture for data analysis. The paper gives an overview of methods and approaches to integrating data mining into a DBMS and classifies these approaches. SQL language extensions that provide syntactic support for data mining in a DBMS are described. Examples of SQL implementations of data mining algorithms and of data analysis systems for relational databases are considered.
Keywords: data mining, relational DBMS, clustering, classification, pattern mining.
@article{VYURV_2019_8_2_a2,
     author = {M. L. Zymbler},
     title = {Overview of methods for integrating data mining into {DBMS}},
     journal = {Vestnik \^U\v{z}no-Uralʹskogo gosudarstvennogo universiteta. Seri\^a Vy\v{c}islitelʹna\^a matematika i informatika},
     pages = {32--62},
     year = {2019},
     volume = {8},
     number = {2},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/VYURV_2019_8_2_a2/}
}
M. L. Zymbler. Overview of methods for integrating data mining into DBMS. Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Volume 8 (2019) no. 2, pp. 32-62. http://geodesic.mathdoc.fr/item/VYURV_2019_8_2_a2/
