Representative sampling algorithm for database systems based on the partitioned parallelism
Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Tome 3 (2014) no. 4, pp. 36-50 Cet article a éte moissonné depuis la source Math-Net.Ru

Voir la notice de l'article

Sampling is a popular approach to very large databases processing in a wide range of applications, e.g. data mining, histograms construction, query execution cost estimation, etc. Use of either the sample instead of the original database can reduce the accuracy of the results, but offset by a reduction of time executing processing. Representative sampling allows you to save the sample of certain characteristics of the database. However, existing algorithms for representative sampling can not be used for pas-parallel database systems because it does not take into account the characteristics of the data distribution fissionable by the compute nodes of the cluster system. In this paper we propose al-representative sampling algorithm for parallel relational database systems based on the slice of parallelism. The results of computational experiments on the proposed algorithm, showing adequate maintenance of representativity database properties distributed across the nodes of a cluster system.
Keywords: relational databases, parallel database systems, representative sampling.
@article{VYURV_2014_3_4_a1,
     author = {D. D. Yantsen and M. L. Zymbler},
     title = {Representative sampling algorithm for database systems based on the partitioned parallelism},
     journal = {Vestnik \^U\v{z}no-Uralʹskogo gosudarstvennogo universiteta. Seri\^a Vy\v{c}islitelʹna\^a matematika i informatika},
     pages = {36--50},
     year = {2014},
     volume = {3},
     number = {4},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/VYURV_2014_3_4_a1/}
}
TY  - JOUR
AU  - D. D. Yantsen
AU  - M. L. Zymbler
TI  - Representative sampling algorithm for database systems based on the partitioned parallelism
JO  - Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika
PY  - 2014
SP  - 36
EP  - 50
VL  - 3
IS  - 4
UR  - http://geodesic.mathdoc.fr/item/VYURV_2014_3_4_a1/
LA  - ru
ID  - VYURV_2014_3_4_a1
ER  - 
%0 Journal Article
%A D. D. Yantsen
%A M. L. Zymbler
%T Representative sampling algorithm for database systems based on the partitioned parallelism
%J Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika
%D 2014
%P 36-50
%V 3
%N 4
%U http://geodesic.mathdoc.fr/item/VYURV_2014_3_4_a1/
%G ru
%F VYURV_2014_3_4_a1
D. D. Yantsen; M. L. Zymbler. Representative sampling algorithm for database systems based on the partitioned parallelism. Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Tome 3 (2014) no. 4, pp. 36-50. http://geodesic.mathdoc.fr/item/VYURV_2014_3_4_a1/

[1] P.S. Kostenetskii, A.V. Lepikhov, L.B. Sokolinskii, “Technologies of parallel database systems for hierarchical multiprocessor environments”, Automation and Remote Control, 68:5 (2007), 847–859

[2] C.S. Pan, M.L. Zymbler, “Development of a Parallel Database Management System on the Basis of Open-Source PostgreSQL DBMS”, Bulletin of South Ural State University. Series: Mathematical Modeling, Programming Computer Software, 18(277):12 (2012), 112–120

[3] L.B. Sokolinsky, “Survey of Architectures of Parallel Database Systems”, Programming and Computer Software, 30:6 (2004), 337–346

[4] D.D. Yantsen, M.L Zymbler, “Representative sampling algorithm for parallel database systems”, Parallel Computational Technologies (PCT'2014), Proceedings of the International Scientific Conference (Rostov-on-Don, Russia, April 1–3, 2014), Publishing of the South Ural State University, Chelyabinsk, 2014, 381

[5] S. Acharya, P.B. Gibbons, V. Poosala, S. Ramaswamy, “Join Synopses for Approximate Query Answering”, SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data (June 1-3, 1999, Philadelphia, Pennsylvania, USA), ACM, 1999, 275–286

[6] S. Agarwal, A. Panda, B. Mozafari, A.P. Iyer, S. Madden, I. Stoica, “Blink and It's Done: Interactive Queries on Very Large Data”, Proceedings of the VLDB Endowment, 5:1 (2011), 1902–1905

[7] J. Bisbal, J. Grimson, D.A. Bell, “A Formal Framework for Database Sampling”, Information Software Technology, 47:1 (2005), 819–828

[8] T.S. Buda, “Generation of Test Databases using Sampling Methods”, International Symposium on Software Testing and Analysis, ISSTA'13 (July 15–20, 2013, Lugano, Switzerland), ACM, 2013, 366–369

[9] T.S. Buda, T. Cerqueus, M. Kristiansen, J. Murphy, “VFDS: Very Fast Database Sampling System”, Proceedings of the IEEE 14th International Conference on Information Reuse Integration, IRI 2013 (San Francisco, CA, USA, August 14–16), IEEE, 2013, 153–160

[10] T.S. Buda, T. Cerqueus, J. Murphy, M. Kristiansen, “CoDS: A Representative Sampling Method for Relational Databases”, Database and Expert Systems Applications, 24th International Conference, DEXA 2013, Proceedings, Part I (Prague, Czech Republic, August 26–29), Lecture Notes in Computer Science, 8055, Springer, 2013, 342–356

[11] V.T. Chakaravarthy, V. Pandit, Y. Sabharwal, “Analysis of Sampling Techniques for Association Rule Mining”, Database Theory, ICDT 2009, 12th International Conference, Proceedings (St. Petersburg, Russia, March 23–25, 2009), ACM, 2009, 276–283

[12] S. Chaudhuri, G. Das, U. Srivastava, “Effective Use of Block-Level Sampling in Statistics Estimation”, Proceedings of the ACM SIGMOD International Conference on Management of Data (Paris, France, June 13–18, 2004), ACM, 2004, 287–298

[13] R. Gemulla, P. Rösch, W. Lehner, “Linked Bernoulli Synopses: Sampling along Foreign Keys”, Scientific and Statistical Database Management, 20th International Conference, SSDBM 2008, Proceedings (Hong Kong, China, July 9–11, 2008), Lecture Notes in Computer Science, Springer, 2008, 6–23

[14] B. Goethals, W.L. Page, M. Mampaey, “Mining Interesting Sets and Rules in Relational Databases”, Proceedings of the 2010 ACM Symposium on Applied Computing (SAC) (Sierre, Switzerland, March 22–26), ACM, 2010, 997– 1001

[15] J. Gryz, J. Guo, L. Liu, C. Zuzarte, “Query Sampling in DB2 Universal Database”, Proceedings of the ACM SIGMOD International Conference on Management of Data (Paris, France, June 13–18), ACM, 2004, 839–843

[16] Guide to the Financial Data Set of the PKDD'99 Discovery Challenge, (data obrascheniya: 24.05.2014) http://lisp.vse.cz/pkdd99/Challenge/berka.htm

[17] P.J. Haas, C. Koenig, “A Bi-Level Bernoulli Scheme for Database Sampling”, Proceedings of the ACM SIGMOD International Conference on Management of Data (Paris, France, June 13–18), ACM, 2004, 275–286

[18] Y.E. Ioannidis, V. Poosala, “Histogram-Based Approximation of Set-Valued Query-Answer”, VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases (September 7–10, 1999, Edinburgh, Scotland, UK), Morgan Kaufmann, 1999, 174–185

[19] G.H. John, P. Langley, “Static Versus Dynamic Sampling for Data Mining”, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96) (Portland, Oregon, USA), AAAI Press, 1996, 367–370

[20] M.S. Lakshmi, P.S. Yu, “Effectiveness of Parallel Joins”, IEEE Transactions on Knowledge and Data Engineering, 2:4 (1990), 410–424

[21] F. Olken, D. Rotem, “Random Sampling from Database Files: A Survey”, Statistical and Scientific Database Management, 5th International Conference SSDBM, Proceedings (Charlotte, NC, USA, April 3–5, 1990), Lecture Notes in Computer Science, Springer, 1990, 92–111

[22] Oracle Database SQL Language Reference, (accessed: 24.05.2014) http://docs.oracle.com/cd/E11882_01/server.112/e41084/statements_10002.htm

[23] C.R. Palmer, C. Faloutsos, “Density Biased Sampling: an Improved Method for Data mining and Clustering”, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (May 16–18, 2000, Dallas, Texas, USA), ACM, 2000, 82–92

[24] C.S. Pan, M.L. Zymbler, “Taming Elephants, or How to Embed Parallelism into PostgreSQL”, Database and Expert Systems Applications, 24th International Conference, DEXA 2013, Proceedings, Part I (Prague, Czech Republic, August 26–29), Lecture Notes in Computer Science, 8055, Springer, 2013, 153–164

[25] S. Parthasarathy, “Efficient Progressive Sampling for Association Rules”, Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002) (9–12 December, 2002, Maebashi City, Japan), IEEE, 2002, 354– 361

[26] X. Wu, Y. Wang, S. Guo, Y. Zheng, “Privacy Preserving Database Generation for Database Application Testing”, Fundamenta Informaticae, 78:1 (2007), 595–612

[27] X. Yin, J. Han, J. Yang, P.S. Yu, “Efficient Classification across Multiple Database Relations: A CrossMine Approach”, IEEE Transactions on Knowledge and Data Engineering, 18:1 (2006), 770–783