Feature engineering pipeline optimisation in AutoML workflow using large language models
Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part IV, Vol. 540 (2024), pp. 82-112. This article was harvested from the Math-Net.Ru source.


One important way to achieve more efficient automated machine learning is to apply meta-optimisation to all stages of pipeline design. In this work, we use large language models in the feature engineering stage, as both optimisers and domain-knowledge experts. We encode the feature engineering pipeline in natural language as a sequence of atomic operations. Black-box optimisation is implemented by requesting a feature engineering pipeline from the LLM with a prompt consisting of predefined instructions, a dataset description, and previously evaluated pipelines. To increase the time efficiency and stability of optimisation, we implement a population-based algorithm that produces a set of pipelines with each LLM response instead of a single one. Multi-step optimisation is also attempted, to provide the LLM with additional domain knowledge. To analyse the performance of the proposed approach, we conduct a set of experiments on open datasets, with random search as the baseline for the optimisation task. We find that while straightforward generation with the gpt-3.5-turbo model is close to the baseline at the same time cost, population-based pipeline generation outperforms the baseline and the other approaches. Our results confirm that the proposed approach can increase the overall performance of machine learning models at the same optimisation time cost, while requiring fewer tokens to obtain the result.
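To make the loop described in the abstract concrete, the following Python code is a minimal sketch (not the authors' implementation) of prompt-driven, population-based black-box optimisation: each iteration composes a prompt from fixed instructions, a dataset description, and previously evaluated pipelines with their scores, then asks the LLM to propose several candidate pipelines at once. The operation vocabulary, prompt wording, and the evaluate_pipeline callback are hypothetical placeholders; the OpenAI client call assumes an OPENAI_API_KEY is set in the environment.

    # Hypothetical sketch of the population-based LLM optimisation loop.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Illustrative vocabulary of atomic feature engineering operations.
    OPERATIONS = ["pca", "scaling", "polynomial_features", "select_k_best"]

    def build_prompt(dataset_description, history, population_size=5):
        """Compose the prompt from predefined instructions, the dataset
        description, and previously evaluated pipelines with their scores."""
        lines = [
            "You are optimising a feature engineering pipeline.",
            f"Allowed operations: {', '.join(OPERATIONS)}.",
            f"Dataset: {dataset_description}",
            "Previously evaluated pipelines (higher score is better):",
        ]
        lines += [f"  {' -> '.join(p)}: {score:.4f}" for p, score in history]
        lines.append(f"Propose {population_size} new pipelines, one per line, "
                     "formatted as 'op1 -> op2 -> ...'.")
        return "\n".join(lines)

    def ask_llm(prompt):
        """Single chat-completion request; the model name follows the paper."""
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def parse_pipelines(text):
        """Keep only response lines that parse into known atomic operations."""
        pipelines = []
        for line in text.splitlines():
            ops = [op.strip() for op in line.split("->")]
            if ops and all(op in OPERATIONS for op in ops):
                pipelines.append(ops)
        return pipelines

    def optimise(dataset_description, evaluate_pipeline, n_iterations=10):
        """evaluate_pipeline is a user-supplied callback returning a score."""
        history = []  # (pipeline, score) pairs fed back into later prompts
        for _ in range(n_iterations):
            prompt = build_prompt(dataset_description, history)
            for pipeline in parse_pipelines(ask_llm(prompt)):
                history.append((pipeline, evaluate_pipeline(pipeline)))
        return max(history, key=lambda ps: ps[1]) if history else None

Requesting a whole population per response, rather than one pipeline per call, is what lets this scheme evaluate more candidates per LLM request and thus spend fewer tokens for the same optimisation budget.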
@article{ZNSL_2024_540_a4,
     author = {I. L. Iov and N. O. Nikitin},
     title = {Feature engineering pipeline optimisation in {AutoML} workflow using large language models},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {82--112},
     year = {2024},
     volume = {540},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2024_540_a4/}
}
TY  - JOUR
AU  - I. L. Iov
AU  - N. O. Nikitin
TI  - Feature engineering pipeline optimisation in AutoML workflow using large language models
JO  - Zapiski Nauchnykh Seminarov POMI
PY  - 2024
SP  - 82
EP  - 112
VL  - 540
UR  - http://geodesic.mathdoc.fr/item/ZNSL_2024_540_a4/
LA  - en
ID  - ZNSL_2024_540_a4
ER  - 
%0 Journal Article
%A I. L. Iov
%A N. O. Nikitin
%T Feature engineering pipeline optimisation in AutoML workflow using large language models
%J Zapiski Nauchnykh Seminarov POMI
%D 2024
%P 82-112
%V 540
%U http://geodesic.mathdoc.fr/item/ZNSL_2024_540_a4/
%G en
%F ZNSL_2024_540_a4
I. L. Iov; N. O. Nikitin. Feature engineering pipeline optimisation in AutoML workflow using large language models. Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part IV, Vol. 540 (2024), pp. 82-112. http://geodesic.mathdoc.fr/item/ZNSL_2024_540_a4/
