Towards Enhancing Data Science Agents with Semantics
Computer Science and Information Systems, Volume 23 (2026) no. 1

See the article record from the source Computer Science and Information Systems website

Data lakes, initially designed for storing heterogeneous datasets, have recently been extended with ML capabilities to unify data science tasks within a single platform. However, they still lack essential ML-specific features, limiting their effectiveness for end-to-end automation. Automated Machine Learning (AutoML) and Large Language Models (LLMs) offer potential solutions by streamlining various stages of the ML pipeline, yet both have significant limitations. This paper presents an integration of AutoML frameworks and LLMs within a data lake system. We introduce a metadata model to capture data analytics processes, a Python package wrapping existing AutoML libraries, and a module utilizing LLMs to automate ML tasks. A comparative evaluation indicates that AutoML simplifies pipeline creation but limits user control and lacks robust data preprocessing support. LLMs can automate individual tasks, such as code generation, but struggle to orchestrate complete workflows effectively. With either approach, the resulting pipelines risk remaining basic prototypes that still require manual refinement. The primary challenge lies in managing task interdependencies within ML pipelines. Retrieval-augmented generation enables dynamic access to external information but may overlook structured data relationships, leading to incomplete or redundant results. Therefore, we propose an extended vision that integrates multi-agent frameworks for data science with knowledge graphs that capture historical experience from previous ML experiments. We present preliminary results for developing comprehensive, context-aware ML agents and their integration into our data lake system SEDAR.
Keywords: AutoML, LLMs, Semantic Data Lake, MLOps, Data Science Agents
Sayed Hoseini; Maximilian Ibbels; Maximilian Knoll; Christoph Quix. Towards Enhancing Data Science Agents with Semantics. Computer Science and Information Systems, Volume 23 (2026) no. 1. http://geodesic.mathdoc.fr/item/CSIS_2026_23_1_a20/
@article{CSIS_2026_23_1_a20,
     author = {Sayed Hoseini and Maximilian Ibbels and Maximilian Knoll and Christoph Quix},
     title = {Towards {Enhancing} {Data} {Science} {Agents} with {Semantics}},
     journal = {Computer Science and Information Systems},
     year = {2026},
     volume = {23},
     number = {1},
     url = {http://geodesic.mathdoc.fr/item/CSIS_2026_23_1_a20/}
}
TY  - JOUR
AU  - Sayed Hoseini
AU  - Maximilian Ibbels
AU  - Maximilian Knoll
AU  - Christoph Quix
TI  - Towards Enhancing Data Science Agents with Semantics
JO  - Computer Science and Information Systems
PY  - 2026
VL  - 23
IS  - 1
UR  - http://geodesic.mathdoc.fr/item/CSIS_2026_23_1_a20/
ID  - CSIS_2026_23_1_a20
ER  - 
%0 Journal Article
%A Sayed Hoseini
%A Maximilian Ibbels
%A Maximilian Knoll
%A Christoph Quix
%T Towards Enhancing Data Science Agents with Semantics
%J Computer Science and Information Systems
%D 2026
%V 23
%N 1
%U http://geodesic.mathdoc.fr/item/CSIS_2026_23_1_a20/
%F CSIS_2026_23_1_a20