Large language models (LLMs) have demonstrated remarkable in-context learning capabilities across diverse applications. In this work, we explore the effectiveness of LLMs for generating realistic synthetic tabular data, identifying key prompt design elements to optimize performance. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets. Evaluations on real-world datasets show that EPIC achieves state-of-the-art machine learning classification performance, significantly improving generation efficiency. These findings highlight the effectiveness of EPIC for synthetic tabular data generation, particularly in addressing class imbalance.
Baseline methods exhibit low sensitivity, indicating that they fail to accurately generate minority-class samples due to inherent class biases. In contrast, our method overcomes these biases, achieving high sensitivity and balanced performance across all classes.
Our prompt includes repeated data example sets consisting of
feature names and class-balanced groups, with the feature name at the end serving as a trigger for the
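The prompt structure described above can be sketched as follows. This is a hypothetical illustration, not the authors' exact implementation: the record format, the `label is` phrasing, and the helper name `build_prompt` are assumptions made for the example.

```python
import random

def build_prompt(records_by_class, feature_names, n_groups=3, seed=0):
    """Assemble repeated, class-balanced example groups ending in a trigger.

    records_by_class: dict mapping class label -> list of feature dicts.
    feature_names:    ordered list of feature (column) names.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(n_groups):
        # One sample from every class per group keeps each group balanced,
        # even when the underlying dataset is imbalanced.
        for label, records in records_by_class.items():
            rec = rng.choice(records)
            row = ", ".join(f"{name} is {rec[name]}" for name in feature_names)
            lines.append(f"{row}, label is {label}")
    # The trailing feature name serves as the trigger: the LLM completes
    # it into a new synthetic row in the same consistent format.
    lines.append(f"{feature_names[0]} is")
    return "\n".join(lines)
```

Because every group contains exactly one example per class, the in-context distribution the LLM sees is balanced regardless of the class frequencies in the source data.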
LLM to generate realistic synthetic tabular data. The proposed unique variable mapping remaps categorical
values to distinct alphanumeric strings, so that values remain clearly distinguishable across variables.
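A minimal sketch of such a mapping is shown below. The two-letter token scheme and the function name `unique_variable_mapping` are assumptions for illustration; the point is that each (column, category) pair gets its own token, so that the same raw value (e.g. `"yes"`) appearing in two different columns is never ambiguous, and generated tokens can be mapped back to original values.

```python
import itertools
import string

def unique_variable_mapping(categories_by_column):
    """Map each (column, category) pair to a distinct alphanumeric token.

    categories_by_column: dict mapping column name -> list of category values.
    Returns (forward, inverse) dicts for encoding data into prompts and
    decoding LLM output back into original categorical values.
    """
    # Generator of distinct two-letter tokens: AA, AB, AC, ...
    tokens = ("".join(pair) for pair in
              itertools.product(string.ascii_uppercase, repeat=2))
    forward, inverse = {}, {}
    for column, values in categories_by_column.items():
        for value in values:
            token = next(tokens)
            forward[(column, value)] = token
            inverse[token] = (column, value)
    return forward, inverse
```

Since the mapping is a bijection, decoding a generated token recovers both the column and the original category, which makes post-processing the LLM output straightforward.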
Our method achieves state-of-the-art F1 scores and balanced accuracy across all six datasets.
Our method exhibits robust performance when used with open-source LLMs (Llama 2 and Mistral).
Example of a data synthesis prompt for the Sick dataset.
@misc{kim2024Bones,
title={EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models},
author={Kim, Jinhee and Kim, Taesung and Choo, Jaegul},
year={2024},
eprint={2404.12404},
archivePrefix={arXiv},
primaryClass={cs.LG}
}