Large language models (LLMs) have demonstrated remarkable in-context learning capabilities across diverse applications. In this work, we explore the effectiveness of LLMs for generating realistic synthetic tabular data, identifying key prompt design elements to optimize performance. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets. Evaluations on real-world datasets show that EPIC achieves state-of-the-art machine learning classification performance, significantly improving generation efficiency. These findings highlight the effectiveness of EPIC for synthetic tabular data generation, particularly in addressing class imbalance.
Baseline methods exhibit low sensitivity, indicating that they fail to accurately generate minority-class samples due to inherent class biases. In contrast, our method overcomes these biases, achieving high sensitivity and balanced performance across all classes.
Our prompt includes repeated data example sets consisting of
feature names and class-balanced groups, with the feature name at the end serving as a trigger for the
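The prompt structure described above can be sketched as follows. This is a hypothetical illustration, not the authors' exact implementation: the record format, the `label is` phrasing, and the helper name `build_prompt` are assumptions made for the example.

```python
import random

def build_prompt(records_by_class, feature_names, n_groups=3, seed=0):
    """Assemble repeated, class-balanced example groups ending in a trigger.

    records_by_class: dict mapping class label -> list of feature dicts.
    feature_names:    ordered list of feature (column) names.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(n_groups):
        # One sample from every class per group keeps each group balanced,
        # even when the underlying dataset is imbalanced.
        for label, records in records_by_class.items():
            rec = rng.choice(records)
            row = ", ".join(f"{name} is {rec[name]}" for name in feature_names)
            lines.append(f"{row}, label is {label}")
    # The trailing feature name serves as the trigger: the LLM completes
    # it into a new synthetic row in the same consistent format.
    lines.append(f"{feature_names[0]} is")
    return "\n".join(lines)
```

Because every group contains exactly one example per class, the in-context distribution the LLM sees is balanced regardless of the class frequencies in the source data.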
LLM to generate realistic synthetic tabular data. The proposed unique variable mapping remaps categorical
values to distinct alphanumeric strings, so that values remain clearly distinguishable across variables.
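A minimal sketch of such a mapping is shown below. The two-letter token scheme and the function name `unique_variable_mapping` are assumptions for illustration; the point is that each (column, category) pair gets its own token, so that the same raw value (e.g. `"yes"`) appearing in two different columns is never ambiguous, and generated tokens can be mapped back to original values.

```python
import itertools
import string

def unique_variable_mapping(categories_by_column):
    """Map each (column, category) pair to a distinct alphanumeric token.

    categories_by_column: dict mapping column name -> list of category values.
    Returns (forward, inverse) dicts for encoding data into prompts and
    decoding LLM output back into original categorical values.
    """
    # Generator of distinct two-letter tokens: AA, AB, AC, ...
    tokens = ("".join(pair) for pair in
              itertools.product(string.ascii_uppercase, repeat=2))
    forward, inverse = {}, {}
    for column, values in categories_by_column.items():
        for value in values:
            token = next(tokens)
            forward[(column, value)] = token
            inverse[token] = (column, value)
    return forward, inverse
```

Since the mapping is a bijection, decoding a generated token recovers both the column and the original category, which makes post-processing the LLM output straightforward.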
Our method achieves state-of-the-art F1 scores and balanced accuracy across all six datasets.
Our method exhibits robust performance when used with open-source LLMs (Llama 2 and Mistral).
Example of a data synthesis prompt for the Sick dataset.
@misc{kim2024Bones,
title={EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models},
author={Kim, Jinhee and Kim, Taesung and Choo, Jaegul},
year={2024},
eprint={2404.12404},
archivePrefix={arXiv},
primaryClass={cs.LG}
}