EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models

Korea Advanced Institute of Science and Technology
*indicates equal contributions.

Abstract

Large language models (LLMs) have demonstrated remarkable in-context learning capabilities across diverse applications. In this work, we explore the effectiveness of LLMs for generating realistic synthetic tabular data and identify the key prompt design elements that optimize performance. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets. Evaluations on real-world datasets show that EPIC achieves state-of-the-art machine learning classification performance while significantly improving generation efficiency. These findings highlight the effectiveness of EPIC for synthetic tabular data generation, particularly in addressing class imbalance.

ML classification performance when adding synthetic data


Baseline methods exhibit low sensitivity, indicating that they fail to accurately generate minority-class samples due to inherent class biases. In contrast, our method overcomes these biases, achieving high sensitivity and balanced performance across all classes.
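To make the role of sensitivity concrete, the following minimal sketch uses the standard definitions (sensitivity is recall on the positive class, and balanced accuracy is the mean of per-class recall) on a toy imbalanced example. The labels and predictions are hypothetical and are not taken from the paper's experiments; they only illustrate how a classifier can score high plain accuracy while missing most minority-class samples.

```python
def sensitivity(y_true, y_pred, positive):
    """Recall for one class: TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    classes = sorted(set(y_true))
    return sum(sensitivity(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical toy data: 95 majority ("neg") and 5 minority ("pos") samples.
y_true = ["neg"] * 95 + ["pos"] * 5
# A biased classifier that finds only 1 of the 5 minority samples.
y_pred = ["neg"] * 95 + ["neg"] * 4 + ["pos"] * 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_sensitivity = sensitivity(y_true, y_pred, "pos")
bal_acc = balanced_accuracy(y_true, y_pred)

print(f"accuracy={accuracy:.2f}")
print(f"sensitivity={minority_sensitivity:.2f}")
print(f"balanced accuracy={bal_acc:.2f}")
```

In this toy case plain accuracy stays high because the majority class dominates, whereas sensitivity and balanced accuracy expose the missed minority samples.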


Overview of our approach


Our prompt includes repeated data example sets, each consisting of the feature names followed by class-balanced groups of samples, with the feature names at the end serving as a trigger for the LLM to generate realistic synthetic tabular data. The proposed unique variable mapping remaps categorical values to distinct alphanumeric strings, ensuring clear distinction and variability among variables.
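The prompt layout described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `build_unique_mapping` and `build_prompt`, the CSV-style row format, the three-character code length, and the toy rows are all hypothetical choices made for this example.

```python
import random
import string


def build_unique_mapping(rows, categorical_cols, code_len=3, seed=0):
    """Assign every (column, value) pair its own alphanumeric code (assumed scheme)."""
    rng = random.Random(seed)
    alphabet = string.ascii_uppercase + string.digits
    mapping, used = {}, set()
    for col in categorical_cols:
        for value in sorted({row[col] for row in rows}, key=str):
            code = "".join(rng.choices(alphabet, k=code_len))
            while code in used:  # redraw until the code is unused anywhere
                code = "".join(rng.choices(alphabet, k=code_len))
            used.add(code)
            mapping[(col, value)] = code
    return mapping


def build_prompt(rows, feature_names, class_col, mapping,
                 n_sets=2, per_class=2, seed=0):
    """Repeat [header + class-grouped, class-balanced rows], then end on the header."""
    rng = random.Random(seed)
    header = ",".join(feature_names)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[class_col], []).append(row)

    lines = []
    for _ in range(n_sets):
        lines.append(header)
        for label in sorted(by_class, key=str):  # rows of one class stay together
            pool = by_class[label]
            if len(pool) >= per_class:
                picked = rng.sample(pool, per_class)
            else:  # tiny minority class: sample with replacement
                picked = rng.choices(pool, k=per_class)
            for row in picked:
                # Categorical values are replaced by their codes; others pass through.
                cells = [str(mapping.get((c, row[c]), row[c])) for c in feature_names]
                lines.append(",".join(cells))
        lines.append("")  # blank line between example sets
    lines.append(header)  # trailing header acts as the generation trigger
    return "\n".join(lines)


# Hypothetical toy rows with an imbalanced label column.
rows = [
    {"age": 41, "sex": "F", "label": "negative"},
    {"age": 63, "sex": "M", "label": "negative"},
    {"age": 35, "sex": "F", "label": "negative"},
    {"age": 58, "sex": "M", "label": "negative"},
    {"age": 72, "sex": "F", "label": "sick"},
    {"age": 66, "sex": "M", "label": "sick"},
]
features = ["age", "sex", "label"]
mapping = build_unique_mapping(rows, categorical_cols=["sex", "label"])
prompt = build_prompt(rows, features, "label", mapping, n_sets=2, per_class=2)
print(prompt)
```

The resulting string would be sent to the LLM as-is, and the generated rows would need to be mapped back to the original categorical values by inverting `mapping`.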


Machine learning classification performance


Our method achieves state-of-the-art F1 scores and balanced accuracy across all six datasets.


Classification performance with open-source models


Our method exhibits robust performance when used with open-source LLMs (Llama2 and Mistral).


Prompt example


Example of a data synthesis prompt for the Sick dataset.


BibTeX

@inproceedings{kim2024Bones,
  title={EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models},
  author={Kim, Jinhee and Kim, Taesung and Choo, Jaegul},
  year={2024},
  eprint={2404.12404},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}