An Analysis of Synthetic Data Generation Using GANs for Tabular Data | IJCT Volume 13 – Issue 2 | IJCT-V13I2P24

International Journal of Computer Techniques
ISSN 2394-2231
Volume 13, Issue 2  |  Published: March – April 2026

Authors

Sooraj S, Dr. Amrita Parashar

Abstract

Class imbalance is a common problem in many real-world classification tasks, where the minority classes contain important information but are underrepresented in the data. Classical machine learning methods tend to be biased towards the majority class, which leads to poor predictions on the minority classes. The aim of this study is to assess how effectively synthetic data generation addresses class imbalance, comparing classical oversampling with GAN-based generation. Three datasets of varying sizes were used: a small customer churn dataset, the medium-sized Adult Census Income dataset, and a large financial transaction dataset. Classical machine learning models were first trained on the original imbalanced datasets. Subsequently, SMOTE and three deep generative models for synthetic tabular data, namely CTGAN, TVAE, and CopulaGAN, were used to generate synthetic minority samples. Precision, recall, F1-score, accuracy, and ROC AUC were used to evaluate the models trained on the augmented datasets. The experiments show that SMOTE improves minority-class performance on the small and medium-sized datasets. GAN-based models show promise, but they are highly sensitive to dataset characteristics, such as dataset size and feature dimensionality, and to preprocessing techniques. On the large financial dataset, where the features are pre-scaled and anonymized, the GAN models learned the underlying distribution poorly. Overall, traditional oversampling techniques remained consistently reliable, whereas the suitability of GAN-based models should be judged against the characteristics of each dataset.
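The interpolation idea behind SMOTE-style oversampling can be illustrated with a minimal sketch. It is not the authors' implementation and not the imbalanced-learn API: the function name `smote_like_oversample`, the parameter `k`, and the toy rows are assumptions made for illustration, and the sketch handles numeric features only.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows.

    Hypothetical helper for illustration: numeric features only, with a
    brute-force nearest-neighbour search over the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base row, excluding itself.
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two numeric features (made-up values).
minority_rows = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_rows = smote_like_oversample(minority_rows, n_new=6)
```

Each synthetic row lies on a line segment between two existing minority rows, so the sketch never leaves the region the minority class already occupies. The generative models named in the abstract instead learn a model of the data distribution and sample from it.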

Keywords

Class Imbalance, Oversampling, SMOTE, GAN-based Data Augmentation, Tabular Data, Minority Class Prediction.

Conclusion

This study provided a thorough investigation of classical oversampling techniques and GAN-based synthetic data generation techniques for addressing class imbalance in tabular classification tasks. The investigation was carried out on three real-world datasets of different sizes: a customer churn dataset (small), the Adult Census Income dataset (medium), and a customer transaction dataset (large). The experimental results show that accuracy alone is not a sufficient measure of model performance, because on an imbalanced dataset it is dominated by the majority class. This confirms that accuracy is inadequate in the imbalanced learning scenario and should be read alongside minority-class metrics such as precision, recall, F1-score, and ROC AUC.
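The point about accuracy can be made concrete with a small worked example. The 95/5 class split and the always-majority classifier are hypothetical and are not results from the study; `binary_metrics` is an illustrative helper, not a library function.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1, with the minority class labelled 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced test set: 5 minority rows and 95 majority rows.
y_true = [1] * 5 + [0] * 95
# A degenerate classifier that always predicts the majority class.
always_majority = [0] * 100

metrics = binary_metrics(y_true, always_majority)
print(metrics)
```

In this made-up case the classifier scores 0.95 accuracy while its minority-class recall and F1 are both zero, which is the failure mode that accuracy hides.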

How to Cite This Paper

Sooraj S, Dr. Amrita Parashar (2026). An Analysis of Synthetic Data Generation Using GANs for Tabular Data. International Journal of Computer Techniques, 13(2). ISSN: 2394-2231.

© 2026 International Journal of Computer Techniques (IJCT). All rights reserved.
