Workshop
Synthetic Data Generation with Generative AI
Sergul Aydore · Zhaozhi Qian · Mihaela van der Schaar
Hall E2 (level 1)
Sat 16 Dec, 7 a.m. PST
Synthetic data (SD) is data that has been generated by a mathematical model to solve downstream data science tasks. SD can be used to address three key problems: (1) private data release, (2) data de-biasing and fairness, and (3) data augmentation for boosting the performance of ML models. While SD offers great opportunities for these problems, SD generation is still a developing area of research. Systematic frameworks for SD deployment and evaluation are also still missing. Additionally, despite the substantial advances in generative AI, the scientific community still lacks a unified understanding of how generative AI can be used to generate SD for different modalities.
The goal of this workshop is to provide a platform for vigorous discussion among research communities from all of these perspectives, in the hope of advancing the use of SD for better and more trustworthy ML training. Through submissions and facilitated discussions, we aim to characterize and mitigate the common challenges of SD generation that span numerous application domains. The workshop is jointly organized by academic researchers (University of Cambridge) and industry partners from tech (Amazon AI).
Schedule
Sat 7:00 a.m. - 7:05 a.m. | Welcome and workshop overview (Talk) | Sergul Aydore
Sat 7:05 a.m. - 7:15 a.m. | Synthetic Data: Charting New Research Frontiers, Maximizing Impact, and Cultivating Collaborative Communities (Talk) | Mihaela van der Schaar
Sat 7:15 a.m. - 8:00 a.m. | Generating health records (Invited Talk) | Edward Choi
Sat 8:00 a.m. - 8:30 a.m. | Coffee Break & Poster Session (Poster)
Sat 8:30 a.m. - 9:15 a.m. | Privacy and Synthetic data (Invited Talk) | Antti Honkela
Sat 9:15 a.m. - 9:45 a.m. | Differentially Private Synthetic Data via Foundation Model APIs 1: Images (Contributed Talk) | Zinan Lin
Sat 9:45 a.m. - 10:15 a.m. | Effective Data Augmentation With Diffusion Models (Contributed Talk) | Max Gurinas · Brandon Trabucco
Sat 10:15 a.m. - 11:30 a.m. | Lunch Break & Poster Session (Poster)
Sat 11:30 a.m. - 12:15 p.m. | Diversity and Synthetic data (Invited Talk) | Adji Bousso Dieng
Sat 12:15 p.m. - 12:45 p.m. | Fair Wasserstein Coresets (Contributed Talk) | Vamsi Potluru
Sat 12:45 p.m. - 1:15 p.m. | Improving fairness for spoken language understanding in atypical speech with Text-to-Speech (Contributed Talk) | Venkatesh Ravichandran · Helin Wang
Sat 1:15 p.m. - 1:30 p.m. | Coffee Break & Poster Session (Poster)
Sat 1:30 p.m. - 2:15 p.m. | Generative Agents: Interactive Simulacra (Invited Talk) | Michael Bernstein
Sat 2:15 p.m. - 3:00 p.m. | Panel Discussion (Panel) | Danielle Belgrave · Cem Tekin · Robert Tillman · Megan Gibbs · Dino Oglic · Rudi Agius · Panagiota Konstantinou
- | Size Matters: Large Graph Generation with HiGGs (Poster) | Alex O. Davies · Nirav Ajmeri · Telmo Silva Filho
- | Generating Medical Instructions with Conditional Transformer (Poster) | Samuel Belkadi · Nicolo Micheletti · Lifeng Han · Warren Del-Pinto · Goran Nenadic
- | SciFix: Outperforming GPT3 on Scientific Factual Error Correction (Poster) | Dhananjay Ashok · Atharva Kulkarni · Hai Pham · Barnabas Poczos
- | Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI (Poster) | Elena Sizikova · Niloufar Saharkhiz · Diksha Sharma · Miguel Lago · Berkman Sahiner · Jana Delfino · Aldo Badano
- | Knowledge-Infused Prompting Improves Clinical Text Generation with Large Language Models (Poster) | Ran Xu · Hejie Cui · Yue Yu · Xuan Kan · Wenqi Shi · Yuchen Zhuang · Wei Jin · Joyce Ho · Carl Yang
- | Improving Code Style for Accurate Code Generation (Poster) | Naman Jain · Tianjun Zhang · Wei-Lin Chiang · Joseph Gonzalez · Koushik Sen · Ion Stoica
- | GeMQuAD: Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning (Poster) | Amani Namboori · Shivam Mangale · Andy Rosenbaum · Saleh Soltan
- | EDGE++: Improved Training and Sampling of EDGE (Poster) | Xiaohui Chen · Mingyang Wu · Liping Liu
- | Conditional Generative Modeling for High-dimensional Marked Temporal Point Processes (Poster) | Zheng Dong · Zekai Fan · Shixiang Zhu
- | Synthetic Data Generation for Scarce Road Scene Detection Scenarios (Poster) | Dipika Khullar · Yash Shah · Ninadkulamz · Negin Sokhandan
- | Stable Diffusion For Aerial Object Detection (Poster) | Yanan Jian · Fuxun Yu · Simranjit Singh · Dimitrios Stamoulis
- | INTAGS: Interactive Agent-Guided Simulation (Poster) | Song Wei · Andrea Coletta · Svitlana Vyetrenko · Tucker Balch
- | CALICO: Conversational Agent Localization via Synthetic Data Generation (Poster) | Andy Rosenbaum · Ershad Banijamali · Christopher DiPersio · Pegah Kharazmi · Pan Wei · Lu Zeng · Gokmen Oz · Wael Hamza · Clement Chung · Karolina Owczarzak · Fabian Triefenbach
- | Improving fairness for spoken language understanding in atypical speech with Text-to-Speech (Oral) | Helin Wang · Venkatesh Ravichandran · Milind Rao · Becky Lammers · Myra J. Sydnor · Nicholas Maragakis · Ankur Butala · Jayne Zhang · Lora Clawson · Victoria Chovaz · Laureano Moro-Velazquez
- | Generating Privacy-Preserving Longitudinal Synthetic Data (Poster) | Robin van Hoorn
- | AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing (Poster) | Namjoon Suh · Xiaofeng Lin · Din-Yin Hsieh · Mehrdad Honarkhah · Guang Cheng
- | Towards Effective Synthetic Data Sampling for Domain Adaptive Pose Estimation (Poster) | Isha Dua · Arjun Sharma · Shuaib Ahmed · Rahul Tallamraju
- | Fair Wasserstein Coresets (Oral) | Zikai Xiong · Niccolo Dalmasso · Vamsi Potluru · Tucker Balch · Manuela Veloso
- | Effective Data Augmentation With Diffusion Models (Oral) | Brandon Trabucco · Kyle Doherty · Max Gurinas · Russ Salakhutdinov
- | Continuous Diffusion for Mixed-Type Tabular Data (Poster) | Markus Mueller · Kathrin Gruber · Dennis Fok
- | Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets (Poster) | Brandon Smith · Miguel Farinha · Siobhan Mackenzie Hall · Hannah Rose Kirk · Aleksandar Shtedritski · Max Bain
- | Harnessing Synthetic Datasets: The Role of Shape Bias in Deep Neural Network Generalization (Poster) | Elior Benarous · Sotiris Anagnostidis · Luca Biggio · Thomas Hofmann
- | Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models (Oral) | Yujin Kim · Jaehong Yoon · Seonghyeon Ye · Sung Ju Hwang · Se-Young Yun
- | Learning to Place Objects into Scenes by Hallucinating Scenes around Objects (Poster) | Lu Yuan · James Hong · Vishnu Sarukkai · Kayvon Fatahalian
- | Evaluating VLMs for Property-Specific Annotation of 3D Objects (Poster) | Rishabh Kabra · Loic Matthey · Alexander Lerchner · Niloy Mitra
- | Strong statistical parity through fair synthetic data (Poster) | Ivona Krchova · Michael Platzer · Paul Tiwald
- | On the Limitation of Diffusion Models for Synthesizing Training Datasets (Poster) | Shin'ya Yamaguchi · Takuma Fukuda
- | STAR: Improving Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models (Poster) | Mingyu Derek Ma · Xiaoxuan Wang · Po-Nien Kung · P. Jeffrey Brantingham · Nanyun Peng · Wei Wang
- | Feedback-guided Data Synthesis for Imbalanced Classification (Poster) | Reyhane Askari Hemmat · Mohammad Pezeshki · Florian Bordes · Michal Drozdzal · Adriana Romero
- | Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization (Poster) | Prakamya Mishra · Zonghai Yao · Shuwei Chen · Beining Wang · Rohan Mittal · Hong Yu
- | Privacy Measurements in Tabular Synthetic Data: State of the Art and Future Research Directions (Poster) | Alexander Boudewijn · Andrea Filippo Ferraris · Daniele Panfilo · Vanessa Cocca · Sabrina Zinutti · Karel De Schepper · Carlo Chauvenet
- | On Consistent Bayesian Inference from Synthetic Data (Poster) | Ossi Räisä · Joonas Jälkö · Antti Honkela
- | Differentially Private Synthetic Data via Foundation Model APIs 1: Images (Oral) | Zinan Lin · Sivakanth Gopi · Janardhan Kulkarni · Harsha Nori · Sergey Yekhanin
- | Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models (Poster) | Nicholas Kuo · Louisa Jorm · Sebastiano Barbieri
- | Diffusion-based Semantic-Discrepant Outlier Generation for Out-of-Distribution Detection (Poster) | Suhee Yoon · Sanghyu Yoon · Hankook Lee · Sangjun Han · Ye Seul Sim · Kyungeun Lee · Hyeseung Cho · Woohyung Lim