Over the past few years, interest in language and image generation has grown rapidly within the community. As text generated by models like GPT-3 sounds increasingly fluent and natural, and images and videos generated by GAN models look increasingly realistic, researchers have begun to focus on qualitative properties of the generated content, such as the ability to control its style and structure, or to incorporate information from external sources into the output. These goals are essential for making language and image generation useful for human-machine interaction and other real-world applications, including machine co-creativity, entertainment, reducing biases or toxicity, and improving conversational agents and personal assistants.
Achieving these ambitious but important goals introduces challenges not only from the NLP and Vision perspectives, but also ones that pertain to Machine Learning as a whole, which has witnessed a growing body of research in relevant areas such as interpretability, disentanglement, robustness, and representation learning. We believe that progress towards human-like language and image generation may benefit greatly from insights and advances in these and other ML areas.
In this workshop, we bring together researchers from the NLP, Vision, and ML communities to discuss the current challenges and explore potential directions for controllable generation and for improving its quality, correctness, and diversity. As excitement about language and image generation has grown significantly thanks to the advent and improvement of language models, Transformers, and GANs, we feel this is an opportune time for a new workshop on the subject. We hope CtrlGen will foster discussion and interaction across communities and spark fruitful cross-domain collaborations that open the door to enhanced controllability in language and image generation.
Mon 8:00 a.m. - 8:10 a.m. | Opening Remarks (Short Intro)
We will give a brief introduction to the workshop and controllable generation overall, introduce the organizing committee members, and briefly discuss the plan and schedule for the remainder of the day.
Mon 8:10 a.m. - 8:30 a.m. | Invited Talk #1 - Control in Dialogue: When does it work? (Jason Weston)
Title: Control in Dialogue: When does it work?

Abstract: We describe various attempts to control dialogue models, including content, style, specificity, response-relatedness, and question-asking, as well as controlling gender bias and safety. Overall, we observe success in controlling attributes when the controllable skill involves surface-level features, as measured by automatic metrics and human judgments. The challenge for the future, however, is how to have this same success for harder tasks.

Bio: Jason Weston is a research scientist at Facebook, NY and a Visiting Research Professor at NYU. He earned his PhD in machine learning at Royal Holloway, University of London and at AT&T Research in Red Bank, NJ (advisors: Alex Gammerman, Volodya Vovk and Vladimir Vapnik) in 2000. From 2000 to 2001, he was a researcher at Biowulf Technologies. From 2002 to 2003 he was a research scientist at the Max Planck Institute for Biological Cybernetics, Tuebingen, Germany. From 2003 to 2009 he was a research staff member at NEC Labs America, Princeton. From 2009 to 2014 he was a research scientist at Google, NY. His interests lie in statistical machine learning, with a focus on reasoning, memory, perception, interaction and communication. Jason has published over 100 papers, including best paper awards at ICML and ECML, and a Test of Time Award for his work "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning", ICML 2008 (with Ronan Collobert). He was part of the YouTube team that won a National Academy of Television Arts & Sciences Emmy Award for Technology and Engineering for Personalized Recommendation Engines for Video Discovery. He was listed as the 16th most influential machine learning scholar at AMiner and one of the top 50 authors in Computer Science in Science.
Mon 8:30 a.m. - 8:35 a.m. | Invited Talk #1 Q&A (Short Q&A)
Please ask questions in the RocketChat!
Mon 8:35 a.m. - 8:55 a.m. | Invited Talk #2 - Disentangling Faithfulness and Extractiveness in Abstractive Summarization (He He)
Title: Disentangling Faithfulness and Extractiveness in Abstractive Summarization

Abstract: Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed various methods that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs (i.e., copying more words from the document) or from a truly better understanding of the document. In this talk, I will discuss the faithfulness-abstractiveness trade-off in summarization and a better method for evaluating faithfulness that accounts for the change in extractiveness. We then show that it is possible to mitigate the faithfulness-abstractiveness trade-off by controlling the level of extractiveness during generation.

Bio: He He is an assistant professor in the Center for Data Science and Courant Institute at New York University. Her research interests include robust language understanding, text generation and interactive NLP systems. She obtained her Ph.D. from the University of Maryland, College Park and worked as a post-doc at Stanford University before joining NYU.
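The distinction above between faithfulness and extractiveness hinges on measuring how much of a summary is copied verbatim from the source. As a rough illustration only (not the evaluation protocol from the talk), the sketch below computes an extractive-fragment coverage score: the fraction of summary tokens that belong to contiguous spans also found in the document. Whitespace tokenization and lowercasing are simplifying assumptions.

```python
def extractive_fragments(source_tokens, summary_tokens):
    """Greedily find maximal fragments of the summary that also appear verbatim in the source."""
    fragments, i = [], 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(source_tokens)):
            k = 0
            while (i + k < len(summary_tokens) and j + k < len(source_tokens)
                   and summary_tokens[i + k] == source_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1
    return fragments


def coverage(source, summary):
    """Fraction of summary tokens that lie inside fragments copied from the source."""
    src, summ = source.lower().split(), summary.lower().split()
    frags = extractive_fragments(src, summ)
    return sum(len(f) for f in frags) / max(len(summ), 1)


print(coverage("the cat sat on the mat near the door",
               "the cat sat near the door"))  # 1.0: the summary is fully extractive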
Mon 8:55 a.m. - 9:00 a.m. | Invited Talk #2 Q&A (Short Q&A)
Please ask questions in the RocketChat!
Mon 9:00 a.m. - 9:25 a.m. | Invited Talk #3 - Disentanglement for Controllable Image Generation (Irina Higgins)
Title: Disentanglement for Controllable Image Generation

Abstract: When it comes to generating diverse and plausible complex visual scenes from interpretable interfaces using deep learning, unsupervised disentangled representation learning can be very helpful. These methods can automatically discover the semantically meaningful attributes of a dataset and represent them in a human-interpretable low-dimensional representation which can be manipulated to generate a large range of new plausible visual scenes. Disentangled representations are also conducive to semantic analogy making and sample-efficient language grounding, which allows diverse language-controlled image manipulation and rendering. In this talk we will cover the strengths and limitations of the current methods for disentangled representation learning, and touch on the frontiers of this line of research, where radically new approaches are starting to emerge based on causal, physics-inspired, geometric and contrastive frameworks.

Bio: Irina is a Staff Research Scientist at DeepMind, where she works in the Frontiers team. Her work aims to bring together insights from the fields of neuroscience and physics to advance general artificial intelligence through improved representation learning. Before joining DeepMind, Irina was a British Psychological Society Undergraduate Award winner for her achievements as an undergraduate student in Experimental Psychology at Westminster University, followed by a DPhil at the Oxford Center for Computational Neuroscience and Artificial Intelligence, where she focused on understanding the computational principles underlying speech processing in the auditory brain. During her DPhil, Irina also worked on developing poker AI, applying machine learning in the finance sector, and speech recognition at Google Research.
Irina Higgins

Mon 9:25 a.m. - 9:30 a.m. | Invited Talk #3 Q&A (Short Q&A)
Please ask questions in the RocketChat!
Mon 9:30 a.m. - 9:50 a.m. | Invited Talk #4 - Neuro-Logic and Differentiable Controls (Yejin Choi)
Title: Neuro-Logic and Differentiable Controls

Abstract: The key challenge to neural language generation is that language models are essentially a mouth without a brain. In this talk, I'll discuss how we can make better lemonades out of off-the-shelf neural language models via smarter decoding-time algorithms: discrete logic integration and differentiable reasoning.

Bio: Yejin Choi is the Brett Helsel Professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington and also a senior research manager at AI2 overseeing the Mosaic project. Her research focuses on commonsense knowledge and reasoning, language grounding with vision and perception, and AI for social good. She is a co-recipient of the ACL Test of Time award in 2021, the CVPR Longuet-Higgins Prize (test of time award) in 2021, the AAAI Outstanding Paper Award (best paper award) in 2020, the Borg Early Career Award (BECA) in 2018, the inaugural Alexa Prize Challenge in 2017, IEEE AI's 10 to Watch in 2016, and the Marr Prize (best paper award) at ICCV 2013. She received her Ph.D. in Computer Science at Cornell University and BS in Computer Science and Engineering at Seoul National University in Korea.
Mon 9:50 a.m. - 9:55 a.m. | Invited Talk #4 Q&A (Short Q&A)
Please ask questions in the RocketChat!
Mon 9:55 a.m. - 10:10 a.m. | Virtual Coffee/Networking Break
Chat with fellow researchers in GatherTown!
Mon 10:10 a.m. - 11:30 a.m. | Discussion Panel and QA Session (Discussion Panel)
Discussion panel and QA session with the majority of our speakers and two additional panelists: Sebastian Gehrmann and Angela Fan. Some pre-solicited questions are available on Slido for viewing and upvoting/downvoting: https://app.sli.do/event/rmabxoqx
Feel free to (and please do) add your own questions either to Slido (using the link above) or the RocketChat!

Additional panelist biographies:

Sebastian is a researcher at Google Research working on the development and evaluation of controllable and interpretable models for language generation. His work received awards and honorable mentions at IEEE Vis '18 and the ACL '19 and NeurIPS '20 Demo Tracks. He co-organized INLG '19, the EvalNLGEval workshop at INLG '20, and the Generation, Evaluation, and Metrics workshop at ACL '21.

Angela Fan is a research scientist at FAIR Paris focusing on text generation for low-resource languages.
Mon 11:30 a.m. - 12:30 p.m. | Virtual Poster Session #1 (Poster Session)
Posters for our accepted papers will be available for viewing and discussion with the authors on GatherTown. Note that all posters (but not all authors) will be available at both poster sessions. Also, all camera-ready papers are available for viewing on our workshop website: https://ctrlgenworkshop.github.io/accepted_papers.html
Mon 12:30 p.m. - 1:30 p.m. | Lunch Break
One-hour lunch break. Feel free to join the networking GatherTown.
Mon 1:30 p.m. - 1:50 p.m. | Demonstrations (Live-Streamed Demos)
Interesting demos of controllable generation systems will be live-streamed. They are also posted on our workshop website at https://ctrlgenworkshop.github.io/accepted_demos.html
Mon 1:50 p.m. - 2:10 p.m. | Invited Talk #5 - Off the Beaten Path: Domain-Agnostic ML for Controllable Generation and Beyond (Alex Tamkin)
Title: Off the Beaten Path: Domain-Agnostic ML for Controllable Generation and Beyond

Abstract: In many fields of machine learning, the diversity of data domains studied by researchers is significantly narrower than the diversity of domains in the real world. This has two disadvantages: 1) existing methods are domain-specific, and fail to serve many impactful domains, including medical and scientific applications, and 2) failure to examine a broader diversity of data makes it challenging to uncover broader principles underpinning the success of methods across domains. In this talk, I will discuss some of our work on developing machine learning techniques that operate on a wider diversity of data, including a new modeling framework (viewmaker networks) and benchmark (DABS) for self-supervised learning. I will then turn to controllable generation, discussing our work on controllable generation of molecular edits (C5T5), which leverages techniques from both the NLP and drug design communities. I will conclude by discussing future directions and opportunities for domain-agnostic ML in controllable generation and beyond.

Bio: Alex is a fourth-year PhD student in Computer Science at Stanford, advised by Noah Goodman and part of the Stanford NLP Group. His research focuses on better understanding, building, and controlling pretrained models, especially in domain-agnostic and multimodal settings. He is supported by an Open Philanthropy AI Fellowship, and has also spent time at Google Brain and Google Language.
Mon 2:10 p.m. - 2:15 p.m. | Invited Talk #5 Q&A (Short Q&A)
Please ask questions in the RocketChat!
Mon 2:15 p.m. - 2:35 p.m. | Invited Talk #6 - Generating and Editing Images Using StyleGAN and CLIP (Or Patashnik)
Title: Generating and Editing Images Using StyleGAN and CLIP

Abstract: Recently, there has been an increased interest in leveraging the semantic power of large-scale Contrastive Language-Image Pre-training (CLIP) models. Specifically, combining the power of CLIP with the generative power of StyleGAN has led to novel text-driven methods with unprecedented generative performance.

Bio: Or Patashnik is a graduate student in the School of Computer Science at Tel Aviv University, under the supervision of Daniel Cohen-Or. Her research is about image generation tasks such as image-to-image translation, image editing, etc.
Or Patashnik
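For readers unfamiliar with the StyleGAN+CLIP recipe the talk refers to, the sketch below shows the common text-driven editing loop: optimize a StyleGAN latent code so the generated image moves toward a text prompt under CLIP, while an L2 term keeps the edit close to the original image. This is a generic illustration, not the speaker's released code; `load_pretrained_stylegan`, `get_latent_of_source_image`, and `clip_preprocess_tensor` are hypothetical placeholders for a generator, a starting latent, and CLIP's image preprocessing.

```python
import torch
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical placeholders: a StyleGAN generator mapping w -> image, the latent of the
# image being edited, and a resize/normalize step matching CLIP's expected input.
G = load_pretrained_stylegan()
w_init = get_latent_of_source_image()

text = clip.tokenize(["a smiling face"]).to(device)
with torch.no_grad():
    text_emb = clip_model.encode_text(text)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

w = w_init.clone().requires_grad_(True)
opt = torch.optim.Adam([w], lr=0.05)

for step in range(200):
    img = G(w)                                            # image from the current latent
    img_emb = clip_model.encode_image(clip_preprocess_tensor(img))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    clip_loss = 1.0 - (img_emb * text_emb).sum()          # cosine distance to the prompt
    locality = (w - w_init).pow(2).mean()                 # keep the edit close to the source image
    loss = clip_loss + 0.1 * locality
    opt.zero_grad()
    loss.backward()
    opt.step()
```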
Mon 2:35 p.m. - 2:40 p.m. | Invited Talk #6 Q&A (Short Q&A)
Please ask questions in the RocketChat!
Mon 2:40 p.m. - 3:00 p.m. | Virtual Coffee/Networking Break
Chat with fellow researchers in GatherTown!
Mon 3:00 p.m. - 4:20 p.m. | Virtual Poster Session #2 (Poster Session)
Posters for our accepted papers will be available for viewing and discussion with the authors on GatherTown. Note that all posters (but not all authors) will be available at both poster sessions. Also, all camera-ready papers are available for viewing on our workshop website: https://ctrlgenworkshop.github.io/accepted_papers.html
Mon 4:20 p.m. - 4:40 p.m. | Invited Talk #7 - Controllable Text Generation with Multiple Constraints (Yulia Tsvetkov)
Title: Controllable Text Generation with Multiple Constraints

Abstract: Conditional language generation models produce highly fluent but often unreliable outputs. This has motivated a surge of approaches to controlling various attributes of the text that models generate. However, the majority of existing approaches are focused on monolingual settings and on controlling coarse-grained attributes of text (typically, only one binary attribute). This talk will propose to focus on finer-grained aspects of the generated texts, including in multilingual settings. I will present an algorithm for controllable inference from pretrained models, which aims at rewriting model outputs with multiple sentence-level, fine-grained, monolingual and cross-lingual constraints. I will conclude with a discussion of future work.

Bio: Yulia Tsvetkov is an assistant professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington. Her research group works on NLP for social good, multilingual NLP, and language generation. The projects are motivated by a unified goal: to extend the capabilities of human language technology beyond individual populations and across language boundaries, thereby enabling NLP for diverse and disadvantaged users, the users that need it most. Prior to joining UW, Yulia was an assistant professor at Carnegie Mellon University and a postdoc at Stanford. Yulia is a recipient of the Okawa research award, Amazon machine learning research award, Google faculty research award, and multiple NSF awards.
Yulia Tsvetkov

Mon 4:40 p.m. - 4:45 p.m. | Invited Talk #7 Q&A (Short Q&A)
Please ask questions in the RocketChat!
Mon 4:45 p.m. - 5:00 p.m. | Best Paper Awards and Closing Remarks (Closing Remarks)
We will present the best paper awards and conclude the workshop. Thanks for attending!
Mon 5:00 p.m. - 12:00 a.m. | GatherTown Open for Continued Socializing (Networking and Socializing)
Continue chatting with other researchers in GatherTown!
Sound-Guided Semantic Image Manipulation (Poster)
Semantically meaningful image manipulation often involves laborious manual human examination for each desired manipulation. Recent success suggests that leveraging the representation power of existing Contrastive Language-Image Pre-training (CLIP) models with the generative power of StyleGAN can successfully manipulate a given image driven by textual semantics. Following this, we explore adding a new modality, Sound, which can convey a different view of dynamic semantic information and thus can reinforce control strength over the semantic image manipulation. Our audio encoder is trained to produce a latent representation from an audio input, which is forced to be aligned with image and text representations in the same CLIP embedding space. Given such aligned embeddings, we use a direct latent optimization method so that an input image is modified in response to a user-provided sound input. We quantitatively and qualitatively demonstrate the effectiveness of our approach, and we observe our sound-guided image manipulation approach can produce semantically meaningful images. |
SEUNG HYUN LEE · Sang Ho Yoon · Jinkyu Kim · Sangpil Kim
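A minimal sketch of the kind of alignment objective the abstract describes: an audio encoder is trained so that each clip's embedding lands near its paired image/text CLIP embeddings and away from the other pairs in a batch. The symmetric InfoNCE form and the averaging of image/text targets are assumptions made for illustration; the paper's exact loss may differ. Once aligned, the audio embedding can drive a latent-optimization edit analogous to the StyleGAN+CLIP loop sketched for Invited Talk #6.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(audio_emb, image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling each audio clip towards its paired image/text CLIP
    embeddings and pushing it away from the other pairs in the batch.
    Averaging the image and text targets is an illustrative assumption."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    targets = F.normalize((image_emb + text_emb) / 2, dim=-1)
    logits = audio_emb @ targets.t() / temperature            # [batch, batch] similarity matrix
    labels = torch.arange(audio_emb.size(0), device=audio_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```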
PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided Decoding (Poster)
Large pre-trained language models (LMs) based on Transformers make it possible to generate very plausible long texts. In this paper, we explore how this generation can be further controlled to satisfy certain constraints (e.g., being non-toxic, conveying positive or negative sentiment, certain emotions, etc.) without fine-tuning the LM. More precisely, we formalize constrained generation as a tree exploration process guided by a discriminator according to how well the associated sequence respects the constraint. Using a discriminator to guide this generation, rather than fine-tuning the LM, is not only easier and cheaper to train but also allows the constraint to be applied more finely and dynamically. We propose several original methods to search this generation tree, notably Monte Carlo Tree Search (MCTS), which provides theoretical guarantees on the search efficiency, but also simpler methods based on re-ranking a pool of diverse sequences using the discriminator scores. We evaluate these methods on two types of constraints and languages: review polarity and emotion control in French and English. We show that MCTS achieves state-of-the-art results in constrained generation, without having to tune the language model, in both tasks and languages. We also demonstrate that our other proposed methods based on re-ranking can be very effective when diversity among the generated propositions is encouraged.
Antoine Chaffin · Vincent Claveau · Ewa Kijak
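Of the two families of methods in the abstract, the re-ranking baseline is the simplest to sketch: sample a diverse pool of continuations and keep the one the constraint discriminator scores highest. The snippet below assumes a Hugging Face-style `generate` API and a `discriminator` callable returning a constraint score for a text; it does not implement the MCTS variant, which instead uses those scores to guide tree search.

```python
import torch

def rerank_by_constraint(lm, discriminator, tokenizer, prompt,
                         n_samples=16, max_new_tokens=40):
    """Sample a pool of diverse continuations from the language model, then keep the one
    the constraint discriminator prefers (a re-ranking baseline, not the MCTS method)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = lm.generate(**inputs, do_sample=True, top_p=0.95,
                          num_return_sequences=n_samples, max_new_tokens=max_new_tokens)
    texts = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    with torch.no_grad():
        scores = [discriminator(t) for t in texts]   # assumed: returns P(constraint | text)
    return texts[max(range(n_samples), key=lambda i: scores[i])]
```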
Hamiltonian prior to Disentangle Content and Motion in Image Sequences (Poster)
We present a deep latent variable model for high dimensional sequential data. Our model factorises the latent space into content and motion variables. To model the diverse dynamics, we split the motion space into subspaces, and introduce a unique Hamiltonian operator for each subspace. The Hamiltonian formulation provides reversible dynamics that learn to constrain the motion path to conserve invariant properties. The explicit split of the motion space decomposes the Hamiltonian into symmetry groups and gives long-term separability of the dynamics. This split also means representations can be learnt that are easy to interpret and control. We demonstrate the utility of our model for swapping the motion of two videos, generating sequences of various actions from a given image and unconditional sequence generation. |
Asif Khan · Amos Storkey
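The reversible dynamics mentioned in the abstract are typically rolled out with a symplectic integrator. The sketch below shows a generic leapfrog step over a position/momentum split of a motion latent; the gradient callables for the learned Hamiltonian are hypothetical, and the paper's per-subspace operators are more structured than this.

```python
import torch

def leapfrog(q, p, grad_H_q, grad_H_p, step=0.1, n_steps=10):
    """Roll out latent motion with a leapfrog integrator. The update is symplectic and
    time-reversible, so quantities conserved by the Hamiltonian stay (approximately)
    invariant along the trajectory.
    q, p: position-like and momentum-like halves of a motion latent.
    grad_H_q, grad_H_p: assumed callables returning dH/dq and dH/dp of a learned Hamiltonian."""
    trajectory = [q]
    for _ in range(n_steps):
        p = p - 0.5 * step * grad_H_q(q)   # half step in momentum
        q = q + step * grad_H_p(p)         # full step in position
        p = p - 0.5 * step * grad_H_q(q)   # second half step in momentum
        trajectory.append(q)
    return torch.stack(trajectory)
```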
Controllable Paraphrase Generation with Multiple Types of Constraints (Poster)
One of the current challenges in paraphrase generation is the ability to enforce linguistic constraints on the desired output. These constraints may relate to the length of the output sentence, its syntactic structure, the presence of specific words, etc. While recent works focus on these constraints in isolation, this paper studies a variety of constraints imposed separately or in combination with one another. These constraints cover different linguistic factors (surface words, syntax and semantics) of the input sequence, the output sequence, or both, and different data shapes (scalar value, sequence and tree). The constraints are integrated into a paraphrase generation process using an attention-based encoder-decoder model trained and evaluated on the ParaNMT-50M corpus. The results show that the constraints are well respected by the models and that they help improve the quality of the produced paraphrases. This multiple-constraint-driven model opens a new window for controllable paraphrase generation. The code is publicly available: https://gitlab.inria.fr/expression/tremolo/controllable-paraphrase-generation
Gwénolé Lecorvé

Controlled Cue Generation for Play Scripts (Poster)
In this paper, we use a large-scale play scripts dataset to propose the novel task of theatrical cue generation from dialogues. Using over one million lines of dialogue and cues, we approach the problem of cue generation as a controlled text generation task, and show how cues can be used to enhance the impact of dialogue using a language model conditioned on a dialogue/cue discriminator. In addition, we explore the use of topic keywords and emotions for controlled text generation. Extensive quantitative and qualitative experiments show that language models can be successfully used to generate plausible and attribute-controlled texts in highly specialised domains such as play scripts. |
Alara Dirik · Hilal Dönmez · Pinar Yanardag

Controlling Conditional Language Models with Distributional Policy Gradients (Poster)
Machine learning is shifting towards general-purpose pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a large number of tasks. However, due to their generic training methodology, these models often fail to meet some of the downstream requirements (e.g., hallucination in abstractive summarization or wrong format in automatic code generation). This raises an important question: how can pre-trained generative models be adapted to a new task without destroying their capabilities? Recent work has suggested solving this problem by representing task-specific requirements through energy-based models (EBMs) and approximating these EBMs using distributional policy gradients (DPG). Unfortunately, this approach is limited to unconditional distributions, represented by unconditional EBMs. In this paper, we extend this approach to conditional tasks by proposing Conditional DPG (CDPG). We evaluate CDPG on three different control objectives across two tasks: summarization with T5 and code generation with GPT-Neo. Our results show that fine-tuning using CDPG robustly moves these pretrained models closer towards meeting control objectives and, in contrast with baseline approaches, does not result in catastrophic forgetting.
Tomasz Korbak · Hady Elsahar · Germán Kruszewski · Marc Dymetman
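For intuition, the DPG family of updates the abstract builds on can be read as an importance-weighted maximum-likelihood step: sample continuations from a proposal model, weight each by how much the target EBM prefers it relative to the proposal, and increase the policy's log-likelihood accordingly. The sketch below is a schematic conditional version with assumed `policy`, `proposal`, and `ebm_score` interfaces, not the authors' implementation.

```python
import torch

def cdpg_step(policy, proposal, ebm_score, contexts, optimizer, n_samples=8):
    """One schematic (Conditional) Distributional Policy Gradient update.
    Assumed interfaces: proposal.sample(c, n) returns continuations, proposal.log_prob(c, x)
    and policy.log_prob(c, x) return log-probabilities, ebm_score(c, x) returns the log of
    the unnormalized target P(x | c)."""
    loss = 0.0
    for c in contexts:
        for x in proposal.sample(c, n_samples):
            with torch.no_grad():
                w = torch.exp(ebm_score(c, x) - proposal.log_prob(c, x))  # importance weight P/q
            loss = loss - w * policy.log_prob(c, x)   # weighted maximum-likelihood objective
    loss = loss / (len(contexts) * n_samples)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```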
Towards Unsupervised Content Disentanglement in Sentence Representations via Syntactic Roles (Poster)
Linking neural representations to linguistic factors is crucial in order to build and analyze NLP models interpretable by humans. Among these factors, syntactic roles (e.g. subjects, direct objects, ...) and their realizations are essential markers since they can be understood as a decomposition of predicative structures and thus the meaning of sentences. Starting from a deep probabilistic generative model with attention, we measure the interaction between latent variables and realizations of syntactic roles, and show that it is possible to obtain, without supervision, representations of sentences where different syntactic roles correspond to clearly identified different latent variables. The probabilistic model we propose is an Attention-Driven Variational Autoencoder (ADVAE). Drawing inspiration from Transformer-based machine translation models, ADVAEs enable the analysis of the interactions between latent variables and input tokens through attention. We also develop an evaluation protocol to measure disentanglement with regard to the realizations of syntactic roles. This protocol is based on attention maxima for the encoder and on disturbing individual latent variables for the decoder. Our experiments on raw English text from the SNLI dataset show that i) disentanglement of syntactic roles can be induced without supervision, ii) ADVAE separates more syntactic roles than classical sequence VAEs, iii) realizations of syntactic roles can be separately modified in sentences by mere intervention on the associated latent variables. Our work constitutes a first step towards unsupervised controllable content generation. The code for our work is publicly available. |
Ghazi FELHI · Joseph Roux · Djame Seddah

Diamond in the rough: Improving image realism by traversing the GAN latent space (Poster)
In just a few years, the photo-realism of images synthesized by Generative Adversarial Networks (GANs) has gone from somewhat reasonable to almost perfect largely by increasing the complexity of the networks, e.g., adding layers, intermediate latent spaces, style-transfer parameters, etc. This trajectory has led many of the state-of-the-art GANs to be inaccessibly large, disengaging many without large computational resources. Recognizing this, we explore a method for squeezing additional performance from existing, low-complexity GANs. Formally, we present an unsupervised method to find a direction in the latent space that aligns with improved photo-realism. Our approach leaves the network unchanged while enhancing the fidelity of the generated image. We use a simple generator inversion to find the direction in the latent space that results in the smallest change in the image space. Leveraging the learned structure of the latent space, we find moving in this direction corrects many image artifacts and presents a more realistic image. We verify our findings qualitatively and quantitatively, showing an improvement in Frechet Inception Distance (FID) exists along our trajectory which surpasses the original GAN and other approaches including a supervised method. We expand further and provide an optimization method to automatically select latent vectors along the path that balance the variation and realism of samples. We apply our method to several diverse datasets and three architectures of varying complexity to illustrate the generalizability of our approach. By expanding the utility of low-complexity and existing networks, we hope to encourage the democratization of GANs. |
Jeffrey Wen · Fabian Benitez-Quiroz · Qianli Feng · Aleix Martinez

Sampling from Discrete Energy-Based Models with Quality/Efficiency Trade-offs (Poster)
Energy-Based Models (EBMs) allow for extremely flexible specifications of probability distributions. However, they do not provide a mechanism for obtaining exact samples from these distributions. Monte Carlo techniques can aid us in obtaining samples if some proposal distribution that we can easily sample from is available. For instance, rejection sampling can provide exact samples but is often difficult or impossible to apply due to the need to find a proposal distribution that upper-bounds the target distribution everywhere. Approximate Markov chain Monte Carlo sampling techniques like Metropolis-Hastings are usually easier to design, exploiting a local proposal distribution that performs local edits on an evolving sample. However, these techniques can be inefficient due to the local nature of the proposal distribution and do not provide an estimate of the quality of their samples. In this work, we propose a new approximate sampling technique, Quasi Rejection Sampling (QRS), that allows for a trade-off between sampling efficiency and sampling quality, while providing explicit convergence bounds and diagnostics. QRS capitalizes on the availability of high-quality global proposal distributions obtained from deep learning models. We demonstrate the effectiveness of QRS sampling for discrete EBMs over text for the tasks of controlled text generation with distributional constraints and paraphrase generation. We show that we can sample from such EBMs with arbitrary precision at the cost of sampling efficiency. |
Bryan Eikema · Germán Kruszewski · Hady Elsahar · Marc Dymetman
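The quality/efficiency trade-off described above can be pictured in a few lines: draw from a global proposal and accept with probability min(1, P(x) / (beta * q(x))). With beta large enough to upper-bound P/q everywhere this reduces to exact rejection sampling; a smaller beta accepts more samples at the cost of some residual bias. The callables below are user-supplied assumptions, not the paper's code.

```python
import math
import random

def qrs_sample(propose, log_p, log_q, beta, max_tries=10_000):
    """Quasi Rejection Sampling sketch.
    propose(): draws a candidate from the proposal; log_p(x): log of the unnormalized
    target P(x); log_q(x): log proposal density; beta: the quality/efficiency knob."""
    for _ in range(max_tries):
        x = propose()
        accept_logprob = min(0.0, log_p(x) - math.log(beta) - log_q(x))
        if random.random() < math.exp(accept_logprob):
            return x
    raise RuntimeError("no sample accepted; lower beta or improve the proposal")
```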
XCI-Sketch: Extraction of Color Information from Images for Generation of Colored Outlines and Sketches (Poster)
Sketches are a medium to convey a visual scene from an individual's creative perspective. The addition of color substantially enhances the overall expressivity of a sketch. This paper proposes two methods to mimic human-drawn colored sketches by utilizing the Contour Drawing Dataset. Our first approach renders colored outline sketches by applying image processing techniques aided by k-means color clustering. The second method uses a generative adversarial network to develop a model that can generate colored sketches from previously unobserved images. We assess the results obtained through quantitative and qualitative evaluations. |
V MANUSHREE · Sameer Saxena · Parna Chowdhury · MANISIMHA VARMA MANTHENA · Harsh Rathod · Ankita Ghosh · Sahil Khose

Continuous Emotion Transfer Using Kernels (Poster)
Style transfer is a central problem of machine learning with numerous successful applications. In this work, we present a novel style transfer framework building upon infinite task learning and vector-valued reproducing kernel Hilbert spaces. We consider style transfer as a functional output regression task where the goal is to transform the input objects to a continuum of styles. The learnt mapping is governed by the choice of two kernels, one on the object space and one on the style space, providing flexibility to the approach. We instantiate the idea in emotion transfer where facial landmarks play the role of objects and styles correspond to emotions. The proposed approach provides a principled way to gain explicit control over the continuous style space, allowing to transform landmarks to emotions not seen during the training phase. We demonstrate the efficiency of the technique on popular facial emotion benchmarks, achieving low reconstruction cost. |
Alex Lambert · Sanjeel Parekh · Zoltan Szabo · Florence d'Alché-Buc

Self-supervised Enhancement of Latent Discovery in GANs (Poster)
Several methods for discovering interpretable directions in the latent space of pre-trained GANs have been proposed. Latent semantics discovered by unsupervised methods are relatively less disentangled than those discovered by supervised methods, since unsupervised methods do not use pre-trained attribute classifiers. We propose the Scale Ranking Estimator (SRE), which is trained using self-supervision. SRE enhances the disentanglement in directions obtained by existing unsupervised disentanglement techniques. These directions are updated to preserve the ordering of variation within each direction in latent space. Qualitative and quantitative evaluation of the discovered directions demonstrates that our proposed method significantly improves disentanglement in various datasets. We also show that the learned SRE can be used to perform attribute-based image retrieval without further training.
ADARSH KAPPIYATH · Silpa Vadakkeeveetil Sreelatha

Learning to Compose Visual Relations (Poster)
The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only the description of the underlying objects and their associated relations. While there has been significant work on designing deep neural networks which may compose individual objects together, less work has been done on composing the individual relations between objects. A principal difficulty is that while the placement of objects is mutually independent, their relations are entangled and dependent on each other. To circumvent this issue, existing works primarily compose relations by utilizing a holistic encoder, in the form of text or graphs. In this work, we instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully. We further show that decomposition enables our model to effectively understand the underlying relational scene structure. |
Nan Liu · Shuang Li · Yilun Du
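The factorized composition described above amounts to summing the energies of the individual relation EBMs and sampling from the combined model, for example with Langevin dynamics. The sketch below is a schematic version with placeholder energy functions and illustrative step sizes, not the paper's training or sampling configuration.

```python
import torch

def compose_and_sample(energy_fns, img_shape, n_steps=60, step_size=10.0, noise_scale=0.005):
    """Compose several relational constraints by summing their energies and draw an image
    with Langevin dynamics: x <- x - (step/2) * grad_x sum_i E_i(x) + noise.
    Each energy_fn is a placeholder for a trained relation-conditioned EBM returning a scalar."""
    x = torch.rand(img_shape, requires_grad=True)
    for _ in range(n_steps):
        energy = sum(e(x) for e in energy_fns)         # factorized composition: add the energies
        grad, = torch.autograd.grad(energy, x)
        with torch.no_grad():
            x = x - 0.5 * step_size * grad + noise_scale * torch.randn_like(x)
            x = x.clamp(0, 1)
        x.requires_grad_(True)
    return x.detach()
```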
Learning Representations for Zero-Shot Image Generation without Text (Poster)
DALL-E has shown an impressive ability to generate novel --- significantly and systematically different from the training distribution --- yet realistic images. This is possible because it utilizes the dataset of text-image pairs where the text provides the source of compositionality. Following this result, an important extending question is whether this compositionality can still be achieved even without conditioning on text. In this paper, we propose a simple but novel slot-based autoencoding architecture, called SLATE, that achieves this text-free DALL-E by learning compositional slot-based representations purely from images, an ability lacking in DALL-E. Unlike existing object-centric representation models that decode pixels independently for each slot and each pixel location and compose them via mixture-based alpha composition, we propose to use the Image GPT decoder conditioned on the slots for a more flexible generation by capturing complex interaction among the pixels and the slots. In experiments, we show that this simple architecture achieves zero-shot generation of novel images without text and better quality in generation than the models based on mixture decoders. |
Gautam Singh · Fei Deng · Sungjin Ahn

C^3: Contrastive Learning for Cross-domain Correspondence in Few-shot Image Generation (Poster)
Few-shot image generation is the task of generating high-quality and diverse images well fitted to the target domain. The generative model should adapt from the source domain to the target domain given only a few images. Despite recent progress in generative models, cutting-edge generative models (e.g., GANs) still struggle to synthesize high-quality and diverse images in the few-shot setting. One of the biggest hurdles is that the number of images from the target domain is too small to approximate the true distribution of the target domain. An effective few-shot adaptation approach is therefore required to address this problem. In this paper, we propose a simple yet effective method, C^3: Contrastive Learning for Cross-domain Correspondence. C^3 constructs positive and negative pairs of images from the two domains and makes the generative model learn the cross-domain correspondence (i.e., the semantic mapping from the source domain to the target domain) explicitly via contrastive learning. As a result, our proposed method generates more realistic and diverse images compared to the baseline methods and outperforms the state-of-the-art approaches on photorealistic and non-photorealistic domains.
Hyukgi Lee · Gi-Cheon Kang · Chang-Hoon Jeong · Hanwool Sul · Byoung-Tak Zhang
MIDI-DDSP: Hierarchical Modeling of Music for Detailed Control (Poster)
Musical expression requires control of both which notes are played and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation). This creates a three-level hierarchy (notes, performance, synthesis) that affords individuals the option to intervene at each level, or to utilize trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and, as a complete system, generate realistic audio from a novel note sequence. By utilizing an interpretable hierarchy with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools that empower individuals across a diverse range of musical experience.
Yusong Wu · Ethan Manilow · Kyle Kastner · Tim Cooijmans · Aaron Courville · Cheng-Zhi Anna Huang · Jesse Engel

Neural Abstructions: Abstractions that Support Construction for Grounded Language Learning (Poster)
Although virtual agents are increasingly situated in environments where natural language is the most effective mode of interaction with humans, these exchanges are rarely used as an opportunity for learning. Leveraging language interactions effectively requires addressing limitations in the two most common approaches to language grounding: semantic parsers built on top of fixed object categories are precise but inflexible, while end-to-end models are maximally expressive but fickle and opaque. Our goal is to develop a system that balances the strengths of each approach so that users can teach agents new instructions that generalize broadly from a single example. We introduce the idea of neural abstructions: a set of constraints on the inference procedure of a label-conditioned generative model that can affect the meaning of the label in context. Starting from a core programming language that operates over abstructions, users can define increasingly complex mappings from natural language to actions. We show that with this method a user population is able to build a semantic parser for an open-ended house modification task in Minecraft. The resulting semantic parser is both flexible and expressive: the percentage of utterances sourced from redefinitions increases steadily over the course of 191 total exchanges, reaching a final value of 28%.
Kaylee Burns · Christopher D Manning · Li Fei-Fei

Fair Data Generation using Language Models with Hard Constraints (Poster)
Natural language text generation has seen significant improvements with the advent of pre-trained language models. Using such language models to predict personal data entities, in place of redacted spans in text, could help generate synthetic datasets. In order to address privacy and ethical concerns with such datasets, we need to ensure that the masked entity predictions are also fair and controlled by application-specific constraints. We introduce new ways to inject hard constraints and knowledge into language models that address such concerns and also improve performance on this task.
SK Mainul Islam · Abhinav Nagpal · Balaji Ganesan · Pranay Lohia
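One simple way to picture a hard constraint on masked entity prediction is to restrict a masked LM's output distribution to an application-specific allow-list before taking the argmax. The sketch below does this with a standard Hugging Face masked LM; the allow-list mechanism and the example values are illustrative assumptions, not the paper's specific constraint-injection method.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def fill_mask_with_constraint(text, allowed_values):
    """Predict a redacted entity with a masked LM, but only allow probability mass on an
    application-specific allow-list (a hard constraint)."""
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    allowed_ids = [tokenizer.convert_tokens_to_ids(t) for t in allowed_values]
    masked = torch.full_like(logits, float("-inf"))
    masked[allowed_ids] = logits[allowed_ids]      # everything outside the allow-list is excluded
    return tokenizer.convert_ids_to_tokens(int(masked.argmax()))

print(fill_mask_with_constraint("She was born in [MASK].", ["paris", "berlin", "tokyo"]))
```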
Robust Text Generation using Sequence-to-Sequence Pre-Training (Poster)
Large Transformer-based models have shown great performance in sequence-to-sequence tasks such as machine translation, text summarization, etc. While these models perform well on the original task they have been trained on, it is hard to use them for a new but related task. We propose CASPer, a framework to perturb the input-output behavior of the original pre-trained sequence-to-sequence model. CASPer learns a perturbation parameter at test time to modify the behavior of the pre-trained model and generates samples that have the target characteristics. We apply this framework to a pre-trained text summarization model to alter a given input text such that the generated text has a changed sentiment or other attributes. In experiments, we show that CASPer effectively generates controlled text that preserves the original content, is fluent and diverse, and follows the steering provided by the attribute model. We also show that the generated text from CASPer can be used for effective data augmentation for a downstream task.
Nishtha Madaan · Srikanta Bedathur

LUMINOUS: Indoor Scene Generation for Embodied AI Challenges (Poster)
Learning-based methods for training embodied agents typically require a large number of high-quality scenes that contain realistic layouts and support meaningful interactions. However, current simulators for Embodied AI (EAI) challenges only provide simulated indoor scenes with a limited number of layouts. This paper presents LUMINOUS, the first research framework that employs state-of-the-art indoor scene synthesis algorithms to generate large-scale simulated scenes for Embodied AI challenges. Further, we automatically and quantitatively evaluate the quality of generated indoor scenes via their ability to support complex household tasks. LUMINOUS incorporates a novel scene generation algorithm (Constrained Stochastic Scene Generation (CSSG)), which achieves competitive performance with human-designed scenes. Within LUMINOUS, the EAI task executor, task instruction generation module, and video rendering toolkit can collectively generate a massive multimodal dataset of new scenes for the training and evaluation of Embodied AI agents. Extensive experimental results demonstrate the effectiveness of the data generated by LUMINOUS, enabling the comprehensive assessment of embodied agents on generalization and robustness. |
Yizhou Zhao · Kaixiang Lin · Zhiwei Jia · Qiaozi Gao · Govindarajan Thattai · Jesse Thomason · Gaurav Sukhatme