Text-to-image synthesis models, such as DALL-E, have demonstrated an extraordinary ability to transform an input caption into a coherent image. Several recent techniques have also used large pretrained multimodal models to create artistic renditions of input captions, showing their potential to democratize art. However, these models are designed to handle only a single, brief input caption. Many text-to-image synthesis use cases require models that can handle extended narratives and metaphorical sentences, condition on existing visuals, and generate more than one image in order to capture the meaning of the input text. Prior work has built task-specific generative adversarial network (GAN) models for such settings, e.g., image-to-image translation and style transfer.
Story visualization is a challenging task that combines image generation with story understanding. The recent introduction of large pretrained transformer-based models opens the possibility of exploiting latent knowledge from large-scale pretraining to perform such specialized tasks more efficiently, in a paradigm similar to adapting pretrained language models to downstream language-understanding tasks. Accordingly, this study investigates approaches to adapt a pretrained text-to-image synthesis model to complex downstream applications, with a focus on story visualization, which turns a sequence of captions into a sequence of images that depict the story.
While previous work on story visualization has highlighted potential uses, the task poses specific challenges when applied to real-world scenarios. An agent must generate a sequence of images that faithfully depicts the contents of the set of captions that make up a story. The model is limited to the fixed set of characters, places, and events it was trained on: it does not know how to render a new character that appears in a caption at test time, and the captions alone do not contain enough information to adequately characterize that character's appearance. Therefore, for the model to generalize to new story elements, it must include a mechanism for gathering additional information about how those elements should be depicted. The authors make story visualization more suitable for these use cases by introducing a new task called story continuation.
In this work, they present a setting in which an initial scene is available, as it would be in real-world use. They introduce DiDeMoSV, a new story visualization dataset, and adapt two existing datasets, PororoSV and FlintstonesSV, to the story continuation setting. The model can then copy and adapt components of the initial scene as it generates subsequent images (see figure below). This shifts the focus away from text-to-image generation, an already heavily studied topic, and toward the narrative structure of an image sequence, such as how an image should evolve to reflect new narrative material in the captions.
To adapt a text-to-image synthesis model to this story continuation task, they first fine-tune a pretrained model (such as DALL-E) on a sequential text-to-image generation task with the added flexibility of copying from a previous input. To do this, they retrofit the model with additional layers that copy relevant output from the initial scene. Then, during the generation of each image, they integrate a self-attention block that builds story embeddings providing a global semantic context for the tale. These additional modules are learned from scratch while the model is fine-tuned on the story continuation task. They call their technique StoryDALL-E and compare it to a GAN-based model called StoryGANc.
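The copying idea can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the class name and shapes are hypothetical. It shows a newly added cross-attention layer (trained from scratch) that lets the tokens of the frame being generated attend to the tokens of the initial source scene, while leaving the pretrained layers untouched.

```python
import torch
import torch.nn as nn

class RetrofitCrossAttention(nn.Module):
    """Hypothetical sketch of the retrofitting idea: a new cross-attention
    layer lets each generated frame attend to (and hence copy from) the
    visual tokens of the initial source frame. The pretrained transformer
    layers would sit around this module and remain unchanged."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens, source_tokens):
        # frame_tokens:  (B, T, D) tokens of the frame being generated
        # source_tokens: (B, S, D) tokens of the initial scene to copy from
        out, _ = self.cross_attn(frame_tokens, source_tokens, source_tokens)
        return frame_tokens + out  # residual connection keeps pretrained flow

frame = torch.randn(2, 16, 64)    # toy token sequences
source = torch.randn(2, 16, 64)
layer = RetrofitCrossAttention(64)
fused = layer(frame, source)
print(fused.shape)  # torch.Size([2, 16, 64])
```

The residual connection is a common design choice for retrofitted modules: at initialization the pretrained model's behavior is only mildly perturbed, and the new layer gradually learns when copying from the source frame helps.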
They also explore a parameter-efficient prompt-tuning architecture, presenting a prompt made of task-specific embeddings that steers the pretrained model to generate visuals for the target domain. In this prompt-tuned version of the model, the pretrained weights are frozen and only the new parameters are learned from scratch, saving time and memory. The results suggest that their retrofitting strategy in StoryDALL-E effectively exploits the latent pretrained knowledge of DALL-E for the story continuation problem, outperforming the GAN-based model on various criteria. Additionally, they find that the copy mechanism enables improved generation in low-resource settings and for characters unseen during training.
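Prompt tuning in general can be sketched as follows; this is an illustrative example of the generic technique, not the paper's code, and all names are hypothetical. The backbone's weights are frozen, and only a small matrix of prompt embeddings, prepended to the input sequence, receives gradients.

```python
import torch
import torch.nn as nn

class PromptTunedModel(nn.Module):
    """Generic parameter-efficient prompt tuning: freeze a pretrained
    backbone and learn only a few task-specific prompt embeddings that are
    prepended to the input token embeddings."""

    def __init__(self, backbone, num_prompts, dim):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, token_embeds):         # token_embeds: (B, T, D)
        b = token_embeds.size(0)
        prompt = self.prompts.unsqueeze(0).expand(b, -1, -1)
        return self.backbone(torch.cat([prompt, token_embeds], dim=1))

# Toy stand-in for a pretrained transformer backbone
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
model = PromptTunedModel(backbone, num_prompts=8, dim=64)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # 512 — only the 8x64 prompt matrix is trainable
```

Because only the prompt matrix is updated, the optimizer state and checkpoints for the new task are tiny compared with full fine-tuning, which is the time and memory saving the article refers to.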
In summary:
- They introduce the story continuation task, which is more closely aligned with real-world downstream applications of story visualization, and present a new dataset for it.
- They present StoryDALL-E, a retrofitted adaptation of pretrained transformers for story continuation. They also build StoryGANc to serve as a strong GAN baseline for comparison.
- They perform comparative experiments and ablations to demonstrate that the fine-tuned StoryDALL-E outperforms StoryGANc on three story continuation datasets across multiple metrics.
- Their analysis demonstrates that the copy mechanism increases the correlation of the generated images with the source image, which improves the visual continuity of the story as well as the generation of low-resource and unseen characters.
The PyTorch implementation is freely available on GitHub.
This article is written as a research summary by Marktechpost Staff based on the research paper 'StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation'. All credit for this research goes to the researchers on this project. Check out the paper and the GitHub link.