📯CVPR 2024 | Dynamic Prompt Optimizing for Text-to-Image Generation
Contribution
- Dynamic fine-control prompt editing framework
- Effective results
  - improves image aesthetics
  - ensures semantic consistency between prompts and generated images
  - aligns more closely with human preferences
- Insightful findings
  - artist names and texture-related modifiers enhance the artistic quality of generated images
  - it is more effective to introduce these terms in the latter half of the diffusion process
  - assigning a lower weight to complex terms promotes more balanced image generation
Task
Input
- a pre-trained text-to-image generative model
- user input text
Output
- a modified prompt with fine-grained control, so that the generated image exhibits enhanced visual effects while remaining faithful to the semantics of the initial prompt.
The modified prompt is formed by an append operation on the initial prompt.
Method
Dynamic Fine-control Prompt, DF-Prompt
Each token is coupled with an effect range and a specific weight, resulting in a triple (token, effect range, weight).
- The weight scales the token embeddings, controlling the overall influence of the token during generation.
- The effect range is a normalized range that delineates the start and end steps within the iterative denoising process.
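As a rough illustration of the triple described above, the sketch below models one DF-Prompt entry as a small record and looks up the weight that applies at a given denoising step. The names `DFToken` and `active_weight`, the example values, and the convention that a token outside its effect range contributes weight 0 are assumptions made for this example, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DFToken:
    """One DF-Prompt entry: a token, its weight, and its normalized effect range."""
    token: str
    weight: float = 1.0
    start: float = 0.0  # normalized start of the effect range, in [0, 1]
    end: float = 1.0    # normalized end of the effect range, in [0, 1]


def active_weight(entry: DFToken, step: int, total_steps: int) -> float:
    """Return the weight applied to `entry` at a denoising step (0.0 if inactive)."""
    progress = step / total_steps  # fraction of the denoising process completed
    if entry.start <= progress <= entry.end:
        return entry.weight
    return 0.0  # assumption: tokens outside their range are switched off


# Hypothetical prompt: the style modifier only acts in the latter half.
prompt = [
    DFToken("a castle on a hill"),
    DFToken("oil painting", weight=0.75, start=0.5, end=1.0),
]
weights_early = [active_weight(e, step=10, total_steps=50) for e in prompt]
weights_late = [active_weight(e, step=40, total_steps=50) for e in prompt]
```

Here the plain prompt stays active throughout, while the modifier is only applied once the process is at least halfway done, which mirrors the finding that such terms work better in the latter half of diffusion.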
Define
Overview
Stage 1: Plain Prompt Refinement
Given
- plain input prompt
Predict
- suffix modifiers, one by one, until the model outputs the stop sign.
- i.e. construct the refined prompt by appending the predicted modifiers to the plain prompt.
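The prediction loop in Stage 1 can be sketched as follows. This is only an outline: `next_modifier` stands in for the fine-tuned language model, `toy_model` is a canned placeholder, `None` plays the role of the stop sign, and joining modifiers with commas is an assumption of this example.

```python
from typing import Callable, Optional

STOP = None  # stand-in for the model's stop sign


def refine_prompt(
    plain_prompt: str,
    next_modifier: Callable[[str], Optional[str]],
    max_modifiers: int = 10,
) -> str:
    """Append predicted modifiers one by one until the stop sign appears."""
    prompt = plain_prompt
    for _ in range(max_modifiers):  # safety cap in case the model never stops
        modifier = next_modifier(prompt)
        if modifier is STOP:
            break
        prompt = f"{prompt}, {modifier}"
    return prompt


def toy_model(prompt: str) -> Optional[str]:
    """Placeholder predictor: emits two canned modifiers, then stops."""
    canned = ["highly detailed", "soft lighting"]
    used = prompt.count(",")  # number of modifiers appended so far
    return canned[used] if used < len(canned) else STOP


refined = refine_prompt("a cat on a sofa", toy_model)
```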
Data Selection (Construct Dataset)
- start with a prompt taken from publicly available prompts.
- the prompt is split at a division point (its first comma).
- obtain the short prompt (the text before the division point).
- the remaining tokens form the modifier set.
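The split described above is simple enough to sketch directly. The function name `split_prompt` is hypothetical, and treating each comma-separated phrase after the division point as one modifier is an assumption of this example.

```python
from typing import List, Tuple


def split_prompt(prompt: str) -> Tuple[str, List[str]]:
    """Split a prompt at its first comma into (short prompt, modifier list)."""
    head, sep, tail = prompt.partition(",")  # division point = first comma
    short_prompt = head.strip()
    # Without a comma there is no division point, so there are no modifiers.
    modifiers = [m.strip() for m in tail.split(",") if m.strip()] if sep else []
    return short_prompt, modifiers


short, mods = split_prompt("a cat on a sofa, highly detailed, soft lighting")
```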
Define a confidence score whose terms are:
- an image-text relevance measure, computed with a pre-trained CLIP model
- a function that returns the aesthetic score
- a tolerance constant
Dataset
Fine-tuning
teacher forcing method
loss on the next token
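For reference, a next-token loss under teacher forcing is typically a mean cross-entropy over positions, where each position is conditioned on the ground-truth prefix. The sketch below uses plain Python lists and toy logits; the function name and the numbers are illustrative assumptions, and the notes do not specify the exact loss used in the paper.

```python
import math
from typing import Sequence


def next_token_loss(
    logits: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """Mean cross-entropy of the target next tokens given per-position logits."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]  # equals -log softmax(row)[target]
    return total / len(targets)


# Teacher forcing: the logits at position t are computed from the ground-truth
# tokens up to t, and targets[t] is the ground-truth token at position t + 1.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # toy values, vocabulary of size 3
targets = [0, 2]
loss = next_token_loss(logits, targets)
```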
Stage 2: DF-Prompt Generation
Given
- initial text
Output:
- DF-Prompt
Method: online reinforcement learning
- initial state: the initial text.
- action space: tripartite
  - word space
  - discrete time range space
  - discrete weight space
- policy: policy model
At each step of online exploration, the model selects an action in accordance with the policy model.
- reward function
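To make the tripartite action space listed above concrete, the sketch below enumerates it as a Cartesian product of the three discrete spaces and samples one action. The vocabulary, ranges, and weight levels are hypothetical values, and the uniform sampler is only a stand-in for a learned policy; the reward is not modeled here.

```python
import itertools
import random
from typing import List, Tuple

# An action is (word, (start, end) normalized time range, weight).
Action = Tuple[str, Tuple[float, float], float]

WORDS = ["oil painting", "soft lighting"]           # hypothetical word space
TIME_RANGES = [(0.0, 1.0), (0.0, 0.5), (0.5, 1.0)]  # hypothetical time range space
WEIGHTS = [0.5, 0.75, 1.0, 1.25]                    # hypothetical weight space

# Every combination of word, time range, and weight is one discrete action.
ACTIONS: List[Action] = list(itertools.product(WORDS, TIME_RANGES, WEIGHTS))


def sample_action(rng: random.Random) -> Action:
    """Uniform stand-in for a policy model choosing the next action."""
    return rng.choice(ACTIONS)


action = sample_action(random.Random(0))
word, (start, end), weight = action
```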
Training
The policy model interacts with the text-to-image model (with adjustments made to the text encoder module).
Loss function
where one term measures the difference between the output modifiers of the policy model and those of the initial model.