Author: New Wisdom
As soon as Google's StyleDrop came out, it went viral on the Internet.
Given Van Gogh's The Starry Night, the AI takes on the role of the master: after absorbing this abstract style, it produces countless paintings in the same vein.
Here is another cartoon-style reference picture; the objects drawn from it come out even cuter.
It can even accurately control the details and design a logo in an original style.
The charm of StyleDrop is that you only need one picture as a reference, and no matter how complex the art style is, you can deconstruct and reproduce it.
Netizens say this is yet another AI tool that will put designers out of work.
The explosively popular StyleDrop is the latest work from the Google Research team.
Paper address: https://arxiv.org/pdf/2306.00983.pdf
Now, with tools like StyleDrop, you can not only paint with more control, but also complete previously unimaginable detailed work, such as drawing logos.
Even Nvidia scientists called it a "phenomenal" achievement.
Master of Customization
The paper's authors explain that StyleDrop was inspired by the eyedropper (color-picking) tool.
In the same spirit, StyleDrop lets anyone quickly and effortlessly "pick" a style from a single reference image (or a few) and generate images in that style.
A sloth can have 18 styles:
A panda has 24 styles:
StyleDrop faithfully reproduces a watercolor painting drawn by a child; even the wrinkles of the paper are preserved.
It has to be said: that is impressively strong.
StyleDrop can also take different styles of English lettering design as references:
The same Van Gogh-style letters.
There are also line drawings. Line drawings are highly abstract images whose composition has to be handled very deliberately, and past methods have consistently struggled with them.
The brushstrokes of the cheese's shadow in the reference image are carried over to the objects in each generated image.
Creations referencing the Android logo:
In addition, by combining StyleDrop with DreamBooth, the researchers extended it to customize not only the style but also the content.
For example, generating paintings of a corgi in a similar Van Gogh style:
Here is another one: the corgi below looks like the Great Sphinx of Egypt.
How does it work?
StyleDrop is built on Muse and consists of two key parts:
One is parameter-efficient fine-tuning of the generative vision transformer; the other is iterative training with feedback.
The researchers then synthesize images from the two fine-tuned models (one guided by style, one by content).
Muse is a state-of-the-art text-to-image synthesis model based on a masked generative image transformer. It contains two synthesis modules, for base image generation (256 × 256) and for super-resolution (512 × 512 or 1024 × 1024).
Each module consists of a text encoder T, a transformer G, a sampler S, an image encoder E, and a decoder D.
T maps a text prompt t∈T to a continuous embedding space E. G processes the text embedding e∈E to generate logits l∈L for the visual token sequence. S extracts a visual token sequence v∈V from the logits via iterative decoding, which runs several steps of transformer inference conditioned on the text embedding e and the visual tokens decoded in previous steps.
Finally, D maps the discrete token sequence to the pixel space I. In general, given a text prompt t, the image I is synthesized as follows:
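The formula itself is not reproduced on this page; a reconstruction from the component definitions above (notation approximate, not the paper's exact typesetting) would read:

```latex
% e: text embedding, v: decoded visual tokens, I: synthesized image
e = T(t), \qquad v = S(e, G), \qquad I = D(v)
```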
Figure 2 shows a simplified architecture of the Muse transformer layer, partially modified to support parameter-efficient fine-tuning (PEFT) with adapters.
The sequence of visual tokens (shown in green), conditioned on the text embedding e, is processed by a transformer with L layers. The learned parameters θ are used to construct the weights for adapter tuning.
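To make the adapter idea concrete, here is a minimal PyTorch sketch (not the paper's code; the class names, bottleneck width, and the frozen layer's call signature are assumptions) of injecting a small trainable θ on top of a frozen transformer layer:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck adapter: these weights are the learned parameters theta."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()
        # Zero-init the up-projection so the adapted layer starts out identical
        # to the frozen one.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual adapter update


class AdaptedTransformerLayer(nn.Module):
    """Frozen Muse-style transformer layer followed by a trainable adapter."""
    def __init__(self, frozen_layer: nn.Module, d_model: int):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad_(False)      # the base model stays frozen
        self.adapter = Adapter(d_model)  # only these weights are updated

    def forward(self, tokens, text_embedding):
        # Hypothetical signature: visual tokens conditioned on the text embedding e.
        h = self.layer(tokens, text_embedding)
        return self.adapter(h)
```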
To train θ, in many cases only images are available as style references, so text prompts have to be added manually. The researchers propose a simple, templated approach to constructing these prompts: a description of the content followed by a phrase describing the style.
For example, the researchers described an object as “cat” in Table 1 and attached “watercolor” as a style description.
Including descriptions of both content and style in the prompt is crucial, because it helps disentangle content from style, which is the researchers' main goal.
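As a toy illustration of that template (the exact phrasing of the prompts is an assumption, not copied from the paper), the content description and the style descriptor can simply be composed:

```python
def build_prompt(content: str, style_descriptor: str) -> str:
    """Compose a prompt: content description first, then a style phrase."""
    return f"{content} in {style_descriptor} style"

# Training: describe the single style reference, e.g. a cat drawn in watercolor.
print(build_prompt("a cat", "watercolor painting"))
# -> "a cat in watercolor painting style"

# Inference: swap the content, keep the style phrase to reuse the learned style.
print(build_prompt("a panda", "watercolor painting"))
```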
Figure 3 shows iterative training with feedback.
When trained on a single style reference image (orange box), some images generated by StyleDrop may show content extracted from the style reference image (red box, the image background contains a house similar to the style image).
Other images (blue boxes) are better at separating style from content. Iterative training of StyleDrop on good samples (blue boxes) results in a better balance between style and text fidelity (green boxes).
Here the researchers also used two methods:
-CLIP score
The CLIP score measures how well an image and a text are aligned: it is the cosine similarity between the visual and textual CLIP embeddings, so it can be used to assess the quality of generated images.
The researchers select the generated images with the highest CLIP scores. They call this approach iterative training with CLIP feedback (CF).
In experiments, the researchers found that using the CLIP score to evaluate the quality of synthesized images is an effective way to improve recall (i.e., text fidelity) without losing too much style fidelity.
However, the CLIP score may not be fully aligned with human intent, and it cannot capture subtle style properties.
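A minimal sketch of CLIP feedback, using the Hugging Face transformers CLIP implementation rather than whatever the authors used internally: score each synthesized image against its prompt and keep the top-scoring samples for the next training round.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=[image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

def select_for_next_round(images, prompts, k=10):
    """Iterative training with CLIP feedback (CF): keep the k best-aligned samples."""
    scored = sorted(zip(images, prompts),
                    key=lambda pair: clip_score(*pair), reverse=True)
    return scored[:k]
```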
-HF
Human feedback (HF) is a more direct way to inject user intention directly into synthetic image quality assessment.
HF has proven powerful and effective in reinforcement learning from human feedback for LLM fine-tuning.
HF can be used to compensate for the problem that CLIP scores cannot capture subtle style attributes.
Much recent research has focused on personalizing text-to-image diffusion models to synthesize images that combine multiple personal styles.
The researchers showed how DreamBooth and StyleDrop can be combined in a simple way to allow personalization of both style and content.
This is done by sampling from two modified generative distributions, guided by θs for style and θc for content, respectively, where the adapter parameters are trained independently on style and content reference images.
Unlike existing off-the-shelf methods, the team’s approach does not require joint training of learnable parameters on multiple concepts, which leads to greater compositional capabilities since the pre-trained adapters are trained separately on individual subjects and styles.
The overall sampling process follows the iterative decoding of equation (1), but the logits are sampled differently at each decoding step.
Let t be the text prompt and c the text prompt without the style descriptor; the logits at step k are computed as follows:
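The formula is not reproduced here; reconstructed from the surrounding description (γ interpolating between the style adapter θs with prompt t and the content adapter θc with prompt c), it has the form:

```latex
% v_{<k}: visual tokens decoded before step k; theta_s, theta_c: style / content adapters
l_k = (1 - \gamma)\, G_{\theta_s}\big(v_{<k}, T(t)\big)
      + \gamma\, G_{\theta_c}\big(v_{<k}, T(c)\big)
```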
Here γ balances StyleDrop and DreamBooth: if γ = 0 we get StyleDrop, and if γ = 1 we get DreamBooth.
By setting γ reasonably, we can get a suitable image.
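In code, one decoding step might blend the two sets of logits like this (a sketch; style_model and content_model stand in for the transformer with the style and content adapters loaded, and their call signature is hypothetical):

```python
import torch

def blended_logits(style_model, content_model, tokens,
                   styled_prompt_emb, content_prompt_emb,
                   gamma: float) -> torch.Tensor:
    """One decoding step: gamma = 0 recovers StyleDrop, gamma = 1 recovers DreamBooth."""
    l_style = style_model(tokens, styled_prompt_emb)        # adapter theta_s, prompt t
    l_content = content_model(tokens, content_prompt_emb)   # adapter theta_c, prompt c
    return (1.0 - gamma) * l_style + gamma * l_content
```

Intermediate values of γ trade style fidelity against content fidelity, which is what "setting γ reasonably" amounts to.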
Experimental setup
So far, style adaptation in text-to-image generation models has not been extensively studied.
Therefore, the researchers proposed a new experimental plan:
-Data collection
The researchers collected dozens of images in different styles, from watercolor and oil paintings, flat illustrations, 3D renderings to sculptures of different materials.
-Model configuration
The researchers tuned StyleDrop on top of Muse using adapters. For all experiments, the adapter weights were updated for 1000 steps using the Adam optimizer with a learning rate of 0.00003. Unless otherwise stated, "StyleDrop" refers to the second-round model, trained on more than 10 synthetic images with human feedback.
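As a sketch of that configuration (the model, data loader, and loss function are placeholders; only the step count, optimizer, and learning rate come from the text above):

```python
import torch

def finetune_adapter(model, adapter_params, data_loader, loss_fn,
                     steps: int = 1000, lr: float = 3e-5):
    """Adapter tuning as described above: 1000 Adam steps at lr = 0.00003."""
    optimizer = torch.optim.Adam(adapter_params, lr=lr)
    data_iter = iter(data_loader)
    for step in range(steps):
        try:
            batch = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)   # cycle over the few reference images
            batch = next(data_iter)
        # loss_fn is a placeholder, e.g. masked-token prediction on the style
        # reference image paired with its templated prompt.
        loss = loss_fn(model, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```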
-Evaluation
The quantitative evaluation reported in the study is based on CLIP, which measures style consistency and text alignment. In addition, the researchers conducted a user preference study to evaluate style consistency and text alignment.
As shown in the figure, the researchers collected 18 pictures in different styles, together with the results of StyleDrop's processing.
As you can see, StyleDrop is able to capture the nuances of texture, shading, and structure across a variety of styles, giving you greater control over style than before.
For comparison, the researchers also presented the results of DreamBooth on Imagen, DreamBooth's LoRA implementation on Stable Diffusion, and Textual Inversion.
The specific results are shown in the table, which shows the evaluation indicators of human scores (top) and CLIP scores (bottom) for image-text alignment (Text) and visual style alignment (Style).
Qualitative comparison of (a) DreamBooth, (b) StyleDrop, and (c) DreamBooth + StyleDrop:
Here, the researchers applied the two CLIP-based metrics mentioned above: the text score and the style score.
For the text score, the researchers measured the cosine similarity between the image and text embeddings. For the style score, the researchers measured the cosine similarity between the style reference and the synthesized image embeddings.
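The style score follows the same recipe as the CLIP-feedback sketch earlier, but compares image embeddings with image embeddings; again using the Hugging Face CLIP model as a stand-in:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def style_score(style_reference: Image.Image, generated: Image.Image) -> float:
    """Cosine similarity between the CLIP embeddings of reference and output."""
    inputs = processor(images=[style_reference, generated], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float((emb[0] * emb[1]).sum())
```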
The researchers generated a total of 1,520 images for 190 text prompts. Higher final scores are desirable, but these metrics are not perfect.
Iterative training (IT) improves the text score, which is in line with the researchers' goal.
However, as a trade-off, the style score degrades relative to the first-round model, because the second round is trained on synthetic images, whose style may be biased by the selection process.
DreamBooth on Imagen falls behind StyleDrop in style score (0.644 vs. 0.694 for HF).
The researchers noticed that the increase in style score for DreamBooth on Imagen was not significant (0.569 → 0.644), while the increase for StyleDrop on Muse was more significant (0.556 → 0.694).
The researchers conclude that style fine-tuning on Muse is more effective than on Imagen.
Additionally, for fine-grained control, StyleDrop captures subtle stylistic differences such as color shifts, layers, or sharp angles.
Hot comments from netizens
With StyleDrop, designers could work 10 times faster.
One day in the AI world is like 10 years in the human world. AIGC is developing at the speed of light, blindingly fast!
Tools just follow the trend, and those that should be eliminated have already been eliminated.
This tool is much easier to use than Midjourney for making logos.
References:
https://styledrop.github.io/