Article reprint source: AIcore

Original source: New Wisdom

Image source: Generated by Unbounded AI

It’s incredible!

Now you can create beautiful, high-quality 3D models just by typing a few words?

Just recently, a blog post from overseas set the Internet abuzz and put something called MVDream in front of us.

Users can create a lifelike 3D model with just a few words.

And unlike earlier models, MVDream seems to really "understand" physics.

Let's take a look at how magical this MVDream is~

MVDream

The blogger notes that in the era of large models, we have seen plenty of text-generation and image-generation models, and their performance keeps getting stronger.

Later, we even witnessed the birth of text-to-video models, and of course the text-to-3D models we are talking about today.

Just imagine: you type a single sentence and get a model of an object that looks like it exists in the real world, complete with all the necessary details. How cool would that be?

And this is by no means an easy task, especially since the details in the generated model have to be realistic enough.

Let’s take a look at the effect first~

Given the same prompt to five different models, MVDream's result is the one on the far right.

The differences among the five models are visible to the naked eye. The first few completely violate basic facts and only look correct from certain viewpoints.

For example, in the first four pictures, the generated model actually has more than two ears. Although the fourth picture looks more detailed, we can see from a certain angle that the character's face is concave and has an ear stuck on it.

You know the feeling? It immediately reminded this editor of the once-viral front view of Peppa Pig.

It's the same idea: only certain angles are meant to be seen, and you'd better not look from any other angle, or the illusion collapses.

But the model generated by MVDream on the far right is clearly different. No matter how you rotate the 3D model, nothing looks off.

This is what I mentioned at the beginning: MVDream really has physical common sense, and it doesn't resort to weird tricks (like sticking on extra ears) just to make sure two ears show up in every view.

The blogger points out that the key test of whether a 3D model is successful is whether it looks realistic and high-quality from different viewpoints.

We also need the model to be spatially coherent, rather than ending up with the multi-eared models above.

One of the main methods of generating 3D models is to simulate the camera's perspective and then generate what can be seen from a certain perspective.

This approach is called 2D lifting: different viewpoints are stitched together to form the final 3D model.

The reason why there are multiple ears is that the generative model does not have sufficient information about the shape of the entire object in three-dimensional space. MVDream has taken a big step forward in this regard.

The new model resolves the 3D view-consistency problems that earlier approaches kept running into.

Score Distillation Sampling

The method used is called score distillation sampling, developed by DreamFusion.

Before digging into the score distillation sampling technique, we need to understand the architecture this method uses.

In short, the starting point is just another diffusion model for two-dimensional images, similar to DALL·E, Midjourney, and Stable Diffusion.

More specifically, everything starts with a pre-trained DreamBooth model, an open-source model built on Stable Diffusion's text-to-image generation.
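To make that 2D starting point concrete, here is a minimal, hedged sketch of loading a Stable Diffusion-style text-to-image pipeline with the Hugging Face diffusers library. This is not the authors' code; the checkpoint id is only illustrative, and a DreamBooth-finetuned checkpoint would be loaded the same way.

```python
# Minimal sketch (not the authors' code): load a Stable Diffusion-style
# text-to-image pipeline with Hugging Face diffusers. A DreamBooth-finetuned
# checkpoint loads the same way; only the model id changes.
import torch
from diffusers import StableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"   # illustrative checkpoint id
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# One prompt, one 2D image. MVDream extends this single-image setup so that
# several camera views of the same object are generated jointly.
image = pipe("a cute corgi wearing a tiny astronaut suit").images[0]
image.save("corgi.png")
```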

And this is where the changes begin.

What the research team did next was to directly render a set of multi-view images instead of just one image. This step requires a three-dimensional data set of various objects.

Here, the researchers took multiple views of 3D objects from a dataset and used them to train the model to generate those views in reverse, that is, to reconstruct them.

Concretely, the blue self-attention block in the figure below is changed into a three-dimensional self-attention block; the researchers only need to add one dimension so that multiple images are reconstructed instead of a single one.

In the image below, we can see that the camera and timestep are also input into the model for each view to help the model understand which image will be used where and what kind of view needs to be generated.

Now all the images are generated together in one pass, so they can share information and better capture the overall picture.
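To illustrate the idea, here is a small PyTorch sketch written for this article rather than taken from MVDream's code: the tokens of all the views are folded into one sequence so the existing 2D self-attention weights can exchange information across views, and a camera embedding is added to the usual timestep embedding. All class names and shapes are invented for illustration.

```python
# Illustrative sketch of the core idea (not the MVDream source): treat the
# tokens of all V views as one long sequence so that 2D self-attention can
# exchange information across views ("3D" attention), and add a camera
# embedding to the timestep embedding.
import torch
import torch.nn as nn

class MultiViewSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Same kind of parameters a 2D self-attention block has (QKV and output
        # projections), which is why pretrained 2D weights can in principle be reused.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, dim), where tokens are the H*W latent positions.
        b, v, n, d = x.shape
        x = x.reshape(b, v * n, d)      # fold the view dimension into the token sequence
        out, _ = self.attn(x, x, x)     # every view attends to every other view
        return out.reshape(b, v, n, d)

class CameraTimeEmbedding(nn.Module):
    def __init__(self, dim: int, cam_dim: int = 16):
        super().__init__()
        # Camera extrinsics (a flattened 4x4 matrix here, purely for illustration)
        # are projected and added to a simplified timestep embedding, so the model
        # knows which view it is generating at which noise level.
        self.cam_proj = nn.Linear(cam_dim, dim)
        self.time_proj = nn.Linear(1, dim)

    def forward(self, t: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
        return self.time_proj(t[:, None].float()) + self.cam_proj(cam)
```

The point of this arrangement is that nothing about the attention weights themselves changes; only the way the tokens are laid out does, which is what allows the pretrained 2D model to be reused.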

The text is then fed into the model, which is trained to accurately reconstruct objects from the dataset.

And this is where the research team applied the multi-view score distillation sampling process.

Now, with a multi-view diffusion model, the team can generate multiple views of an object.

The next step is to use these views to reconstruct a 3D model that is consistent with the real world, not just with each individual view.

This requires the use of NeRF (neural radiance fields), just like the DreamFusion mentioned earlier.

Basically, this step freezes the previously trained multi-view diffusion model. In other words, at this stage the model is only "used"; it will not be "trained" any further.

Guided by the initial renderings, the researchers generate noisy versions of the initial images using the multi-view diffusion model.

The researchers added noise to let the model know that it needed to generate different versions of the image while still picking up context.

The model is then used to further generate higher quality images.

The manually added noise is then subtracted from the result, and the difference is used to guide and improve the NeRF model in the next step.

These steps are all about better understanding which part of the image the NeRF model should focus on in order to generate better results in the next step.

This process is repeated until a satisfactory 3D model is generated.
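Put together, the loop looks roughly like the following sketch. It is only an illustration of multi-view score distillation as described above; names such as nerf.render, mv_diffusion.add_noise, and mv_diffusion.predict_noise are placeholders, not an actual library API.

```python
# Illustrative sketch of one multi-view score-distillation step. The diffusion
# model stays frozen; only the NeRF parameters are optimized.
import torch

def sds_step(nerf, mv_diffusion, cameras, text_emb, optimizer,
             t_min=0.02, t_max=0.98, guidance_scale=50.0):
    # 1. Render the current NeRF from several sampled camera poses.
    images = torch.stack([nerf.render(c) for c in cameras])        # (V, C, H, W)

    # 2. Pick a noise level and add Gaussian noise to the renderings.
    t = torch.empty(1).uniform_(t_min, t_max).to(images.device)
    noise = torch.randn_like(images)
    noisy = mv_diffusion.add_noise(images, noise, t)

    # 3. Ask the frozen multi-view diffusion model to predict the noise,
    #    conditioned on the text and the camera poses.
    with torch.no_grad():
        pred = mv_diffusion.predict_noise(noisy, t, text_emb, cameras,
                                          guidance_scale=guidance_scale)

    # 4. The difference between the predicted noise and the noise we injected
    #    is pushed back into the NeRF parameters through the rendered images.
    grad = pred - noise
    loss = (grad.detach() * images).sum()   # surrogate loss whose gradient w.r.t. images is `grad`
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```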

On top of this, the team evaluated the image-generation quality of the multi-view diffusion model and examined how different design choices affect its performance.

First, they compared choices of attention modules for modeling cross-view consistency.

These options, contrasted in the short code sketch after the list, include:

(1) One-dimensional temporal self-attention, which is widely used in video diffusion models;

(2) Adding a new 3D self-attention module to the existing model;

(3) Reusing the existing 2D self-attention module for 3D attention.
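The practical difference between option (1) and option (3) comes down to how the tokens are reshaped before attention, as the following hedged PyTorch sketch (written for illustration, not taken from the paper) shows.

```python
# Illustrative contrast (not MVDream code): temporal self-attention only mixes
# the same spatial position across frames, while "3D" self-attention lets every
# token of every view attend to every other token.
import torch
import torch.nn as nn

B, V, N, D = 2, 4, 64, 128          # batch, views/frames, tokens per view, channels
x = torch.randn(B, V, N, D)
attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

# (1) Temporal attention: spatial positions move into the batch dimension, so the
#     sequence seen by attention has length V. Pixels at different locations never
#     talk to each other, which hurts consistency under large viewpoint changes.
x_t = x.permute(0, 2, 1, 3).reshape(B * N, V, D)
out_t, _ = attn(x_t, x_t, x_t)

# (3) Reused 2D attention as 3D attention: all V*N tokens form one sequence, so
#     corresponding content can be matched even if it moves across the image.
x_3d = x.reshape(B, V * N, D)
out_3d, _ = attn(x_3d, x_3d, x_3d)
```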

In this experiment, to clearly expose the differences between these modules, the researchers trained the model on 8 frames spanning a 90-degree viewpoint change, which is closer to a video setting.

The research team also kept a high image resolution of 512×512, the same as the original Stable Diffusion model. The results are shown in the figure below. The researchers found that even with such limited viewpoint changes in a static scene, temporal self-attention still suffers from content drift and cannot maintain view consistency.

The team hypothesizes that this is because temporal attention can only exchange information between the same pixels in different frames, while corresponding pixels may be far apart when the viewpoint changes.

On the other hand, adding a brand-new 3D attention module leads to severe quality degradation without actually learning consistency.

The researchers attribute this to the fact that learning new parameters from scratch requires more training data and time, which does not suit a setting where 3D data is limited. The strategy they propose, reusing the 2D self-attention, achieves the best consistency without degrading generation quality.

The team also noticed that the differences between these modules become much smaller if the image size is reduced to 256 and the number of views to 4. Still, to achieve the best consistency, the researchers based their choice on these preliminary observations in the experiments that follow.

In addition, for multi-view score distillation sampling, the researchers implemented the multi-view diffusion guidance in the threestudio library, which implements state-of-the-art text-to-3D generation methods under a unified framework.

The researchers used the implicit-volume implementation in threestudio as the 3D representation, which includes a multi-resolution hash grid.

For the camera views, the researchers sampled the cameras in exactly the same way as when rendering the 3D dataset.

In addition, the researchers also optimized the 3D model using the AdamW optimizer for 10,000 steps with a learning rate of 0.01.

For score distillation sampling, the maximum and minimum timesteps both start at 0.98 and are annealed down to 0.5 and 0.02, respectively, over the first 8,000 steps.

The rendering resolution starts from 64×64 and gradually increases to 256×256 after 5000 steps.
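Putting those training details together, a hedged sketch of the schedule could look like this; the hyper-parameters are the ones quoted above, while the rendering and loss computation are left as comments because the actual threestudio code is not reproduced here.

```python
# Hedged sketch of the reported optimization schedule (hyper-parameters taken
# from the text; rendering and the SDS loss are left as comments).
import torch

def anneal(start, end, step, total):
    """Linearly anneal a value from `start` to `end` over `total` steps."""
    frac = min(step / total, 1.0)
    return start + (end - start) * frac

params = [torch.zeros(8, requires_grad=True)]      # stand-in for the NeRF parameters
optimizer = torch.optim.AdamW(params, lr=0.01)     # AdamW, lr = 0.01, 10,000 steps

for step in range(10_000):
    # Both timestep bounds start at 0.98; the maximum anneals to 0.5 and the
    # minimum to 0.02 over the first 8,000 steps.
    t_max = anneal(0.98, 0.5, step, 8_000)
    t_min = anneal(0.98, 0.02, step, 8_000)

    # Rendering resolution: 64x64 early on, 256x256 after 5,000 steps.
    resolution = 64 if step < 5_000 else 256

    # ... render `resolution`-sized views, run multi-view SDS with timesteps
    # sampled in [t_min, t_max], then call loss.backward() and optimizer.step() ...
```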

More examples are as follows:

That, in a nutshell, is how the research team took a 2D text-to-image model, adapted it for multi-view synthesis, and then iterated with it to build the text-to-3D pipeline.

Of course, this new method still has its limitations. The most important drawback is that the generated views are currently only 256×256 pixels, which is quite a low resolution.

In addition, the researchers point out that the size of the dataset used for this task inevitably limits the method's generality to some extent; if the dataset is too small, it cannot faithfully reflect our complex world.

References:

https://www.louisbouchard.ai/mvdream/

https://arxiv.org/pdf/2308.16512.pdf