Author: Steven Wang

“What I cannot create, I do not understand.” - Richard Feynman

Preface

With Stable Diffusion in one hand and Midjourney in the other, you create beautiful images.

You use ChatGPT and LLaMA fluently to write elegant texts.

You switch back and forth between MuseNet and MuseGAN to create music that sounds like mountains and flowing water.

The ability to create is arguably the most distinctly human ability, yet with today's rapidly developing technology we increasingly create by building machines that create. Machines can draw original works of art in a given style (draw), write a long and coherent article (write), create pleasant music (compose), and develop winning strategies for complex games (play). This technology is Generative Artificial Intelligence (GenAI), and the GenAI revolution is only just beginning. Now is the best time to learn GenAI.

1. Generative and Discriminative Models

GenAI is a buzzword. The idea behind it is generative modeling, a branch of machine learning whose goal is to train a model that generates new data similar to a given dataset.

Suppose we have a dataset of horses. First, we can train a generative model on this dataset to capture the rules that govern the complex relationships between pixels in horse images. Then, we sample from this model to create realistic horse images that do not exist in the original dataset, as shown in the figure below.

In order to truly understand the goal and importance of generative models, it is necessary to compare them with discriminative models. In fact, most problems in machine learning are solved by discriminative models. Let’s take a look at the following example.

Suppose we have a dataset of paintings, some by Van Gogh and some by other artists. With enough data, we can train a discriminative model to predict whether a given painting was painted by Van Gogh, as shown in the figure below.

When using a discriminative model, each example in the training set has a label. For the two-class problem above, Van Gogh's paintings are usually labeled 1 and non-Van Gogh paintings are labeled 0. In the figure above, the model's final predicted probability is 0.83, so the painting was very likely painted by Van Gogh. Unlike a discriminative model, a generative model does not require labeled examples, because its goal is to generate new data rather than to predict labels for the data.

After reading the example, let us use mathematical symbols to precisely define the generative model and the discriminative model:

  • The discriminative model models P(y|x), estimating the conditional probability of label y given feature x.

  • The generative model models P(x) and directly estimates the probability of feature x. New features can be generated by sampling from this probability distribution.
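The two definitions can be contrasted with a minimal counting sketch. The toy dataset, the single categorical feature, and the function names below are assumptions made for illustration only; real models of P(y|x) and P(x) for images are far more complex.

```python
import random
from collections import Counter

# Hypothetical toy data: x is a brush-stroke style, y is 1 for Van Gogh and 0 otherwise.
data = [
    ("swirly", 1), ("swirly", 1), ("swirly", 0),
    ("flat", 0), ("flat", 0), ("dotted", 0),
]

def fit_discriminative(pairs):
    """Estimate P(y=1 | x) by counting labels for each feature value."""
    totals, positives = Counter(), Counter()
    for x, y in pairs:
        totals[x] += 1
        positives[x] += y
    return {x: positives[x] / totals[x] for x in totals}

def fit_generative(pairs):
    """Estimate P(x) by counting feature values; the labels are ignored."""
    counts = Counter(x for x, _ in pairs)
    n = sum(counts.values())
    return {x: c / n for x, c in counts.items()}

p_y_given_x = fit_discriminative(data)  # can only score a given x
p_x = fit_generative(data)              # can be sampled to produce new x

rng = random.Random(0)
new_x = rng.choices(list(p_x), weights=list(p_x.values()), k=5)
```

Only the second model can produce new feature values; the first can only score a value that it is handed.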

It is important to note that even if we build a discriminative model that perfectly recognizes Van Gogh's paintings, it still does not know how to create a painting that looks like a Van Gogh. It can only output a probability, namely the likelihood that the image came from Van Gogh's hand. This is why generative modeling is generally much harder than discriminative modeling.

2. Framework of Generative Models

Before looking at the framework of generative models, let's play a game. Suppose the points in the figure below are generated by a certain rule, which we call Pdata. Your task is to generate a different point x = (x1, x2) that looks as if it were generated by the same rule Pdata.

How would you generate this point? You would probably use the given points to form a model Pmodel in your mind of where points are likely to lie, and then pick a new point from the region that this model covers. The model Pmodel is therefore an estimate of Pdata. The simplest such model is the orange box in the figure below: points can only be generated inside the box, never outside it.

To generate new points, we can randomly select a point from the box, or more precisely, sample from the distribution of the model Pmodel. This is a very simple generative model. You create a model (orange box) from the training data (black points), and then you can sample from the model, hoping that the generated points will look similar to the points in the training set.
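The box model can be sketched in a few lines. The training coordinates below are made up, and fitting the box as the smallest axis-aligned rectangle around the training points is an assumption; the figure only shows that some box is used.

```python
import random

# Hypothetical training points standing in for the black points in the figure.
train = [(1.0, 2.0), (2.5, 3.5), (3.0, 1.5), (4.0, 4.0), (2.0, 2.5)]

def fit_box(points):
    """Fit the smallest axis-aligned box that contains every training point."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), max(xs), min(ys), max(ys))

def sample_box(box, rng):
    """Draw one point uniformly from inside the box (sampling from Pmodel)."""
    x_min, x_max, y_min, y_max = box
    return (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))

def in_box(box, point):
    """Return True if Pmodel can generate this point."""
    x_min, x_max, y_min, y_max = box
    return x_min <= point[0] <= x_max and y_min <= point[1] <= y_max

rng = random.Random(0)
box = fit_box(train)
samples = [sample_box(box, rng) for _ in range(3)]
```

Every sampled point lies inside the box by construction, whether or not the true rule Pdata would ever produce it.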

Now we can formally propose a framework for generative learning.

Now let’s reveal the true data generating distribution Pdata and see how to apply the above framework to this example. From the figure below we can see that the data generating rule Pdata is that the points are uniformly distributed only over land and never appear in the ocean.

Obviously, our model Pmodel is a simplification of the rule Pdata. Checking points A, B, and C in the above figure can help us understand whether the model Pmodel successfully imitates the rule Pdata.

  • Point A does not conform to rule Pdata because it occurs in the sea, but can be generated by model Pmodel because it occurs within the orange box.

  • Point B cannot be generated by model Pmodel because it appears outside the orange box, but conforms to rule Pdata because it appears on land.

  • Point C is generated by the model Pmodel and conforms to the rule Pdata.

This example shows the basic concepts behind generative modeling. Although using generative models in reality is much more complicated, the basic framework is the same.

3. The First Generative Model

Suppose you are the Chief Fashion Officer (CFO) of a company and your job is to create new fashionable outfits. This year you received 50 images of fashion combinations (as shown below), and you need to create 10 new fashion combinations.

Although you are the Chief Fashion Officer, you are also a data scientist, so you decide to use a generative model to solve this problem. After looking at the 50 pictures above, you decide to describe each fashion combination with five features: accessories type, clothing color, clothing type, hair color, and hair type.

The first 10 image data features are as follows.

Each feature has a different number of possible values:

  • 3 types of accessories:

    • Blank, Round, Sunglasses

  • 8 clothing colors:

    • Black, Blue01, Gray01, PastelGreen, PastelOrange, Pink, Red, White

  • 4 clothing types:

    • Hoodie, Overall, ShirtScoopNeck, ShirtVNeck

  • 6 hair colors:

    • Black, Blonde, Brown, PastelPink, Red, SilverGray

  • 7 hair types:

    • NoHair, LongHairBun, LongHairCurly, LongHairStraight, ShortHairShortWaved, ShortHairShortFlat, ShortHairFrizzle

There are 3 * 8 * 4 * 6 * 7 = 4032 feature combinations, so we can think of the sample space as containing 4032 points. From the given 50 data points, we can see that Pdata prefers certain values for some features. From the table above, we can see that white clothing and silver-gray hair appear more often than other values. Since we don't know the real Pdata, we can only use these 50 data points to build a Pmodel that is close to Pdata.
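As a quick check of the size of the sample space, the feature values listed above can be enumerated directly. The dictionary keys are labels chosen for this sketch; the values are copied from the lists above.

```python
from itertools import product

features = {
    "accessories": ["Blank", "Round", "Sunglasses"],
    "clothing_color": ["Black", "Blue01", "Gray01", "PastelGreen",
                       "PastelOrange", "Pink", "Red", "White"],
    "clothing_type": ["Hoodie", "Overall", "ShirtScoopNeck", "ShirtVNeck"],
    "hair_color": ["Black", "Blonde", "Brown", "PastelPink", "Red", "SilverGray"],
    "hair_type": ["NoHair", "LongHairBun", "LongHairCurly", "LongHairStraight",
                  "ShortHairShortWaved", "ShortHairShortFlat", "ShortHairFrizzle"],
}

# Every point in the sample space is one tuple with one value per feature.
sample_space = list(product(*features.values()))
print(len(sample_space))  # 4032
```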

3.1 Minimalist Model

The simplest method is to assign a probability parameter to each of the 4032 feature combinations. The model then has 4031 free parameters, because all the probabilities must add up to 1. We go through the 50 data points one by one and update the parameters of the model (θ1, θ2, ..., θ4031). Each parameter is estimated as:

$$\theta_j = \frac{n_j}{N}$$

where N is the number of observations, i.e. 50, and nj is the number of times the jth feature combination occurs in the 50 data points.

For example, the feature combination (LongHairStraight, Red, Round, ShirtScoopNeck, White) (called combination 1) appears twice, so

$$\theta_1 = \frac{2}{50} = 0.04$$

The feature combination (LongHairStraight, Red, Round, ShirtScoopNeck, Blue01) (called combination 2) does not appear, so

$$\theta_2 = \frac{0}{50} = 0$$

Following this rule, we compute a θ value for each of the 4032 combinations. Most of these θ values are 0. Worse, the model can never generate a combination it has not seen (θ = 0 means that we never observed a picture with this feature combination). To fix this, we add 1 to the numerator and the total number of possible feature combinations d (here 4032) to the denominator. This technique is called Laplace smoothing:

$$\theta_j = \frac{n_j + 1}{N + d}$$
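A minimal sketch of both estimates follows. It takes d to be the number of possible feature combinations (4032), which is the choice that makes the smoothed probabilities sum to 1. Only the count of 2 for combination 1 comes from the text; the remaining counts are omitted, and the function names are hypothetical.

```python
from collections import Counter

N = 50    # number of observations, from the text
d = 4032  # number of possible feature combinations, from the text

combo_1 = ("LongHairStraight", "Red", "Round", "ShirtScoopNeck", "White")
combo_2 = ("LongHairStraight", "Red", "Round", "ShirtScoopNeck", "Blue01")

# Partial counts for illustration: combination 1 was seen twice, combination 2 never.
counts = Counter({combo_1: 2})

def theta_counting(combo):
    """Plain counting estimate n_j / N; unseen combinations get exactly 0."""
    return counts[combo] / N

def theta_smoothed(combo):
    """Laplace-smoothed estimate (n_j + 1) / (N + d); never 0."""
    return (counts[combo] + 1) / (N + d)

print(theta_counting(combo_1))      # 0.04
print(theta_counting(combo_2))      # 0.0
print(theta_smoothed(combo_2) > 0)  # True
```

Note that every unseen combination receives the same smoothed value, 1 / (N + d).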

Now every combination (including those not in the original dataset) has a non-zero probability of being sampled. However, this is still not a satisfactory generative model, because every point that is not in the original dataset gets the same probability. If we tried to use such a model to generate Van Gogh paintings, it would assign equal probability to the following two paintings:

  1. Reproduction of a painting by Vincent van Gogh (not in the original dataset)

  2. A painting made of random pixels (not in the original dataset)

This is obviously not what we want from a generative model. We want it to learn some inherent structure from the data, so that it puts more probability weight on regions of the sample space that it believes are more likely, rather than putting almost all the probability weight on points that exist in the dataset.

3.2 Next-Simplest Model

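The Naive Bayes model greatly reduces the number of parameters needed. It assumes that every feature is independent of every other feature. In our data, this means that a person's hair color (feature xj) tells us nothing about the color of their clothes (feature xk). Mathematically:

$$p(x_j \mid x_k) = p(x_j)$$

With this assumption, the chain rule of probability simplifies to a product of per-feature probabilities, where K is the number of features (5 here):

$$p(x) = p(x_1, \ldots, x_K) = \prod_{k=1}^{K} p(x_k \mid x_1, \ldots, x_{k-1}) = \prod_{k=1}^{K} p(x_k)$$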

The Naive Bayes model simplifies the original problem of "estimating the probability of each feature combination" into "estimating the probability of each feature value". Originally we needed 4031 (3 * 8 * 4 * 6 * 7 - 1) parameters, but now we only need 23 (3 + 8 + 4 + 6 + 7 - 5) parameters, because the probabilities for each of the 5 features must add up to 1. Each parameter is estimated as:

$$\theta_{kl} = \frac{n_{kl}}{N}$$

where N is the number of observations, i.e. 50, and nkl is the number of times the kth feature takes its lth value.

Counting over the 50 data points gives the parameter values of the Naive Bayes model, which are listed in the following table.

To calculate the probability that the model generates a certain data feature, just multiply the probabilities in the table, for example:

The above combination did not appear in the original dataset, but the model still assigned it a non-zero probability, so it was still able to be generated by the model. Therefore, the Naive Bayes model was able to learn some structure from the data and use it to generate new examples that were not seen in the original dataset. The following figure shows 10 new fashion combinations generated by the model.
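The counting, multiplying, and sampling steps of the Naive Bayes model can be sketched as follows. The five-row, two-feature dataset is invented for illustration and is not the article's 50-image dataset; the function names are also hypothetical.

```python
import random
from collections import Counter

# Hypothetical mini-dataset with two features: (hair color, clothing color).
rows = [
    ("SilverGray", "White"),
    ("SilverGray", "White"),
    ("Black", "White"),
    ("SilverGray", "Pink"),
    ("Blonde", "Red"),
]

def fit_naive_bayes(data):
    """Return one {value: probability} table per feature, estimated by counting."""
    n = len(data)
    return [
        {value: count / n for value, count in Counter(row[k] for row in data).items()}
        for k in range(len(data[0]))
    ]

def probability(tables, row):
    """Multiply the per-feature probabilities, as the independence assumption allows."""
    p = 1.0
    for table, value in zip(tables, row):
        p *= table.get(value, 0.0)
    return p

def sample(tables, rng):
    """Sample each feature independently from its own table."""
    return tuple(
        rng.choices(list(t), weights=list(t.values()), k=1)[0] for t in tables
    )

tables = fit_naive_bayes(rows)
unseen = ("Black", "Pink")              # this pair never appears in rows
p_unseen = probability(tables, unseen)  # (1/5) * (1/5), non-zero
rng = random.Random(0)
new_rows = [sample(tables, rng) for _ in range(10)]
```

Because each feature is sampled from its own table, the model can produce pairs such as ("Black", "Pink") that were never observed together.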

With only 5 features, this problem is low-dimensional. The Naive Bayes assumption that the features are independent of each other is reasonable here, so the results generated by the model are not bad. Next, let's look at an example where the model breaks down.

4. Difficulties in Generative Models

4.1 High-dimensional data

As the Chief Fashion Officer, you have successfully used Naive Bayes to generate 10 new fashion combinations. You are full of confidence and think your model is invincible until you encounter the following dataset.

The dataset is no longer represented by five features, but by 32 * 32 = 1024 pixels. Each pixel takes an integer value from 0 to 255, where 0 represents white and 255 represents black. The following table lists the values of pixels 1 to 5 for the first 10 images.

You use the same model to generate 10 new fashion combinations. The results are shown below. Every image is ugly, the images all look alike, and individual features cannot be told apart. Why is this so?

First, the Naive Bayes model samples every pixel independently, even though adjacent pixels in real images are usually very similar. The pixels of a piece of clothing, for example, should all be roughly the same color, but the model samples each one at random, so the clothes in the picture above come out speckled with many colors. Second, the high-dimensional sample space contains an enormous number of possible images, and only a tiny fraction of them are recognizable. If the Naive Bayes model works directly on such highly correlated pixel values, it has very little chance of finding a satisfactory combination of values.
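The first failure can be reproduced on a deliberately tiny, made-up example. Here each "image" is a strip of 8 pixels that all share one value, so neighboring pixels are perfectly correlated; the strip length, the dataset, and the function names are assumptions for illustration.

```python
import random

rng = random.Random(0)
NUM_PIXELS = 8

# Hypothetical training set: every strip is entirely 0 or entirely 255.
train = [[rng.choice([0, 255])] * NUM_PIXELS for _ in range(200)]

def fit_pixel_marginals(images):
    """Estimate P(pixel_k = 255) separately for every pixel position."""
    n = len(images)
    return [sum(img[k] == 255 for img in images) / n for k in range(NUM_PIXELS)]

def sample_image(marginals, rng):
    """Sample every pixel independently of its neighbors (the Naive Bayes assumption)."""
    return [255 if rng.random() < p else 0 for p in marginals]

def is_uniform(img):
    """Return True if all pixels in the strip share one value, as in the training data."""
    return len(set(img)) == 1

marginals = fit_pixel_marginals(train)
samples = [sample_image(marginals, rng) for _ in range(200)]

train_ok = sum(map(is_uniform, train)) / len(train)
sample_ok = sum(map(is_uniform, samples)) / len(samples)
print(train_ok, sample_ok)
```

All training strips are uniform, while only a small fraction of the independently sampled strips are, because the per-pixel tables carry no information about neighboring pixels.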

In summary, for low-dimensional sample spaces with low feature correlation, Naive Bayes with independent sampling works very well; but for high-dimensional sample spaces with high feature correlation, it is almost impossible to find valid images by sampling pixels independently.

This example highlights two challenges that generative models must overcome to be successful:

  1. How does the model handle conditional dependencies between high-dimensional features?

  2. How does the model find the very small proportion of observations that meet the conditions from the high-dimensional sample space?

For generative models to succeed in high-dimensional and highly correlated sample spaces, we turn to deep learning. We need a model that can infer the relevant structure from the data, rather than being told in advance what assumptions to make. Deep learning can form its own features in a lower-dimensional space, which is a form of representation learning.

4.2 Representation Learning

Representation learning means learning a meaningful, compact representation of high-dimensional data.

Suppose you are going to meet an online friend you have never met in person. When you arrive at the meeting place, it is too crowded to find her, so you call her and describe your appearance. You would not say that pixel 1 of your image is black, pixel 2 is light black, pixel 3 is gray, and so on. Instead, you assume that your friend has a general idea of what ordinary people look like, and you describe features of groups of pixels based on that understanding: for example, you have short black hair and wear a pair of golden glasses. Usually no more than 10 such descriptions are enough for your friend to form an image of you in her mind. The image may be rough, but it is enough for her to find you among hundreds of people, even though she has never seen you.

This is the core idea behind representation learning. Instead of trying to model the high-dimensional sample space directly, we use some low-dimensional latent space to describe each observation in the training set, and then learn a mapping function that can take a point in the latent space and map it to the original sample space. In other words, each point in the latent space represents the characteristics of the high-dimensional data.

If the above is difficult to understand, please look at the training set consisting of some grayscale jar images below.

It is not difficult to see that these jars can be described by only two features: height and width. Therefore, we can transform the high-dimensional pixel space of the image into a two-dimensional latent space, as shown in the figure below. In this way, we can sample a point from the latent space (blue points) and then transform it into an image through the mapping function f.

Recognizing that the original dataset can be represented by a simpler latent space is not easy for a machine. It first needs to determine that height and width are the two latent space dimensions that best describe the dataset, and then learn a mapping function f that takes a point in this space and maps it to a grayscale jar image. Deep learning allows us to train machines to find these complex relationships without human guidance.
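To make the idea of a mapping function concrete, here is a hand-written stand-in for f that turns a latent point (height, width) into a small image. In practice f would be learned from data by a neural network; the rectangle drawing rule, the 16 x 16 image size, and the latent range below are assumptions for illustration.

```python
import random

SIZE = 16  # hypothetical image side length in pixels

def f(height, width):
    """Map a latent point (height, width), each in (0, 1], to a SIZE x SIZE image.

    Pixels inside the rectangle are 1 and background pixels are 0.
    """
    h = max(1, round(height * SIZE))
    w = max(1, round(width * SIZE))
    left = (SIZE - w) // 2  # center the rectangle horizontally
    top = SIZE - h          # the rectangle stands on the bottom edge
    return [
        [1 if (r >= top and left <= c < left + w) else 0 for c in range(SIZE)]
        for r in range(SIZE)
    ]

rng = random.Random(0)
z = (rng.uniform(0.2, 1.0), rng.uniform(0.2, 1.0))  # sample a point in latent space
image = f(*z)                                       # map it to pixel space

for row in f(0.5, 0.25):
    print("".join("#" if px else "." for px in row))
```

Two numbers are enough to determine all 256 pixel values here, which is the sense in which the latent space is a compact description of the image.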

5. Classification of Generative Models

All types of generative models ultimately aim to solve the same task, but they all model the density function in slightly different ways. Generally speaking, there are two categories:

  • Explicitly modeling the density function, either by

    • constraining the model in some way so that the density function can be calculated, as in normalizing flow models, or

    • approximating the density function, as in variational autoencoders (VAEs) and diffusion models

  • Implicitly modeling the density function through a stochastic process that directly generates data, as in generative adversarial networks (GANs)

Summary

Generative artificial intelligence (GenAI) is a type of artificial intelligence that can be used to create new content and ideas (including text, images, videos, and music). Like much of modern artificial intelligence, GenAI is powered by very large deep learning models pre-trained on large amounts of data, usually called foundation models (FMs). With GenAI, we can draw cooler images, write more beautiful texts, and compose more moving music, but first we need to understand how GenAI creates new things. As Richard Feynman said at the beginning of the article, “What I cannot create, I do not understand.”