According to Metropolis Express, Alibaba's DAMO Academy released a text-to-video generation model on ModelScope yesterday. According to the official introduction, the model currently consists of three sub-networks: a text feature extractor, a diffusion model that maps text features to the video latent space, and a network that maps the video latent space to the video visual space. The full model has about 1.7 billion parameters and supports English input. The diffusion model uses a Unet3D structure and generates video by iteratively denoising a video that starts as pure Gaussian noise.
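For readers unfamiliar with the approach, the iterative denoising described above can be illustrated with a minimal sketch. It is not the released model's code: the noise predictor below is a hypothetical placeholder standing in for the trained Unet3D, the latent shape, step count, and noise schedule are assumptions chosen for illustration, and the update rule is the standard DDPM sampling step rather than anything confirmed by the official introduction.

```python
import numpy as np


def toy_noise_predictor(x, t, text_features):
    """Hypothetical stand-in for the Unet3D noise predictor.

    A real model would be a trained network conditioned on the text
    features and the step index; this placeholder ignores both.
    """
    return 0.1 * x


def sample_video_latent(text_features, shape=(4, 8, 8, 8), steps=50, seed=0):
    """Run a standard DDPM-style reverse loop over a video latent.

    shape is an assumed (frames, channels, height, width) layout.
    """
    rng = np.random.default_rng(seed)

    # Assumed linear noise schedule.
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from a latent "video" of pure Gaussian noise.
    x = rng.standard_normal(shape)

    # Denoise step by step, from the noisiest step down to step 0.
    for t in reversed(range(steps)):
        eps = toy_noise_predictor(x, t, text_features)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No fresh noise is added on the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise

    return x


# Placeholder text features; a real pipeline would obtain these from the
# text feature extraction sub-network.
latent = sample_video_latent(text_features=np.zeros(16))
```

In the pipeline as described, the resulting latent would then be passed to the third sub-network, which maps the video latent space to the video visual space.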
Earlier, in February, it was reported that Alibaba was developing its own ChatGPT-style chatbot, which was in the internal testing stage at the time.
