Article reprint source: AI Trends

Original source: New Wisdom

Image source: Generated by Unbounded AI

The era of freeing your hands and letting your words do the writing has truly arrived.

When you want to write a promotional piece about "Genshin Impact", you no longer need to scour the Internet for material. Simply give the model one instruction: "Help me write an article on the topic of Genshin Impact."

The game's background, launch date, influence, and other key points are all covered. You can then let the model automatically insert fitting, vivid images.

In the blink of an eye, a customized piece is complete.

So, what kind of model has such magical power?

It is Puyu Lingbi (hereinafter referred to by its English name, InternLM-XComposer), the first large model for interleaved text-image article composition launched by Shanghai Artificial Intelligence Laboratory (Shanghai AI Laboratory).

Relying on its powerful multimodal capabilities, it can compose interleaved text-image articles with one click, opening up new application possibilities for large models.

Currently, the team has open-sourced two versions of InternLM-XComposer, one for intelligent composition and dialogue (InternLM-XComposer-7B) and one for multi-task pre-training (InternLM-XComposer-VL-7B), both free for commercial use.

Open source link: https://github.com/InternLM/InternLM-XComposer

Technical report: https://arxiv.org/abs/2309.15112

Since July this year, Shanghai AI Laboratory has successively open-sourced the 7B (InternLM-7B) and 20B (InternLM-20B) versions of its Shusheng·Puyu (InternLM) large language model, providing the industry with a complete base for large-model R&D and application, together with a full-chain tool system.

Built on the InternLM large language model, InternLM-XComposer accepts both visual and language input. It not only performs well in image-text dialogue but can also "generate" articles interleaving text and images with one click.

Accurate image-text understanding, one-click illustrated articles

InternLM-XComposer conducts fluent image-text dialogue in both Chinese and English and accurately understands image content. Thanks to InternLM's high-quality multilingual pre-training, it demonstrates a deep accumulation of knowledge about Chinese culture.

For example, when given a relevant painting, InternLM-XComposer can quickly recognize that the painting depicts the allusion of the "Battle of Red Cliffs" and accurately describe the key factors behind the battle's outcome, reflecting its strong image understanding and knowledge reserves.

InternLM-XComposer identifies a Chinese cultural allusion

On top of the "basic skill" of multimodal image-text dialogue, InternLM-XComposer unlocks a new capability: creating articles that interleave text and images.

Large language models (LLMs) can already write text, but a high-quality article often needs accurate and engaging illustrations to come alive.

The InternLM-XComposer team extended InternLM's strong language capabilities to the multimodal domain, enabling multimodal article creation. Users need only provide a topic to generate an illustrated article in one click, experiencing a new paradigm for visual-and-text creation.

For example, when asked to create a travel guide, the model can quickly generate a long article covering a destination's history, major attractions, and cultural relics, automatically inserting images that match the text at appropriate locations.

Beyond automatic image matching, InternLM-XComposer also offers image recommendation and replacement functions, so the illustrated content can be customized to users' actual needs.

InternLM-XComposer generates a Chinese travel guide

At present, InternLM-XComposer supports illustrated generation for popular-science articles, marketing copy, press releases, film and TV reviews, lifestyle guides, and other article types, and will gradually open up more capabilities for more diverse tasks.

InternLM-XComposer generates an English movie review

Three steps to create an illustrated article

InternLM-XComposer follows a "three-step" algorithmic process for creating illustrated articles.

InternLM-XComposer's illustrated-article creation pipeline

1. Understand the user's instruction and write a long article on the requested topic: with its strong writing ability, InternLM-XComposer composes a polished article from the topic the user enters.

2. Intelligently analyze the article, automatically planning the ideal illustration locations and generating content requirements for the needed images: InternLM-XComposer analyzes the article's content and paragraph layout, plans where illustrations are needed, and for each planned illustration generates a textual description of the required image content.

3. Multi-level intelligent screening, using the multimodal large model's image understanding to lock in the best picture from a gallery: following a coarse-to-fine strategy, InternLM-XComposer first uses text-image retrieval to select a group of candidate images from a massive gallery according to each generated description. It then feeds the candidates to the multimodal large model, whose image understanding lets it automatically select the picture that best matches the surrounding text and the article's overall visual style, completing the article's automatic illustration.
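The coarse-to-fine selection in step 3 can be sketched as follows. This is a minimal illustration, not the actual implementation: the `retrieval_score` and `vlm_score` callables are stand-ins for a real text-image retrieval model and for the multimodal large model's judgment, respectively.

```python
# Hypothetical sketch of coarse-to-fine illustration selection.
# Coarse stage: rank the gallery by a cheap retrieval score and keep top-k.
# Fine stage: let a (mock) multimodal scorer pick the best candidate.
from typing import Callable

def select_illustration(
    caption: str,
    gallery: list[str],
    retrieval_score: Callable[[str, str], float],
    vlm_score: Callable[[str, str], float],
    k: int = 4,
) -> str:
    candidates = sorted(
        gallery, key=lambda img: retrieval_score(caption, img), reverse=True
    )[:k]
    return max(candidates, key=lambda img: vlm_score(caption, img))

# Toy scorer: count words shared between the caption and an image's tag string.
def overlap(caption: str, image_tags: str) -> float:
    return len(set(caption.lower().split()) & set(image_tags.lower().split()))

gallery = ["temple autumn leaves", "city skyline night",
           "temple gate winter snow", "beach sunset"]
best = select_illustration("ancient temple in autumn", gallery,
                           retrieval_score=overlap, vlm_score=overlap)
print(best)  # -> "temple autumn leaves"
```

In the real pipeline the coarse stage exists because scoring every gallery image with the multimodal model would be far too slow; retrieval narrows millions of images down to a handful before the expensive fine-grained comparison.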

Capability evaluation: leading open-source multimodal large models across the board

InternLM-XComposer's excellent illustrated-writing results stem from the powerful multimodal understanding of its multi-task pre-trained model (InternLM-XComposer-VL-7B).

The researchers evaluated InternLM-XComposer-VL-7B in detail on five mainstream multimodal benchmarks:

- MME Benchmark: a comprehensive evaluation of multimodal models comprising 14 subtasks, focusing on perception and cognition capabilities;

- MMBench: a multimodal benchmark covering 20 ability dimensions, evaluated with a ChatGPT-based circular evaluation strategy;

- MMBench-CN: the MMBench evaluation with questions and answers translated into Simplified Chinese;

- SEED-Bench: a multimodal benchmark of 19,000 human-annotated multiple-choice questions;

- CCBench: a Chinese multimodal benchmark focused on Chinese cultural understanding.
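MMBench's circular evaluation can be sketched as follows. This is a simplified illustration only, assuming a `model` callable that returns one of the given options; the real protocol additionally uses ChatGPT to match a model's free-form answer to one of the choices.

```python
# Simplified sketch of MMBench-style circular evaluation: each multiple-choice
# question is asked once per rotation of its options, and only counts as
# correct if the model answers correctly under every rotation. This penalizes
# models that merely prefer a particular option position.
def rotations(options: list[str]) -> list[list[str]]:
    return [options[i:] + options[:i] for i in range(len(options))]

def circular_eval(model, questions) -> float:
    """`questions` is a list of (question, options, correct_answer) triples.
    `model(question, options)` returns the chosen option string."""
    passed = 0
    for question, options, answer in questions:
        if all(model(question, rot) == answer for rot in rotations(options)):
            passed += 1
    return passed / len(questions)

# Toy model: picks the option containing "red", else falls back to option A.
toy = lambda q, opts: next((o for o in opts if "red" in o), opts[0])
qs = [("What color is the apple?", ["red", "green", "blue"], "red"),
      ("What shape is shown?", ["circle", "square", "star"], "square")]
print(circular_eval(toy, qs))  # -> 0.5
```

The toy model answers the first question correctly under all three rotations, but its position bias on the second question is exposed by rotating the options, so only half the questions pass.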

The results show that InternLM-XComposer performs excellently across all five of these Chinese and English multimodal benchmarks.

Performance comparison of InternLM-XComposer with other open-source models

The MME Benchmark focuses on a model's perception and cognition capabilities, where InternLM-XComposer leads in overall performance.

Across MMBench's 20 ability dimensions, InternLM-XComposer achieved the best score.

MMBench-CN, the Chinese version of the MMBench evaluation, focuses on a model's Chinese multimodal understanding. InternLM-XComposer again achieved the best results, demonstrating its strong Chinese-language capability.

SEED-Bench provides 19,000 human-annotated multimodal multiple-choice questions covering 12 evaluation dimensions. InternLM-XComposer shows excellent accuracy in understanding image content.

On CCBench, a multimodal benchmark designed for Chinese cultural understanding, InternLM-XComposer's scores are significantly ahead, vividly demonstrating its deep accumulation of knowledge about Chinese culture.

InternLM-XComposer is now open source and available on GitHub, Hugging Face, and ModelScope. Developers are welcome to download and try it out.