The launch of the text-to-video AI model Vidu at the 2024 Zhongguancun Forum on April 27, 2024. Photo: Courtesy of Zhongguancun Forum
Chinese tech firm ShengShu-AI and Tsinghua University on Saturday unveiled Vidu, a text-to-video artificial intelligence (AI) model said to be the first in China on par with Sora, in another sign of China's rapid development in the critical emerging field of AI.
Launched at the ongoing Zhongguancun Forum in Beijing, Vidu can generate a 16-second 1080p video clip with one click. It is built on a self-developed visual model architecture called Universal Vision Transformer (U-ViT), which combines two AI techniques, the diffusion model and the Transformer, the developers said.
The text-to-video model arrived about two months after Sora, developed by US-based OpenAI, was released to great fanfare worldwide.
"After the release of Sora, we found that it closely aligned with our technical roadmap, which further motivated us to advance our research with determination," Zhu Jun, vice dean of the Institute for Artificial Intelligence at Tsinghua University and chief scientist of ShengShu-AI, said at the forum.
The core technology of U-ViT was first proposed by Vidu's research team in September 2022, earlier than DiT (Diffusion Transformer), the model architecture behind Sora. That makes U-ViT the world's first visual model architecture to combine the advantages of the diffusion model and the Transformer, according to media reports.
During a live demonstration on Saturday, Vidu showed that it can simulate the real physical world and generate scenes with complex details consistent with physical laws, such as realistic light and shadow effects and delicate facial expressions. It can also generate complex dynamic shots rather than only fixed ones.
Moreover, having been developed in China, Vidu has a strong grasp of Chinese cultural elements and can generate images of distinctively Chinese figures such as the panda and the loong, according to media reports.
Global Times