daVinci MagiHuman Text / Image to Video Generator with Audio Sync

Create videos with daVinci MagiHuman - a 15B open-source audio-video foundation model by Sand.ai and SII GAIR Lab. Generate synchronized video and audio from text or images with industry-leading lip sync accuracy across 7 languages. Supports up to 1080p resolution with 5-10 second duration. Powered by a single-stream Transformer architecture with no cross-attention, delivering 5-second 256p video in just 2 seconds on a single H100.

daVinci MagiHuman Text to Video Gallery

Experience the cinematic power of daVinci MagiHuman text-to-video generation. Create stunning videos with synchronized audio from detailed text descriptions, featuring industry-leading lip sync across 7 languages.

Rainy Tokyo Night

A woman in a red coat walks through a neon-lit Tokyo alley on a rainy night with shimmering reflections.

Prompt

“Rainy night in a neon-lit Tokyo alley, a woman in a red coat walks slowly under an umbrella. Reflections shimmer on wet cobblestones. Handheld camera follows her from behind, bokeh street lights, cinematic color grade, moody atmosphere.”

daVinci MagiHuman Image to Video Gallery

Transform your static images into dynamic videos with daVinci MagiHuman. Experience seamless image-to-video conversion with realistic facial expressions, natural body motion, and synchronized lip-synced audio.

Podcast Host Speaking

daVinci MagiHuman YouTube Videos

Watch community demonstrations and reviews showcasing daVinci MagiHuman's audio-video generation capabilities

daVinci MagiHuman Popular Reviews on X

See what people are saying about daVinci MagiHuman on X (Twitter)

🪄 Introducing daVinci-MagiHuman: The Performance-Level Audio-Video Generative Foundation Model Proudly open-sourced and jointly developed by SII GAIR Lab & Sand.ai, it sets a new standard for multimodal AI. ⏳ 1/6

2:30 PM · Mar 23, 2026

daVinci-MagiHuman is a 15B single-stream Transformer, trained from scratch to generate synced video+audio with self-attention only—no cross-attention or multi-stream paths. It is open-source, supports 6 languages, beats Ovi/LTX, and runs on one H100.

2:03 AM · Mar 25, 2026

I have been testing open source daVinci-MagiHuman, a single-stream 15B Transformer trained from scratch that jointly generates video + audio. 5s 1080p video in 38s on a single H100, about 1 minute on newer gaming Nvidia GPUs By @SII_GAIR + @SandAI_HQ

1:23 PM · Mar 25, 2026

うみゆき@AI研究

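A new video generation model called daVinci-MagiHuman has been released openly. Word is that it is even better than LTX-2.3. The audio generation in particular is apparently good. It is also multilingual, and Japanese speech is listed as supported. GAIR, the developer, is apparently a research lab within the Shanghai Innovation Institute. reddit.com/r/StableDiffus…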

6:54 AM · Mar 25, 2026

チャエン | デジライズ CEO《重要AIニュースを毎日最速で発信⚡️》

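The open-source model "daVinci-MagiHuman", which generates video and audio simultaneously, has arrived. ・Top-class performance in the OSS space ・Supports 6 languages: Japanese, Chinese, English, Korean, German, and French ・Speech recognition error rate of 14.6%. A rival to the closed Seedance 2.0. The demos look highly accurate. It reportedly generates a 5-second 1080p video in 38 seconds on an H100.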

9:51 PM · Mar 25, 2026

田中義弘 | taziku CEO / AI × Creative

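Can open source compete in video generation AI? daVinci-MagiHuman is a fully open-source model that generates video and audio simultaneously with a single-stream 15B Transformer. Win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3. Generates a 5-second 1080p video in 38.4 seconds on an H100. It supports Japanese too! Details in the 🧵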

11:04 AM · Mar 26, 2026

DaVinci-MagiHuman for ComfyUI. - 15B-param single-stream model runs in ~6GB VRAM via block-level swapping; - 8-step distillation; github.com/mjansrud/Comfy…

Wildminder

@wildmindai

daVinci-MagiHuman. We have another fast single-stream audio-video 15B foundation model by @SandAI_HQ > no separate pathways or cross-attention modules. > just raw self-attention doing all the heavy lifting. > wins 80% vs Ovi 1.1, 60% vs LTX 2.3; > native multilingual realistic

9:35 AM · Mar 27, 2026

Reel · Specifications

What's daVinci MagiHuman

Sand.ai's 15B open-source audio-video foundation model with best-in-class lip sync

· 01 · 15B Parameters
· 02 · 1080p Max Resolution
· 03 · 7 Languages Supported
· 04 · 2 s Generation Speed at 256p

daVinci MagiHuman is a 15-billion-parameter single-stream Transformer that jointly generates synchronized video and audio from text or images. Its reported word error rate (WER) on generated speech is 14.6% across 7 languages, where lower is better.
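
To make the word error rate figure concrete, here is a minimal sketch of how WER is conventionally computed: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The page does not describe the evaluation protocol (which speech recognizer was used, or how text was normalized), so the function name, the lowercase-and-split normalization, and the sample sentences are illustrative assumptions only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Normalization here (lowercase + whitespace split) is an assumption,
    not the protocol behind the figures quoted on this page.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution (fox -> box) and one deletion (the).
reference = "the quick brown fox jumps over the lazy dog"
recognized = "the quick brown box jumps over lazy dog"
wer = word_error_rate(reference, recognized)
print(f"WER = {wer:.1%}")  # 2 errors over 9 reference words
```

Under this definition, a 14.6% WER means roughly one word in seven of the intended script is substituted, dropped, or inserted when the generated audio is transcribed.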

Reel · Capabilities

daVinci MagiHuman's Powerful Features

Discover the advanced capabilities that make daVinci MagiHuman exceptional for audio-video generation

Feature 01 / 08
Joint Audio-Video Generation
Generate synchronized video and audio in a single pass using a unified single-stream Transformer architecture with self-attention only, eliminating the need for separate audio pipelines.
Feature 02 / 08
Industry-Leading Lip Sync
Achieves a 14.6% word error rate (WER) on generated speech, where lower is better, compared with 40.45% for Ovi 1.1 and 19.23% for LTX 2.3 in speech accuracy benchmarks.
Feature 03 / 08
7-Language Speech Support
Generate speech-synchronized videos in English, Chinese (Mandarin and Cantonese), Japanese, Korean, German, and French with natural pronunciation and lip movements.
Feature 04 / 08
Ultra-Fast Generation
Produce a 5-second 256p video in just 2 seconds on a single H100 GPU. 8-step DMD-2 distillation eliminates the need for classifier-free guidance without quality loss.
Feature 05 / 08
Dual Input Modes
Create videos from text prompts or animate still images. Both text-to-video and image-to-video modes support configurable aspect ratios, resolutions, and durations from 5 to 10 seconds.
Feature 06 / 08
Up to 1080p Super-Resolution
Generate videos at 256p, 540p, 720p, or 1080p through a latent-space super-resolution pipeline that upscales without extra VAE decode-encode overhead for efficient high-res output.
Feature 07 / 08
Open Source Apache 2.0
Fully open-sourced under Apache 2.0 license with complete stack including base weights, distilled model, super-resolution model, and inference code for unrestricted commercial use.
Feature 08 / 08
Human-Centric Excellence
Specialized in digital human generation with expressive facial performance, realistic body motion, and consistent character preservation across frames for professional talking-head content.
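
The "single-stream, self-attention only" idea from Feature 01 can be illustrated with a toy sketch. It assumes that text, video, and audio tokens are concatenated into one sequence and that one attention operation mixes them all, which is one plausible reading of the description rather than the model's actual implementation. Token counts, the embedding size, and the identity projections (no learned weights) are all hypothetical simplifications.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Every token attends to every other token in the same sequence,
    so there is no separate cross-attention path between modalities.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        output.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return output


def random_tokens(count, dim, rng):
    """Stand-in embeddings; a real model would produce these with encoders."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8  # hypothetical embedding size
text_tokens = random_tokens(4, DIM, rng)
video_tokens = random_tokens(6, DIM, rng)
audio_tokens = random_tokens(5, DIM, rng)

# Single stream: one concatenated sequence, one attention operation.
stream = text_tokens + video_tokens + audio_tokens
mixed = self_attention(stream)

# Recover per-modality outputs by position.
n_text, n_video = len(text_tokens), len(video_tokens)
video_out = mixed[n_text:n_text + n_video]
audio_out = mixed[n_text + n_video:]
print(len(mixed), len(video_out), len(audio_out))
```

Because audio and video tokens sit in the same sequence, each audio token's output already depends on every video token, which is the intuition behind generating both modalities in sync without a separate audio pipeline.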

FAQ

Frequently Asked Questions

Common questions about daVinci MagiHuman audio-video generation

What input modes does daVinci MagiHuman support?

daVinci MagiHuman supports two primary input modes: Text-to-Video (generate videos with synchronized audio from text prompts) and Image-to-Video (animate still images into videos with optional audio). Both modes support configurable aspect ratios (16:9 landscape, 9:16 portrait), resolutions up to 1080p, and durations from 5 to 10 seconds.

Which languages does daVinci MagiHuman support?

daVinci MagiHuman supports synchronized speech generation in 7 languages: English, Chinese (Mandarin), Cantonese, Japanese, Korean, German, and French. The model's reported word error rate (WER) on generated speech is 14.6%, where lower is better, compared with 40.45% for Ovi 1.1 and 19.23% for LTX 2.3.

What resolutions and durations are supported?

daVinci MagiHuman supports four resolutions: 256p (fastest), 540p (super-resolution), 720p, and 1080p (super-resolution). Video duration can be configured from 5 to 10 seconds in 1-second steps. Both landscape (16:9) and portrait (9:16) aspect ratios are supported.
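
The constraints in this answer can be captured in a small validation sketch. The class name `GenerationSettings`, its field names, and its defaults are hypothetical and do not correspond to any published API; only the allowed values come from the text above.

```python
from dataclasses import dataclass

# Allowed values as stated on this page.
ALLOWED_RESOLUTIONS = ("256p", "540p", "720p", "1080p")
ALLOWED_ASPECT_RATIOS = ("16:9", "9:16")
MIN_SECONDS, MAX_SECONDS = 5, 10


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical settings object; field names are illustrative."""

    resolution: str = "256p"
    aspect_ratio: str = "16:9"
    duration_seconds: int = 5

    def __post_init__(self):
        if self.resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
        if self.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
            raise ValueError(f"aspect_ratio must be one of {ALLOWED_ASPECT_RATIOS}")
        # Durations are configured in whole seconds (1-second steps).
        is_whole = isinstance(self.duration_seconds, int) and not isinstance(
            self.duration_seconds, bool
        )
        if not is_whole:
            raise ValueError("duration_seconds must be a whole number of seconds")
        if not MIN_SECONDS <= self.duration_seconds <= MAX_SECONDS:
            raise ValueError(
                f"duration_seconds must be between {MIN_SECONDS} and {MAX_SECONDS}"
            )


portrait_clip = GenerationSettings(
    resolution="1080p", aspect_ratio="9:16", duration_seconds=10
)
print(portrait_clip)
```

Validating up front in this way rejects a request such as a 12-second clip before any generation time or credits are spent.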

How fast is daVinci MagiHuman?

On a single NVIDIA H100 GPU, daVinci MagiHuman generates a 5-second 256p video in approximately 2 seconds. Higher resolutions take longer: about 8 seconds at 540p and about 38.4 seconds at 1080p for a 5-second video. This speed is attributed to 8-step DMD-2 distillation, which removes the need for classifier-free guidance.
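
As a quick worked example, the timings quoted in this answer can be converted into a real-time factor, meaning seconds of video produced per second of compute. The figures are the ones stated above for a single H100; the function and variable names are illustrative.

```python
CLIP_SECONDS = 5.0

# Reported generation times for a 5-second clip on one H100 (from the text above).
GENERATION_SECONDS = {"256p": 2.0, "540p": 8.0, "1080p": 38.4}


def realtime_factor(clip_seconds: float, generation_seconds: float) -> float:
    """Seconds of video produced per second of compute (above 1.0 is faster than real time)."""
    return clip_seconds / generation_seconds


factors = {
    resolution: realtime_factor(CLIP_SECONDS, seconds)
    for resolution, seconds in GENERATION_SECONDS.items()
}

for resolution, factor in factors.items():
    print(f"{resolution}: {factor:.2f}x real time")
```

By this measure only the 256p setting runs faster than real time (2.5x), while 1080p output takes roughly 7.7 seconds of compute per second of video.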

Is daVinci MagiHuman open source?

Yes, daVinci MagiHuman is fully open-sourced under the Apache 2.0 license by Sand.ai and SII GAIR Lab. The complete stack is available, including base model weights, the distilled model, the super-resolution model, and inference code, and the license permits commercial use, modification, and distribution.

How does daVinci MagiHuman differ from other video models?

daVinci MagiHuman stands out for its single-stream Transformer architecture, which uses self-attention only (no cross-attention or multi-stream paths) and so generates audio and video jointly in a single model. It reports a 14.6% WER on generated speech, supports 7 languages for speech, and has an 80% win rate against Ovi 1.1 in human evaluation of visual quality.

How to Use daVinci MagiHuman Text to Video

Generate videos with synchronized audio from text descriptions

Write Your Prompt

Enter a detailed description of the video you want to create. Include subject, action, speech content, and desired language for best lip-synced results.
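
One way to keep prompts consistent is to assemble them from the four elements named above. The helper below is a hypothetical sketch: the function name, argument names, and sentence template are assumptions for illustration, not a format the model is documented to require.

```python
def build_prompt(subject: str, action: str, speech: str, language: str, style: str = "") -> str:
    """Combine subject, action, speech content, and language into one prompt string."""
    for name, value in (("subject", subject), ("action", action),
                        ("speech", speech), ("language", language)):
        if not value.strip():
            raise ValueError(f"{name} must not be empty")

    parts = [
        f"{subject.strip()} {action.strip()}.",
        f'Speaking in {language.strip()}, they say: "{speech.strip()}"',
    ]
    if style.strip():
        parts.append(style.strip())  # optional camera or mood description
    return " ".join(parts)


prompt = build_prompt(
    subject="A podcast host in a cozy studio",
    action="leans toward the microphone and smiles",
    speech="Welcome back to the show.",
    language="English",
    style="Soft key light, shallow depth of field.",
)
print(prompt)
```

Keeping the spoken line in quotation marks and naming the language explicitly makes it easy to check that both made it into the final prompt.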

How to Use daVinci MagiHuman Image to Video

Animate still images into videos with synchronized audio

Upload Your Image

Upload a reference image of the person or scene you want to animate. daVinci MagiHuman excels at human-centric content with realistic facial expressions and body motion.

Pricing · Choose Yours

Flexible AI Pricing

Pay-as-you-go credits or subscription plans. No hidden fees, cancel anytime.

One-time purchases support crypto payment (BTC, USDT, ETH, and 350+ others)

Monthly billing

Free (One Time)

Try before you buy

0 USD, one time

32 points

Up to 3 videos

Up to 32 images

Multi-Model Support

Text to Video

Image to Video

Video to Video

Consistent Character

AI Animation Generator

Templates & Effects

AI Video Enhancers

Interactive Community

Faster Generation Speed

No-watermark Outputs

More Camera Movement

Private Video Visibility

Copy Protection

Priority Support

Pro (1 Month) · Popular

Elevate your AI experience

29.99 USD per month

800 points per month

Up to 80 videos per month

Up to 800 images per month

3 parallel tasks

Multi-Model Support

Text to Video

Image to Video

Video to Video

Consistent Character

AI Animation Generator

Templates & Effects

AI Video Enhancers

Interactive Community

Faster Generation Speed

No-watermark Outputs

More Camera Movement

Private Video Visibility

Copy Protection

Priority Support

Lite (1 Month)

Start your AI journey

19.99 USD per month

300 points per month

Up to 30 videos per month

Up to 300 images per month

3 parallel tasks

Multi-Model Support

Text to Video

Image to Video

Video to Video

Consistent Character

AI Animation Generator

Templates & Effects

AI Video Enhancers

Interactive Community

Faster Generation Speed

No-watermark Outputs

More Camera Movement

Private Video Visibility

Copy Protection

Priority Support
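
For comparing the plans above, the short sketch below works out the effective price per point and per video at each plan's stated cap. The prices, points, and caps are taken from the plan listings; the per-video point cost is inferred from those caps and is an assumption, not a published rate.

```python
# Plan figures as listed above.
PLANS = {
    "Free": {"price_usd": 0.00, "points": 32, "max_videos": 3, "max_images": 32},
    "Lite": {"price_usd": 19.99, "points": 300, "max_videos": 30, "max_images": 300},
    "Pro": {"price_usd": 29.99, "points": 800, "max_videos": 80, "max_images": 800},
}


def usd_per_point(plan: dict) -> float:
    """Effective price of one point."""
    return plan["price_usd"] / plan["points"]


def usd_per_video(plan: dict) -> float:
    """Effective price of one video if the whole allowance goes to videos."""
    return plan["price_usd"] / plan["max_videos"]


def implied_points_per_video(plan: dict) -> int:
    """Inferred cost in points per video (assumption based on the listed caps)."""
    return plan["points"] // plan["max_videos"]


for name, plan in PLANS.items():
    print(
        f"{name}: ${usd_per_point(plan):.4f}/point, "
        f"${usd_per_video(plan):.2f}/video, "
        f"~{implied_points_per_video(plan)} points/video"
    )
```

The listed caps are consistent with roughly 10 points per video and 1 point per image on every plan, and on that basis Pro works out to a little over half of Lite's price per point.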
