daVinci MagiHuman Text / Image to Video Generator with Audio Sync

Create videos with daVinci MagiHuman, a 15B-parameter open-source audio-video foundation model by Sand.ai and SII GAIR Lab. Generate synchronized video and audio from text or images with industry-leading lip sync accuracy across 7 languages. Supports resolutions up to 1080p and durations of 5 to 10 seconds. Powered by a single-stream Transformer architecture with no cross-attention, it generates a 5-second 256p video in about 2 seconds on a single H100.

daVinci MagiHuman Text to Video Gallery

Experience the cinematic power of daVinci MagiHuman text-to-video generation. Create stunning videos with synchronized audio from detailed text descriptions, featuring industry-leading lip sync across 7 languages.

Rainy Tokyo Night

A woman in a red coat walks through a neon-lit Tokyo alley on a rainy night with shimmering reflections.

Prompt

Rainy night in a neon-lit Tokyo alley, a woman in a red coat walks slowly under an umbrella. Reflections shimmer on wet cobblestones. Handheld camera follows her from behind, bokeh street lights, cinematic color grade, moody atmosphere.

daVinci MagiHuman Image to Video Gallery

Transform your static images into dynamic videos with daVinci MagiHuman. Experience seamless image-to-video conversion with realistic facial expressions, natural body motion, and synchronized lip-synced audio.

Podcast Host Speaking (input image and output video)

daVinci MagiHuman YouTube Videos

Watch community demonstrations and reviews showcasing daVinci MagiHuman's audio-video generation capabilities

  • daVinci-MagiHuman: Fast Audio-Video Synthesis - AI Research Roundup
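  • DaVinci's latest open-source model takes on Seedance 2.0. DaVinci-MagiHuman: a new benchmark for open-source audio-video generation, a 5-second video in 2 seconds, and it speaks 6 languages! - XIAOXIAO LI
  • Are LTX 2.3, Veo, and Sora no longer needed? Testing daVinci-MagiHuman - ServerFlow AI Lab - R&D in AI and LLM
  • AI Animation 224 - Simplifying the complex! daVinci-MagiHuman, a single-stream architecture for a fast audio-video generation foundation model, with multilingual support, audio-video sync, and voice timbre reference - T8 ComfyUI tutorial - T8star-Aix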
  • New OpenSource Video Model, #1 Image generator, Seedance 2.0 Drop, replit and lovable in danger - AI Research

daVinci MagiHuman Popular Reviews on X

See what people are saying about daVinci MagiHuman on X (Twitter)

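"daVinci-MagiHuman", an open-source model that generates video and audio simultaneously, has arrived. ・Top-class performance in the OSS space ・Supports 6 languages: Japanese, Chinese, English, Korean, German, and French ・Speech recognition error rate of 14.6% It takes on the closed Seedance 2.0. The demos look highly accurate. Reportedly it generated a 5-second 1080p video in 38 seconds on an H100.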

DaVinci-MagiHuman for ComfyUI. - 15B-param single-stream model runs in ~6GB VRAM via block-level swapping; - 8-step distillation; github.com/mjansrud/Comfy…

Wildminder
@wildmindai

daVinci-MagiHuman. We have another fast single-stream audio-video 15B foundation model by @SandAI_HQ > no separate pathways or cross-attention modules. > just raw self-attention doing all the heavy lifting. > wins 80% vs Ovi 1.1, 60% vs LTX 2.3; > native multilingual realistic

What's daVinci MagiHuman

Sand.ai's 15B open-source audio-video foundation model with best-in-class lip sync

15B Parameters
1080p Max Resolution
7 Languages Supported
2s 256p Generation Speed

daVinci MagiHuman is a 15-billion parameter single-stream Transformer that jointly generates synchronized video and audio from text or images, achieving industry-leading lip sync accuracy with a 14.6% word error rate across 7 languages.

daVinci MagiHuman's Powerful Features

Discover the advanced capabilities that make daVinci MagiHuman exceptional for audio-video generation

Joint Audio-Video Generation

Generate synchronized video and audio in a single pass using a unified single-stream Transformer architecture with self-attention only, eliminating the need for separate audio pipelines.
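
As a rough illustration of what "single-stream, self-attention only" means, the toy sketch below concatenates text, video, and audio tokens into one sequence and runs a single self-attention pass over it, so every token can attend to every other token without a separate cross-attention module. This is not the model's actual code: the function names, the identity query/key/value projections, and the tiny 2-dimensional tokens are all assumptions made for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (illustration only)."""
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        # Scaled dot-product score of this token against every token in the stream.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Each output token is a weighted average of all tokens in the stream.
        mixed.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return mixed


def single_stream_step(text_tokens, video_tokens, audio_tokens):
    """Hypothetical helper: mix all modalities in one sequence, then split them back."""
    stream = text_tokens + video_tokens + audio_tokens
    mixed = self_attention(stream)
    n_text, n_video = len(text_tokens), len(video_tokens)
    return (
        mixed[:n_text],
        mixed[n_text:n_text + n_video],
        mixed[n_text + n_video:],
    )
```

Because audio and video tokens share one sequence, the audio outputs already depend on the video tokens (and vice versa) after a single pass, which is the property the feature description attributes to the unified architecture.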

Industry-Leading Lip Sync

Achieve a 14.6% word error rate (lower is better) for lip-synced speech, compared with 40.45% for Ovi 1.1 and 19.23% for LTX 2.3 in speech accuracy benchmarks.
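
For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript of the generated speech into the reference text, divided by the number of reference words. The page does not describe the exact evaluation pipeline, so the sketch below only shows the standard definition; the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick fox"))  # 0.25
```

Whitespace splitting does not suit languages written without spaces, such as Chinese or Japanese, where a character-level variant of the same computation is commonly used instead.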

7-Language Speech Support

Generate speech-synchronized videos in English, Chinese (Mandarin and Cantonese), Japanese, Korean, German, and French with natural pronunciation and lip movements.

Ultra-Fast Generation

Produce a 5-second 256p video in just 2 seconds on a single H100 GPU. 8-step DMD-2 distillation eliminates the need for classifier-free guidance without quality loss.

Dual Input Modes

Create videos from text prompts or animate still images. Both text-to-video and image-to-video modes support configurable aspect ratios, resolutions, and durations from 5 to 10 seconds.

Up to 1080p Super-Resolution

Generate videos at 256p, 540p, 720p, or 1080p through a latent-space super-resolution pipeline that upscales without extra VAE decode-encode overhead for efficient high-res output.

Open Source Apache 2.0

Fully open-sourced under Apache 2.0 license with complete stack including base weights, distilled model, super-resolution model, and inference code for unrestricted commercial use.

Human-Centric Excellence

Specialized in digital human generation with expressive facial performance, realistic body motion, and consistent character preservation across frames for professional talking-head content.

Frequently Asked Questions

Common questions about daVinci MagiHuman audio-video generation

daVinci MagiHuman supports two primary input modes: Text-to-Video (generate videos with synchronized audio from text prompts) and Image-to-Video (animate still images into motion videos with optional audio). Both modes support configurable aspect ratios (16:9 landscape, 9:16 portrait), resolutions up to 1080p, and durations from 5 to 10 seconds.
daVinci MagiHuman supports synchronized speech generation in 7 languages: English, Chinese (Mandarin), Cantonese, Japanese, Korean, German, and French. The model achieves a 14.6% word error rate for lip synchronization, significantly outperforming competitors like Ovi 1.1 (40.45% WER) and LTX 2.3 (19.23% WER).
daVinci MagiHuman supports multiple resolutions: 256p (fastest), 540p (super-resolution), 720p, and 1080p (super-resolution). Video duration can be configured from 5 to 10 seconds with 1-second granularity. Both landscape (16:9) and portrait (9:16) aspect ratios are supported.
On a single NVIDIA H100 GPU, daVinci MagiHuman generates a 5-second 256p video in approximately 2 seconds. For higher resolutions, generation times increase: 540p takes about 8 seconds and 1080p takes about 38.4 seconds for a 5-second video. This speed is achieved through 8-step DMD-2 distillation that eliminates classifier-free guidance.
Yes, daVinci MagiHuman is fully open-sourced under the Apache 2.0 license by Sand.ai and SII GAIR Lab. The complete stack is available including base model weights, distilled model, super-resolution model, and inference code, allowing unrestricted commercial use, modification, and distribution.
daVinci MagiHuman stands out for its single-stream Transformer architecture, which uses self-attention only (no cross-attention or multi-stream paths) and enables joint audio-video generation in a single model. It achieves best-in-class lip sync accuracy (14.6% WER), supports speech in 7 languages, and reaches an 80% win rate against Ovi 1.1 in human evaluation of visual quality.
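
To put the single-H100 timings quoted above into perspective, the short sketch below converts them into a real-time factor (seconds of video produced per second of compute) and a slowdown relative to 256p. The numbers are taken from the FAQ answer; the dictionary and function names are illustrative, and no timing is given on this page for 720p, so it is left out.

```python
# Seconds of H100 time reported on this page for a 5-second clip.
H100_SECONDS_FOR_5S_CLIP = {"256p": 2.0, "540p": 8.0, "1080p": 38.4}
CLIP_SECONDS = 5.0


def realtime_factor(resolution: str) -> float:
    """Seconds of video generated per second of compute (above 1 is faster than real time)."""
    return CLIP_SECONDS / H100_SECONDS_FOR_5S_CLIP[resolution]


def slowdown_vs_256p(resolution: str) -> float:
    """How many times longer a resolution takes than the 256p baseline."""
    return H100_SECONDS_FOR_5S_CLIP[resolution] / H100_SECONDS_FOR_5S_CLIP["256p"]


for name in H100_SECONDS_FOR_5S_CLIP:
    print(f"{name}: {realtime_factor(name):.2f}x real time, "
          f"{slowdown_vs_256p(name):.1f}x the 256p time")
```

By these figures only the 256p setting runs faster than real time (2.5x); 540p and 1080p trade speed for resolution.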

How to Use daVinci MagiHuman Text to Video

Generate videos with synchronized audio from text descriptions

1. Write Your Prompt: Enter a detailed description of the video you want to create. Include subject, action, speech content, and desired language for best lip-synced results.
2. Configure Settings
3. Generate Video
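
The configurable settings mentioned on this page (resolutions of 256p, 540p, 720p, or 1080p, a 16:9 or 9:16 aspect ratio, and a duration of 5 to 10 seconds in 1-second steps) can be captured in a small validation sketch. The `GenerationSettings` class and its field names are hypothetical and do not correspond to any published daVinci MagiHuman API; only the allowed values come from the page.

```python
from dataclasses import dataclass

# Allowed values as listed on this page.
SUPPORTED_RESOLUTIONS = ("256p", "540p", "720p", "1080p")
SUPPORTED_ASPECT_RATIOS = ("16:9", "9:16")
MIN_DURATION_SECONDS = 5
MAX_DURATION_SECONDS = 10


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container for one text-to-video request."""
    prompt: str
    resolution: str = "256p"
    aspect_ratio: str = "16:9"
    duration_seconds: int = 5

    def __post_init__(self):
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.resolution not in SUPPORTED_RESOLUTIONS:
            raise ValueError(f"resolution must be one of {SUPPORTED_RESOLUTIONS}")
        if self.aspect_ratio not in SUPPORTED_ASPECT_RATIOS:
            raise ValueError(f"aspect_ratio must be one of {SUPPORTED_ASPECT_RATIOS}")
        # Durations are whole seconds because the page states 1-second granularity.
        if not isinstance(self.duration_seconds, int) or not (
            MIN_DURATION_SECONDS <= self.duration_seconds <= MAX_DURATION_SECONDS
        ):
            raise ValueError("duration_seconds must be a whole number from 5 to 10")


settings = GenerationSettings(
    prompt="A podcast host greets listeners in French, medium close-up.",
    resolution="720p",
    aspect_ratio="9:16",
    duration_seconds=8,
)
```

Validating before submitting a job surfaces an unsupported combination immediately rather than after a generation attempt.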

How to Use daVinci MagiHuman Image to Video

Animate still images into videos with synchronized audio

1. Upload Your Image: Upload a reference image of the person or scene you want to animate. daVinci MagiHuman excels at human-centric content with realistic facial expressions and body motion.
2. Add Prompt and Settings
3. Generate Animated Video

Flexible AI Pricing

Pay-as-you-go credits or subscription plans. No hidden fees, cancel anytime.

Monthly billing

Free

Try before you buy

0 USD, one time
32 points
Up to 3 videos
Up to 32 images
Multi-Model Support
Text to Video
Image to Video
Video to Video
Consistent Character
AI Animation Generator
Templates & Effects
AI Video Enhancers
Interactive Community
Faster Generation Speed
No-watermark Outputs
More Camera Movement
Private Video Visibility
Copy Protection
Priority Support

Pro (Popular)

Elevate your AI experience

29.99 USD per month
800 points per month
Up to 80 videos per month
Up to 800 images per month
3 parallel tasks
Multi-Model Support
Text to Video
Image to Video
Video to Video
Consistent Character
AI Animation Generator
Templates & Effects
AI Video Enhancers
Interactive Community
Faster Generation Speed
No-watermark Outputs
More Camera Movement
Private Video Visibility
Copy Protection
Priority Support

Lite

Start your AI journey

9.99 USD per month
200 points per month
Up to 20 videos per month
Up to 200 images per month
3 parallel tasks
Multi-Model Support
Text to Video
Image to Video
Video to Video
Consistent Character
AI Animation Generator
Templates & Effects
AI Video Enhancers
Interactive Community
Faster Generation Speed
No-watermark Outputs
More Camera Movement
Private Video Visibility
Copy Protection
Priority Support
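
The plan figures above imply a rough exchange rate between points and outputs. The sketch below derives points per video and the effective price per video at each plan's maximum usage. The plan data is copied from the pricing cards; the per-video point cost is only an inference from the "up to" limits, not a published rate, and the function names are illustrative.

```python
# Plan figures as listed in the pricing cards above.
PLANS = {
    "Free": {"price_usd": 0.0, "points": 32, "videos": 3, "images": 32},
    "Lite": {"price_usd": 9.99, "points": 200, "videos": 20, "images": 200},
    "Pro": {"price_usd": 29.99, "points": 800, "videos": 80, "images": 800},
}


def points_per_video(plan: str) -> float:
    """Points consumed per video if the whole allowance is spent on videos (inferred)."""
    return PLANS[plan]["points"] / PLANS[plan]["videos"]


def usd_per_video(plan: str) -> float:
    """Effective price per video when the full video allowance is used."""
    return PLANS[plan]["price_usd"] / PLANS[plan]["videos"]


for name in PLANS:
    print(f"{name}: {points_per_video(name):.1f} points/video, "
          f"${usd_per_video(name):.3f}/video")
```

On these numbers the paid plans both work out to 10 points per video, with Pro costing less per video than Lite at full usage.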