sDFT: Scaling Diffusion Field Transformers on Images, Videos, and 3D Data

Abstract

Diffusion Probabilistic Field (DPF) models the distribution of continuous functions defined over metric spaces. While DPF shows great potential for unifying data generation across modalities including images, videos, and 3D geometry, it does not scale to higher data resolutions. This can be attributed to the scaling property, where it is difficult for the model to capture local structures through uniform sampling. To this end, we propose a new model comprising a view-wise sampling algorithm that focuses on local structure learning and additional guidance, e.g., text descriptions, that complements the global geometry. The model can be scaled to generate high-resolution data while unifying multiple modalities. Experimental results on data generation in various modalities demonstrate the effectiveness of our model, as well as its potential as a foundation framework for scalable modality-unified visual content generation.
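As a rough illustration of why uniform sampling struggles with local structure, the sketch below compares two ways of spending a fixed budget of coordinate samples on a multi-view signal such as a video. It is not the sDFT algorithm itself: the function names, the `(view, y, x)` coordinate layout, and all sizes are assumptions made for this example only.

```python
import random


def uniform_sample(num_views, height, width, budget, rng):
    """Draw `budget` coordinates uniformly over the whole (view, y, x) domain."""
    return [
        (rng.randrange(num_views), rng.randrange(height), rng.randrange(width))
        for _ in range(budget)
    ]


def view_wise_sample(num_views, height, width, budget, views_per_step, rng):
    """Pick a few views first, then spend the whole budget inside them.

    Hypothetical stand-in for a view-wise strategy, not the paper's code.
    """
    chosen = rng.sample(range(num_views), views_per_step)
    per_view = budget // views_per_step
    return [
        (v, rng.randrange(height), rng.randrange(width))
        for v in chosen
        for _ in range(per_view)
    ]


def points_per_touched_view(coords):
    """Average number of sampled points per view that received any sample."""
    views = {v for v, _, _ in coords}
    return len(coords) / len(views)


rng = random.Random(0)
uniform = uniform_sample(16, 64, 64, 1024, rng)
view_wise = view_wise_sample(16, 64, 64, 1024, 2, rng)

# Same budget, but the view-wise draw concentrates it on two views.
print(points_per_touched_view(uniform))
print(points_per_touched_view(view_wise))
```

With the same budget, the uniform draw spreads its points thinly over every view, while the view-wise draw covers each selected view far more densely. That density is what a model would need to see local detail; the global context lost by looking at only a few views is what additional guidance such as text is meant to supply.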

Text-to-Video Generation (our results, generated by scaled-up training on the WebVid dataset)


Prompt: Female violinist rehearsing with headphones at the microphone. 4k.
Prompt: Health, environment care for mother earth. the girl's hands are holding a tree sapling. growth and agriculture new life concept. plant and tree breeding. saving life. biological diversity of plants.
Prompt: Little girl plays in the children's room. the kid plays about and throws his things out of the box. daughter plays with clothes at home.

Prompt: Pile of old tvs and retro television with green screen. dolly out. green screen. 4k resolution.
Prompt: Sun light rays through under water's glittering. underwater scene full of bubbles up to sun.
Prompt: Hand in glove put covid 19 vaccination sign small shopping cart with vaccine ampoules on blue background soft focus.
Prompt: Star abstract retro tunnel loop neon glowing animation video template seamless loop.
Prompt: Hands of man placing components on pcb board. close up. zoom in. 4k resolution.
Prompt: Globe and mouse cursor on white background.

Text-to-Face Video (Visual Comparisons)

More detailed text descriptions are used for inference.

Prompt: She is blurry and young. This female begins with a sad expression, and she then is surprised, she eventually is sad. This woman closes eyes while singing for a moderate time.
VDM
CogVideo
sDFT (ours)

Prompt: He has high cheekbones. This man starts with an expression of surprise, and he then has an expression of disgust, he eventually has an expression of surprise. This man talks meanwhile wagging head for a long time.
VDM
CogVideo
sDFT (ours)

Prompt: He is young. He has blond hair. This man begins with a disgusted expression, and next he has a disgusted face, and then he is angry, and he then turns into a disgusted face and afterwards he turns happy, and afterwards he turns disgusted, and he then turns into a disgusted face, and he then turns into a angry expression, and he then has a disgusted expression and later on he turns into a angry face, and next he is disgusted, and he then turns into a angry face, and later on he turns into a disgusted face, and he then is angry, in the end, he turns into a disgusted expression. The male talks, frowns, while shaking head for a long time.
VDM
CogVideo
sDFT (ours)

Prompt: She is young. At the beginning, this female is surprised, and afterwards she turns into a happy expression, she finally turns surprised. She first nods and talks at the same time for some time, next she laughs for a moderate time.
VDM
CogVideo
sDFT (ours)

Prompt: This person has sideburns and beard. He is wearing goatee. Firstly, he gazes for a short time, and he then gazes for a short time, then he blinks for a short time, he finally blinks for a short time.
VDM
CogVideo
sDFT (ours)

Prompt: She has wavy hair and high cheekbones. To begin with, this female talks for a short time, and she then talks for a short time, next she talks for a short time, in the end, she talks for a short time.
VDM
CogVideo
sDFT (ours)

Prompt: A male is young. He has a big nose and high cheekbones. The man begins to talks for a short time, then he turns for a moderate time, in the end, he turns for a moderate time.
VDM
CogVideo
sDFT (ours)

3D Object Generation (Visual Comparisons)

[Five sets of comparison results, each showing GASP, GEM, and sDFT (ours).]

Acknowledgements

The website template was borrowed from Mip-NeRF 360.