sDFT: Scaling Diffusion Field Transformers on Images, Videos, and 3D Data
Abstract
Diffusion Probabilistic Field (DPF) models the distribution of continuous functions defined over metric spaces. While DPF shows great potential for unifying data generation across modalities, including images, videos, and 3D geometry, it does not scale to higher data resolutions. This can be attributed to the scaling property: it is difficult for the model to capture local structures through uniform sampling. To this end, we propose a new model comprising a view-wise sampling algorithm that focuses on local structure learning and additional guidance, e.g., text descriptions, that complements the global geometry. The model can be scaled to generate high-resolution data while unifying multiple modalities. Experimental results on data generation in various modalities demonstrate the effectiveness of our model, as well as its potential as a foundation framework for scalable, modality-unified visual content generation.
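The abstract contrasts uniform sampling with view-wise sampling but does not give the procedure. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes a field laid out as a stack of views (e.g., video frames) on a pixel grid, and that view-wise sampling means choosing a few views first and then drawing pixels inside them. The function names and all sizes are hypothetical.

```python
import random

def uniform_sample(num_views, height, width, k, rng):
    """Draw k distinct (view, row, col) coordinates uniformly over the whole field."""
    total = num_views * height * width
    flat = rng.sample(range(total), k)
    # Unflatten each index into (view, row, col).
    return [(i // (height * width), (i // width) % height, i % width) for i in flat]

def view_wise_sample(num_views, height, width, n_views, k_per_view, rng):
    """Assumed scheme: pick n_views views, then draw k_per_view pixels in each."""
    views = rng.sample(range(num_views), n_views)
    coords = []
    for v in views:
        flat = rng.sample(range(height * width), k_per_view)
        coords.extend((v, i // width, i % width) for i in flat)
    return coords

# Hypothetical sizes: 16 views of 64x64 pixels, 512 sampled points in both cases.
rng = random.Random(0)
uniform = uniform_sample(16, 64, 64, 512, rng)
view_wise = view_wise_sample(16, 64, 64, 2, 256, rng)
```

With the same budget of 512 points, the uniform sampler spreads them across all 16 views (about 32 per view on average), while the view-wise sampler places 256 in each of 2 views. The denser per-view coverage is what would let a model see local structure, at the cost of global context, which the abstract says is complemented by additional guidance such as text.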
Text-to-Video Generation (our results, generated by scaled-up training on the WebVid dataset)
Text-to-Face Video (Visual Comparisons)
More detailed text descriptions are used for inference.
Prompt: She is blurry and young. This female begins with a sad expression, and she then is surprised, she eventually is sad. This woman closes eyes while singing for a moderate time.
Prompt: He has high cheekbones. This man starts with an expression of surprise, and he then has an expression of disgust, he eventually has an expression of surprise. This man talks meanwhile wagging head for a long time.
Prompt: He is young. He has blond hair. This man begins with a disgusted expression, and next he has a disgusted face, and then he is angry, and he then turns into a disgusted face and afterwards he turns happy, and afterwards he turns disgusted, and he then turns into a disgusted face, and he then turns into a angry expression, and he then has a disgusted expression and later on he turns into a angry face, and next he is disgusted, and he then turns into a angry face, and later on he turns into a disgusted face, and he then is angry, in the end, he turns into a disgusted expression. The male talks, frowns, while shaking head for a long time.
Prompt: She is young. At the beginning, this female is surprised, and afterwards she turns into a happy expression, she finally turns surprised. She first nods and talks at the same time for some time, next she laughs for a moderate time.
Prompt: This person has sideburns and beard. He is wearing goatee. Firstly, he gazes for a short time, and he then gazes for a short time, then he blinks for a short time, he finally blinks for a short time.
Prompt: She has wavy hair and high cheekbones. To begin with, this female talks for a short time, and she then talks for a short time, next she talks for a short time, in the end, she talks for a short time.
Prompt: A male is young. He has a big nose and high cheekbones. The man begins to talks for a short time, then he turns for a moderate time, in the end, he turns for a moderate time.
3D Object Generation (Visual Comparisons)
Acknowledgements
The website template was borrowed from Mip-NeRF 360.