Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch

Last update: Dec 28, 2022

Overview

NÜWA - Pytorch (wip)

Implementation of NÜWA, a state-of-the-art attention network for text-to-video synthesis, in Pytorch. This repository will be populated if Microsoft does not open-source the code by the end of December. It may also contain an extension into video and audio, using a dual-decoder approach.


Citations

@misc{wu2021nuwa,
    title   = {N\"UWA: Visual Synthesis Pre-training for Neural visUal World creAtion}, 
    author  = {Chenfei Wu and Jian Liang and Lei Ji and Fan Yang and Yuejian Fang and Daxin Jiang and Nan Duan},
    year    = {2021},
    eprint  = {2111.12417},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

Question about generated videos?

The generated video contains a lot of negative numbers and very small decimals (like 5e-1), yet the loss decreases normally during training. Is that normal? How can I make the result viewable?
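A float tensor usually has to be mapped to 8-bit pixel values before it can be viewed. The sketch below is not taken from the repository. It assumes the generated video is a float tensor of shape (frames, channels, height, width), and because the value range the model works in is not stated here, it simply rescales by the observed minimum and maximum. `to_uint8_frames` is a hypothetical helper name.

```python
import torch

def to_uint8_frames(video: torch.Tensor) -> torch.Tensor:
    """Rescale a float video tensor to uint8 frames laid out as (frames, height, width, channels)."""
    video = video.detach().float().cpu()
    lo, hi = video.min(), video.max()
    # Min-max normalise to [0, 1]; the clamp guards against a constant tensor.
    scaled = (video - lo) / (hi - lo).clamp(min=1e-8)
    frames = (scaled * 255).round().to(torch.uint8)
    # Move channels last, the layout most image writers expect.
    return frames.permute(0, 2, 3, 1).contiguous()

# Stand-in for a generated video: values include negatives and small decimals.
fake_video = torch.randn(5, 3, 32, 32)
frames = to_uint8_frames(fake_video)
```

Each entry of `frames` can then be handed to an image library, for example `PIL.Image.fromarray(frames[0].numpy())`. If the model is known to emit values in a fixed range such as [-1, 1], a fixed rescale is preferable to min-max, since min-max changes brightness from clip to clip.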

opened by Fitzwong 0
Why the video does not pass through the encoder?

Hi lucidrains! Thanks for providing a great repo that makes the NUWA paper easier to understand.

I have a question. In the NUWA paper, the inputs of the Encoder are the caption tokens (caption condition) and the video tokens (3DNA condition). So, as I understand it, the video token sequence should fully self-attend in the Encoder, and the Encoder outputs then condition the Decoder. Is that right?

The Decoder you provide has causal self-attention and text conditioning, as expected. But by the definition in the paper, the condition contains both the text condition and the 3DNA condition, and both of them condition the Decoder. Is my understanding correct? I am just curious about the conditioning in the NUWA paper. The Encoder in your repo is only a text Encoder, so the video does not pass through the Encoder to condition the Decoder.

Looking forward to your reply! Thanks!

opened by Wang-Xiaodong1899 0
Questions about function forward() in NUWA please.
I'm confused that, in function forward() of class NUWA, the ground-truth video is fed to the transformer to calculate the output video, which is different from function generate().

frame_embeddings = self.video_transformer(
    frame_embeddings,  # calculated from ground-truth video
    context = text_embeds,
    context_mask = text_mask
)

So when training NUWA, the loss comes from the logits. But the logits depend not only on the text but also on the ground-truth video (a single transformer call, unlike the auto-regressive loop in generate()). Is that some kind of cheating during training? Or should I generate the logits in the same way as in generate(), and then calculate the loss to train?
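Feeding the ground-truth sequence during training is the usual teacher-forcing setup for auto-regressive models, and it only leaks information if a position can see the tokens it is supposed to predict. The sketch below is a toy single-head attention, not the repository's `video_transformer`. It assumes the decoder is causal, as described in the thread above, and shows that with a causal mask the output at a position does not change when later tokens change.

```python
import torch

torch.manual_seed(0)

def causal_self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention where position i only sees positions <= i."""
    seq_len, dim = x.shape
    scores = x @ x.t() / dim ** 0.5
    # True above the diagonal marks "future" positions to hide.
    future = torch.ones(seq_len, seq_len, dtype=torch.bool).triu(diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return scores.softmax(dim=-1) @ x

tokens = torch.randn(6, 8)  # stand-in for ground-truth token embeddings
out = causal_self_attention(tokens)

# Change only the last two (future) tokens and run again.
changed = tokens.clone()
changed[4:] = torch.randn(2, 8)
out_changed = causal_self_attention(changed)

# Positions 0..3 never saw the changed tokens, so their outputs match.
same_prefix = torch.allclose(out[:4], out_changed[:4])
```

Under that assumption, one parallel pass over the ground-truth sequence computes the same per-position predictions that a step-by-step loop would compute when fed the same prefix, which is why training does not need to call the sampling loop.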
opened by Fitzwong 1
Type of dataset for training VQ-GAN

Hi,

First, thanks a lot for the amazing work! I have one question regarding the training of the VQ-GAN: do you recommend training it on a dataset similar to the one the NUWA model will be trained on? What I mean is, if I want to train NUWA to generate sports videos from text, do I also need to train the VQ-GAN on a sports dataset?

Thanks a lot

opened by antonibigata 0
Pseudocode for 3DNA?

me no comprendai le complex einops 😢

Can someone give the 3DNA pseudocode to illustrate what's going on 🤗
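As a rough illustration of the idea behind 3D nearby attention: each token in a (frames, height, width) grid attends only to tokens inside a small window around its own coordinate, instead of the whole grid. The sketch below is one simplified reading, not the repository's code. The repository batches this with unfold and einops and may define the window differently, for example extending only into the past. The function name and its parameters are hypothetical.

```python
from itertools import product

def nearby_positions(query, grid, extent, causal=True):
    """Return the (t, h, w) positions a query token may attend to.

    query  -- (t, h, w) coordinate of the query token
    grid   -- (T, H, W) size of the token grid
    extent -- (et, eh, ew) odd window sizes along each axis
    causal -- if True, drop positions that come after the query in
              raster order (frame by frame, row by row)
    """
    query = tuple(query)
    ranges = []
    for q, size, e in zip(query, grid, extent):
        half = e // 2
        # Clip the window at the grid border.
        ranges.append(range(max(0, q - half), min(size, q + half + 1)))
    neighbours = list(product(*ranges))
    if causal:
        # Tuple comparison matches raster order for (t, h, w).
        neighbours = [p for p in neighbours if p <= query]
    return neighbours

# A token in the middle of a 4 x 8 x 8 grid with a 3 x 3 x 3 window.
full = nearby_positions((2, 4, 4), (4, 8, 8), (3, 3, 3), causal=False)
past = nearby_positions((2, 4, 4), (4, 8, 8), (3, 3, 3), causal=True)
```

Attention would then apply softmax over the query-key scores of only these positions, so the cost per token scales with the window size rather than with the full T x H x W grid.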

(Also how did lucidrains bang out thousands of lines of code in a few weeks - is he confirmed to be human? 🤔)

opened by neel04 4

Releases(0.7.7a)

0.7.7a (Aug 14, 2022)
0.7.7 (Aug 14, 2022)
0.7.6 (Apr 28, 2022)
0.7.5 (Apr 28, 2022)
0.7.4 (Apr 27, 2022)
0.7.3 (Apr 22, 2022)
0.7.2 (Apr 7, 2022)
0.7.1 (Mar 24, 2022)
0.7.0 (Mar 24, 2022)
0.6.4 (Mar 15, 2022)
0.6.3 (Mar 15, 2022)
0.6.2 (Mar 15, 2022)
0.6.1 (Mar 15, 2022)
0.6.0 (Mar 15, 2022)
0.5.15 (Mar 12, 2022)
0.5.14 (Mar 12, 2022)
0.5.12 (Mar 12, 2022)
0.5.11 (Mar 12, 2022)
0.5.10 (Mar 11, 2022)
0.5.9 (Mar 11, 2022)
0.5.8 (Mar 11, 2022)
0.5.7 (Mar 11, 2022)
0.5.6 (Mar 11, 2022)
0.5.5 (Mar 11, 2022)
0.5.4 (Mar 11, 2022)
0.5.3 (Mar 10, 2022)
0.5.2 (Mar 10, 2022)
0.5.1 (Mar 10, 2022)
0.5.0 (Mar 10, 2022)
0.4.33 (Mar 10, 2022)

Owner

Phil Wang

Working with Attention. It's all we need

