
LSR: Learned Spatial Representations for Few-shot Talking-Head Synthesis

Official code release for LSR. For technical details, please refer to:

Learned Spatial Representations for Few-shot Talking-Head Synthesis.
Moustafa Meshry, Saksham Suri, Larry S. Davis, Abhinav Shrivastava
In International Conference on Computer Vision (ICCV), 2021.

Paper | Project page | Video

If you find this code useful, please consider citing:

@inproceedings{meshry2021step,
  title = {Learned Spatial Representations for Few-shot Talking-Head Synthesis},
  author = {Meshry, Moustafa and
          Suri, Saksham and
          Davis, Larry S. and
          Shrivastava, Abhinav},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year = {2021}
}

Environment setup

The code was built using tensorflow 2.2.0, cuda 10.1.243, and cudnn v7.6.5, but should be compatible with more recent tensorflow releases and cuda versions. To set up a virtual environment for the code, follow the instructions below.

  • Create a new conda environment
conda create -n lsr python=3.6
  • Activate the lsr environment
conda activate lsr
  • Set up the prerequisites
pip install -r requirements.txt
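
To confirm that TensorFlow is installed and can see the GPU, here is a quick sanity check you can run from inside the lsr environment (a generic check, not part of the repository):

import tensorflow as tf

print(tf.__version__)                          # expect 2.2.0 or a compatible later release
print(tf.config.list_physical_devices('GPU'))  # should list at least one GPU if CUDA is set up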

Run a pre-trained model

  • Download our pre-trained model and extract it to ./_trained_models/meta_learning
  • To run inference for a test identity, execute the following command:
python main.py \
    --train_dir=_trained_models/meta_learning \
    --run_mode=infer \
    --K=1 \
    --source_subject_dir=_datasets/sample_fsth_eval_subset_processed/train/id00017/OLguY5ofUrY/combined \
    --driver_subject_dir=_datasets/sample_fsth_eval_subset_processed/test/id00017/OLguY5ofUrY/combined \
    --few_shot_finetuning=false 

where --K specifies the number of few-shot inputs, --few_shot_finetuning specifies whether or not to fine-tune the meta-learned model on the K-shot inputs, and --source_subject_dir and --driver_subject_dir specify the source identity and driver sequence data respectively. Each output image contains a tuple of 5 images representing the following (concatenated along the width):

  • The input facial landmarks for the target view.
  • The output discrete layout of our model, visualized in RGB.
  • The oracle segmentation map produced by an off-the-shelf segmentation model (i.e. the pseudo ground truth), visualized in RGB.
  • The final output of our model.
  • The ground truth image of the driver subject.

A sample tuple is shown below.

[Figure: sample output tuple showing input landmarks, output spatial map, oracle segmentation, output, and ground truth]
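
Since each saved result concatenates the five images along the width, a result can be split back into its panels with a few lines of Python. This is a minimal sketch, not part of the repository; the file name output.png is hypothetical, and all five panels are assumed to be equal width:

from PIL import Image

img = Image.open('output.png')  # hypothetical path to one saved result
w, h = img.size
panel_w = w // 5                # five equal-width panels, in the order listed above
names = ['landmarks', 'spatial_map', 'oracle_seg', 'output', 'ground_truth']
for i, name in enumerate(names):
    img.crop((i * panel_w, 0, (i + 1) * panel_w, h)).save(f'{name}.png')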


Test data and pre-computed outputs

Our model is trained on the train split of the VoxCeleb2 dataset. The data used for evaluation is adopted from the "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models" paper (Zakharov et al., 2019), and can be downloaded from the link provided by the authors of that paper.

The test data contains 1600 images of 50 test identities (not seen by the model during training). Each identity has 32 input frames and 32 hold-out frames. The K-shot inputs to the model are uniformly sampled from the 32-frame input set. If subject fine-tuning is turned on, the model is fine-tuned on the K-shot inputs; the 32 hold-out frames are never shown to the fine-tuned model. For more details about the test data, refer to the aforementioned paper (and our paper).

To facilitate comparison to our method, we provide a link with our pre-computed outputs on the test subset for K={1, 4, 8, 32}, for both the subject-agnostic (meta-learned) and subject-finetuned models. For more details, please refer to the README file associated with the released outputs. Alternatively, you can run our pre-trained model on your own data, or re-train our model by following the instructions for training, inference and dataset preparation.
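
The uniform sampling of the K-shot inputs can be reproduced with a short numpy snippet. This is a plausible reading of the sampling scheme, not the repository's exact selection code:

import numpy as np

K = 4            # number of few-shot inputs
num_inputs = 32  # size of the input frame set
# Spread K indices evenly across the 32 input frames.
indices = np.linspace(0, num_inputs - 1, num=K).round().astype(int)
print(indices)   # [ 0 10 21 31]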

Dataset pre-processing

The dataset preprocessing has the following steps:

  1. Facial landmark generation
  2. Face parsing
  3. Converting the VoxCeleb2 dataset to tfrecords (for training).

We provide details for each of these steps.

Facial Landmark Generation

The landmark generation script, preprocessing/landmarks/release_landmark.py, takes the following arguments:

  1. data_dir: Path to a directory containing the data to be processed.
  2. output_dir: Path to the output directory where the processed data should be saved.
  3. k: Sampling rate for frames from video (default: 10).
  4. mode: Set to videos or images depending on whether the input data is video files or already-extracted frames.

Here are example commands that process the sample data provided with this repository:

Note: Make sure the folders only contain the videos or images that are to be processed.

  • Generate facial landmarks for sample VoxCeleb2 test videos.
python preprocessing/landmarks/release_landmark.py \
    --data_dir=_datasets/sample_test_videos \
    --output_dir=_datasets/sample_test_videos_processed \
    --mode=videos

To process the full dev and test subsets of the VoxCeleb2 dataset, run the above command twice, setting --data_dir to point to the downloaded dev and test splits respectively.

  • Generate facial landmarks for the train portion of the sample evaluation subset.
python preprocessing/landmarks/release_landmark.py \
    --data_dir=_datasets/sample_fsth_eval_subset/train \
    --output_dir=_datasets/sample_fsth_eval_subset_processed/train \
    --mode=images
  • Generate facial landmarks for the test portion of the sample evaluation subset.
python preprocessing/landmarks/release_landmark.py \
    --data_dir=_datasets/sample_fsth_eval_subset/test \
    --output_dir=_datasets/sample_fsth_eval_subset_processed/test \
    --mode=images

To process the full evaluation subset, download the evaluation subset, and run the above commands on the train and test portions of it.

Face Parsing

The face parsing step generates the oracle segmentation maps. It uses the face parser from the CelebAMask-HQ GitHub repository.

To set it up, follow the instructions below, and refer to the CelebAMask-HQ GitHub repository for guidance.

mkdir third_party
git clone https://github.com/switchablenorms/CelebAMask-HQ.git third_party
cp preprocessing/segmentation/* third_party/face_parsing/.

To process the sample data provided with this repository, run the following commands.

  • Generate oracle segmentations for sample VoxCeleb2 videos.
python -u third_party/face_parsing/generate_oracle_segmentations.py \
    --batch_size=1 \
    --test_image_path=_datasets/sample_test_videos_processed
  • Generate oracle segmentations for the train portion of the sample evaluation subset.
python -u third_party/face_parsing/generate_oracle_segmentations.py \
    --batch_size=1 \
    --test_image_path=_datasets/sample_fsth_eval_subset_processed/train
  • Generate oracle segmentations for the test portion of the sample evaluation subset.
python -u third_party/face_parsing/generate_oracle_segmentations.py \
    --batch_size=1 \
    --test_image_path=_datasets/sample_fsth_eval_subset_processed/test

Converting VoxCeleb2 to tfrecords

To re-train our model, you'll need to export the VoxCeleb2 dataset to the tfrecord format. After downloading the VoxCeleb2 dataset and generating the facial landmarks and segmentations for it, run the following command to export the data to tfrecords.

python data/export_voxceleb_to_tfrecords.py \
  --dataset_parent_dir=<path/to/processed/voxceleb2> \
  --output_parent_dir=<path/to/output/tfrecords> \
  --subset=dev \
  --num_shards=1000

For example, the command to convert the sample data provided with this repository is:

python data/export_voxceleb_to_tfrecords.py \
  --dataset_parent_dir=_datasets/sample_fsth_eval_subset_processed \
  --output_parent_dir=_datasets/sample_fsth_eval_subset_processed/tfrecords \
  --subset=test \
  --num_shards=1
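
To sanity-check the exported shards, you can list the feature keys of a single record with generic TensorFlow code. The glob pattern below is an assumption about the shard naming; the actual feature names are defined by export_voxceleb_to_tfrecords.py:

import tensorflow as tf

# Adjust the pattern to match the actual shard file names.
files = tf.io.gfile.glob('_datasets/sample_fsth_eval_subset_processed/tfrecords/*')
dataset = tf.data.TFRecordDataset(files)
for raw_record in dataset.take(1):
    example = tf.train.Example.FromString(raw_record.numpy())
    print(list(example.features.feature.keys()))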

Training

Training consists of two stages: first, we bootstrap the training of the layout generator by training it to predict a segmentation map for the target view. Second, we turn off the semantic segmentation loss and train our full pipeline. Our code assumes the training data is in tfrecord format (see the previous instructions for dataset preparation).

After you have generated the dev and test tfrecords of the VoxCeleb2 dataset, you can run the training as follows:

  • Run the layout pre-training step: execute the following command
sh scripts/train_lsr_pretrain.sh
  • Train the full pipeline: after pre-training is complete, run the following command
sh scripts/train_lsr_meta_learning.sh

Please refer to the training scripts for details about the different training configurations and how to set the correct flags for your training data.
